Bug #20670

closed

OSD suicide on msgr exceeding fd limit

Added by red ref almost 7 years ago. Updated about 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: Yes
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a fresh 12.1.0 install (and then on 12.1.0-707), with bluestore + fuse + cephfs + ec_overwrites, I saw some OSDs flapping under write pressure.

Stack trace:

    -2> <date> 7fe375393700  0 -- <osd_ip>:6821/53900 >> <ip>:0/1074864065 conn(0x7fe3bc26d800 :6821 s=STATE_OPEN pgs=1126 cs=1 l=1).process bad tag 50
    -1> <date> 7fe375393700  0 -- <osd_ip>:6821/53900 >> <ip>:0/1074864065 conn(0x7fe3a3e6d000 :6821 s=STATE_OPEN pgs=1128 cs=1 l=1).process bad tag 50
     0> <date> 7fe375393700 -1 *** Caught signal (Aborted) **
 in thread 7fe375393700 thread_name:msgr-worker-0

 ceph version 12.1.0-707-g5a197c5 (5a197c5817f591fc514f55b9929982e90d90084e) luminous (rc)
 1: (()+0x9f2f71) [0x7fe37b24ef71]
 2: (()+0xf370) [0x7fe377e8c370]
 3: (gsignal()+0x37) [0x7fe376eb61d7]
 4: (abort()+0x148) [0x7fe376eb78c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fe3777ba9d5]
 6: (()+0x5e946) [0x7fe3777b8946]
 7: (()+0x5e973) [0x7fe3777b8973]
 8: (()+0xb52c5) [0x7fe37780f2c5]
 9: (()+0x7dc5) [0x7fe377e84dc5]
 10: (clone()+0x6d) [0x7fe376f7876d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   2/ 2 mds
   1/ 5 mds_balancer
...

This looks like a legitimate suicide on the OSD side, referring to src/msg/async/AsyncConnection.cc.

I tried to find the root cause, without success.

Actions #1

Updated by Greg Farnum almost 7 years ago

  • Status changed from New to Need More Info

Can you install the debug packages to get symbols? This is pretty unintelligible without them. :(

Actions #2

Updated by red ref over 6 years ago

In the meantime, I found the root cause by running ceph-fuse in interactive mode (not via fstab), where I got "Too many open files" messages. Raising the limits (ulimit) made the messages go away and fixed the OSD problem.
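For anyone hitting the same message, here is a minimal Python sketch of what raising the open-file limit amounts to at the process level. It is not taken from Ceph; the helper name `raise_nofile_soft_limit` is made up for illustration, and it only affects the calling process and anything that process starts afterwards, much like `ulimit -n` in a shell.

```python
import resource


def raise_nofile_soft_limit(target):
    """Raise this process's soft open-file limit toward `target`.

    Hypothetical helper: the soft limit is never lowered and never set
    above the hard limit. Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unlimited soft limit needs no change.
    if soft == resource.RLIM_INFINITY:
        return soft, hard
    # Clamp the request to the hard limit unless the hard limit is unlimited.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if wanted > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Going above the hard limit needs extra privileges or a system-level configuration change, which this sketch does not attempt.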

Looking at /proc/<pid>/fd/, the number of file descriptors slowly climbs toward the number of OSDs during operations (which is more than my previous limit).

I will get back soon with a stack trace anyway.
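The check described above can be scripted. This is a rough sketch that assumes a Linux-style /proc layout; the function names and the `proc_root` parameter are hypothetical and exist only so the code can be pointed at a different directory. It counts the entries in /proc/<pid>/fd and reads the soft "Max open files" value from /proc/<pid>/limits.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count the file descriptors currently open in process `pid`."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def soft_nofile_limit(pid, proc_root="/proc"):
    """Return the soft 'Max open files' limit of `pid`, or None if unlimited or not found."""
    with open(os.path.join(proc_root, str(pid), "limits")) as f:
        for line in f:
            if line.startswith("Max open files"):
                # Fields after the three-word name: soft limit, hard limit, units.
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid, proc_root="/proc"):
    """Return (open_fds, soft_limit) for `pid`."""
    return count_open_fds(pid, proc_root), soft_nofile_limit(pid, proc_root)
```

Calling `fd_usage(pid)` periodically while the workload runs should show the first number approaching the second before the "Too many open files" errors start. Reading another user's /proc/<pid>/fd usually requires running as that user or as root.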

Actions #3

Updated by Greg Farnum over 6 years ago

  • Subject changed from OSD suicide on msgr bug (fuse client). to OSD suicide on msgr exceeding fd limit

Actions #4

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)

Actions #5

Updated by Sage Weil about 5 years ago

  • Status changed from Need More Info to Closed