Bug #15813: Assertion 0.94.5 - ceph-osd (closed)

Added by Torsten Urbas almost 8 years ago. Updated almost 7 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are running ceph 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) on a 4.5.2 kernel. Our cluster currently consists of 5 nodes with 6 OSDs each.

Last night we observed a single OSD (osd.11) die with an assertion:

2016-05-10 03:16:30.718936 7fa5166dc700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fa5166dc700 time 2016-05-10 03:16:30.688044
common/Mutex.cc: 100: FAILED assert(r == 0)

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0xb34520]
 2: (Mutex::Lock(bool)+0x105) [0xae2395]
 3: (DispatchQueue::discard_queue(unsigned long)+0x37) [0xbeff67]
 4: (Pipe::fault(bool)+0x426) [0xc16256]
 5: (Pipe::reader()+0x3f2) [0xc1d752]
 6: (Pipe::Reader::entry()+0xd) [0xc2880d]
 7: (()+0x7474) [0x7fa54c70b474]
 8: (clone()+0x6d) [0x7fa54ac01acd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
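
For context: Ceph's Mutex wrapper asserts that pthread_mutex_lock() returned 0, so this failure means the underlying lock call returned an error (it can, for example, return EINVAL when the mutex has already been destroyed by another thread). A minimal sketch of the pattern, not the exact Ceph source:

#include <pthread.h>
#include <cassert>

// Minimal sketch of the locking pattern behind "FAILED assert(r == 0)":
// pthread_mutex_lock() is expected to return 0, and any non-zero return
// value trips the assertion, which in turn aborts the process.
class Mutex {
  pthread_mutex_t _m;
public:
  Mutex()  { pthread_mutex_init(&_m, nullptr); }
  ~Mutex() { pthread_mutex_destroy(&_m); }
  void Lock() {
    int r = pthread_mutex_lock(&_m);
    assert(r == 0);   // the check reported at common/Mutex.cc:100
  }
  void Unlock() {
    int r = pthread_mutex_unlock(&_m);
    assert(r == 0);
  }
};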

--- begin dump of recent events ---
   -74> 2016-05-09 11:56:56.015711 7fa54d1fc7c0  5 asok(0x5644000) register_command perfcounters_dump hook 0x5624030
   -73> 2016-05-09 11:56:56.015739 7fa54d1fc7c0  5 asok(0x5644000) register_command 1 hook 0x5624030
   -72> 2016-05-09 11:56:56.015745 7fa54d1fc7c0  5 asok(0x5644000) register_command perf dump hook 0x5624030
   -71> 2016-05-09 11:56:56.015751 7fa54d1fc7c0  5 asok(0x5644000) register_command perfcounters_schema hook 0x5624030
   -70> 2016-05-09 11:56:56.015756 7fa54d1fc7c0  5 asok(0x5644000) register_command 2 hook 0x5624030
   -69> 2016-05-09 11:56:56.015758 7fa54d1fc7c0  5 asok(0x5644000) register_command perf schema hook 0x5624030
   -68> 2016-05-09 11:56:56.015763 7fa54d1fc7c0  5 asok(0x5644000) register_command perf reset hook 0x5624030
   -67> 2016-05-09 11:56:56.015766 7fa54d1fc7c0  5 asok(0x5644000) register_command config show hook 0x5624030
   -66> 2016-05-09 11:56:56.015770 7fa54d1fc7c0  5 asok(0x5644000) register_command config set hook 0x5624030
   -65> 2016-05-09 11:56:56.015773 7fa54d1fc7c0  5 asok(0x5644000) register_command config get hook 0x5624030
   -64> 2016-05-09 11:56:56.015776 7fa54d1fc7c0  5 asok(0x5644000) register_command config diff hook 0x5624030
   -63> 2016-05-09 11:56:56.015779 7fa54d1fc7c0  5 asok(0x5644000) register_command log flush hook 0x5624030
   -62> 2016-05-09 11:56:56.015783 7fa54d1fc7c0  5 asok(0x5644000) register_command log dump hook 0x5624030
   -61> 2016-05-09 11:56:56.015786 7fa54d1fc7c0  5 asok(0x5644000) register_command log reopen hook 0x5624030
   -60> 2016-05-09 11:56:56.017553 7fa54d1fc7c0  0 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43), process ceph-osd, pid 76017
   -59> 2016-05-09 11:56:56.027154 7fa54d1fc7c0  0 filestore(/var/lib/ceph/osd/ceph-11) backend xfs (magic 0x58465342)
   -58> 2016-05-09 11:56:56.028635 7fa54d1fc7c0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is supported and appears to work
   -57> 2016-05-09 11:56:56.028644 7fa54d1fc7c0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
   -56> 2016-05-09 11:56:56.042822 7fa54d1fc7c0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
   -55> 2016-05-09 11:56:56.043047 7fa54d1fc7c0  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature: extsize is supported and kernel 4.5.2-1-ARCH >= 3.5
   -54> 2016-05-09 11:56:56.109483 7fa54d1fc7c0  0 filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
   -53> 2016-05-09 11:56:56.110110 7fa54d1fc7c0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
   -52> 2016-05-09 11:56:56.110825 7fa54d1fc7c0  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
   -51> 2016-05-09 11:56:56.114886 7fa54d1fc7c0  0 osd.11 9819 crush map has features 283675107524608, adjusting msgr requires for clients
   -50> 2016-05-09 11:56:56.114895 7fa54d1fc7c0  0 osd.11 9819 crush map has features 283675107524608 was 8705, adjusting msgr requires for mons
   -49> 2016-05-09 11:56:56.114899 7fa54d1fc7c0  0 osd.11 9819 crush map has features 283675107524608, adjusting msgr requires for osds
   -48> 2016-05-09 11:56:56.114919 7fa54d1fc7c0  0 osd.11 9819 load_pgs
   -47> 2016-05-09 11:56:57.000584 7fa54d1fc7c0  0 osd.11 9819 load_pgs opened 55 pgs
   -46> 2016-05-09 11:56:57.000991 7fa54d1fc7c0 -1 osd.11 9819 log_to_monitors {default=true}
   -45> 2016-05-09 11:56:57.052319 7fa54d1fc7c0  0 osd.11 9819 done with init, starting boot process
   -44> 2016-05-09 11:57:01.103141 7fa50d84e700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::3]:6804/75598 pipe(0x11152000 sd=80 :6806 s=0 pgs=0 cs=0 l=0 c=0x10ec6580).accept connect_seq 0 vs existing 0 state connecting
   -43> 2016-05-09 12:11:57.005225 7fa51a614700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6800/20004064 pipe(0x104da000 sd=33 :6098 s=2 pgs=934 cs=1 l=0 c=0x97de700).fault with nothing to send, going to standby
   -42> 2016-05-09 12:11:57.115818 7fa5188f7700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6807/20003350 pipe(0x10a13000 sd=68 :20154 s=2 pgs=1642 cs=1 l=0 c=0x1054e100).fault with nothing to send, going to standby
   -41> 2016-05-09 12:11:58.959851 7fa519806700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6817/21003580 pipe(0x10845000 sd=47 :52276 s=2 pgs=1636 cs=1 l=0 c=0x1054d200).fault with nothing to send, going to standby
   -40> 2016-05-09 12:12:01.173115 7fa50d84e700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::3]:6804/75598 pipe(0x11152000 sd=80 :6806 s=2 pgs=25 cs=1 l=0 c=0x10ec5c80).fault with nothing to send, going to standby
   -39> 2016-05-09 12:19:10.564593 7fa539988700  0 -- [fd00:2380:0:20::3]:6806/76017 >> [fd00:2380:0:20::4]:0/1580628965 pipe(0x1285d000 sd=193 :6806 s=0 pgs=0 cs=0 l=0 c=0x10ec4d80).accept peer addr is really [fd00:2380:0:20::4]:0/1580628965 (socket is [fd00:2380:0:20::4]:29739/0)
   -38> 2016-05-09 12:59:17.227432 7fa509ea6700  0 -- [fd00:2380:0:20::3]:6806/76017 >> [fd00:2380:0:20::1]:0/2953656339 pipe(0x12122000 sd=271 :6806 s=0 pgs=0 cs=0 l=0 c=0x12549080).accept peer addr is really [fd00:2380:0:20::1]:0/2953656339 (socket is [fd00:2380:0:20::1]:41764/0)
   -37> 2016-05-10 02:35:46.567783 7fa52052d700  0 log_channel(cluster) log [INF] : 3.1db scrub starts
   -36> 2016-05-10 02:35:51.683231 7fa520d2e700  0 log_channel(cluster) log [INF] : 3.1db scrub ok
   -35> 2016-05-10 02:36:35.583938 7fa51fd2c700  0 log_channel(cluster) log [INF] : 3.2c scrub starts
   -34> 2016-05-10 02:36:40.647568 7fa520d2e700  0 log_channel(cluster) log [INF] : 3.2c scrub ok

At this point, the primary monitor marked the OSD as "down".

   -33> 2016-05-10 03:01:54.554677 7fa519b09700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6800/4229 pipe(0x10828000 sd=52 :6696 s=2 pgs=75 cs=1 l=0 c=0x1054ca80).fault with nothing to send, going to standby
   -32> 2016-05-10 03:01:54.554757 7fa51a715700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6813/22004022 pipe(0x104df000 sd=34 :4162 s=2 pgs=1199 cs=1 l=0 c=0x97deb80).fault with nothing to send, going to standby
   -31> 2016-05-10 03:01:54.554933 7fa5189f8700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6800/20003787 pipe(0x10a25000 sd=67 :42436 s=2 pgs=1845 cs=1 l=0 c=0x1054ee80).fault with nothing to send, going to standby
   -30> 2016-05-10 03:01:54.555193 7fa51a412700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6810/71171 pipe(0x1052c000 sd=40 :9588 s=2 pgs=59 cs=1 l=0 c=0x97dea00).fault with nothing to send, going to standby
   -29> 2016-05-10 03:01:54.556082 7fa5185f4700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6804/3787 pipe(0x10a82000 sd=77 :16254 s=2 pgs=1053 cs=1 l=0 c=0x1054fd80).fault with nothing to send, going to standby
   -28> 2016-05-10 03:01:54.556134 7fa51a816700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6800/3315 pipe(0x104b9000 sd=32 :9106 s=2 pgs=627 cs=1 l=0 c=0x97de280).fault with nothing to send, going to standby
   -27> 2016-05-10 03:01:54.556252 7fa51a917700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6806/4027 pipe(0x5bde000 sd=28 :12780 s=2 pgs=1145 cs=1 l=0 c=0x97dd800).fault with nothing to send, going to standby
   -26> 2016-05-10 03:01:54.557199 7fa5186f5700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6803/126743 pipe(0x10a6c000 sd=72 :43264 s=2 pgs=62 cs=1 l=0 c=0x1054f900).fault with nothing to send, going to standby
   -25> 2016-05-10 03:01:54.557588 7fa519a08700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6808/5200 pipe(0x10840000 sd=54 :36606 s=2 pgs=51 cs=1 l=0 c=0x1054cf00).fault with nothing to send, going to standby
   -24> 2016-05-10 03:01:54.557850 7fa518cfb700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6806/4952 pipe(0x10531000 sd=65 :42550 s=2 pgs=61 cs=1 l=0 c=0x97dfd80).fault with nothing to send, going to standby
   -23> 2016-05-10 03:01:54.558839 7fa519604700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6801/21003546 pipe(0x10978000 sd=27 :52456 s=2 pgs=937 cs=1 l=0 c=0x1054d680).fault with nothing to send, going to standby
   -22> 2016-05-10 03:01:54.558903 7fa519402700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6811/102397 pipe(0x1097d000 sd=38 :36150 s=2 pgs=62 cs=1 l=0 c=0x1054db00).fault with nothing to send, going to standby
   -21> 2016-05-10 03:01:54.559397 7fa518bfa700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6802/93235 pipe(0x10a0e000 sd=48 :6828 s=2 pgs=54 cs=1 l=0 c=0x1054de00).fault with nothing to send, going to standby
   -20> 2016-05-10 03:01:54.559578 7fa51a311700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6812/22043768 pipe(0x1050e000 sd=29 :52358 s=2 pgs=830 cs=1 l=0 c=0x97de580).fault with nothing to send, going to standby
   -19> 2016-05-10 03:01:54.560105 7fa51a513700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6813/22079394 pipe(0x104fc000 sd=35 :3652 s=2 pgs=855 cs=1 l=0 c=0x97ddf80).fault with nothing to send, going to standby
   -18> 2016-05-10 03:01:54.560870 7fa519c0a700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6809/22003317 pipe(0x106ff000 sd=50 :32878 s=2 pgs=1407 cs=1 l=0 c=0x1054c600).fault with nothing to send, going to standby
   -17> 2016-05-10 03:01:54.561127 7fa51aa18700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6802/4449 pipe(0x104b4000 sd=31 :30636 s=2 pgs=111 cs=1 l=0 c=0x97ddc80).fault with nothing to send, going to standby
   -16> 2016-05-10 03:01:54.562713 7fa5187f6700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6808/63020 pipe(0x10a20000 sd=64 :7038 s=2 pgs=608 cs=1 l=0 c=0x1054e400).fault with nothing to send, going to standby
   -15> 2016-05-10 03:01:54.564040 7fa51a210700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6802/3556 pipe(0x10501000 sd=30 :41532 s=2 pgs=1367 cs=1 l=0 c=0x97df180).fault with nothing to send, going to standby
   -14> 2016-05-10 03:01:54.564845 7fa51a00e700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6809/21003823 pipe(0x106fa000 sd=41 :37786 s=2 pgs=1743 cs=1 l=0 c=0x1054c180).fault with nothing to send, going to standby
   -13> 2016-05-10 03:02:19.610294 7fa519e0c700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::1]:6800/4229 pipe(0x10828000 sd=19 :7102 s=1 pgs=75 cs=1 l=0 c=0x1054ca80).connect got RESETSESSION
   -12> 2016-05-10 03:02:19.610500 7fa518dfc700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6803/126743 pipe(0x10a6c000 sd=17 :43648 s=1 pgs=62 cs=1 l=0 c=0x1054f900).connect got RESETSESSION
   -11> 2016-05-10 03:02:20.509905 7fa519907700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6817/21003580 pipe(0x10845000 sd=22 :52680 s=1 pgs=1636 cs=1 l=0 c=0x1054d200).connect got RESETSESSION
   -10> 2016-05-10 03:02:20.510001 7fa518ffe700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6800/20003787 pipe(0x10a25000 sd=23 :42830 s=1 pgs=1845 cs=1 l=0 c=0x1054ee80).connect got RESETSESSION
    -9> 2016-05-10 03:02:21.372603 7fa51a10f700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6809/21003823 pipe(0x106fa000 sd=25 :38202 s=1 pgs=1743 cs=1 l=0 c=0x1054c180).connect got RESETSESSION
    -8> 2016-05-10 03:02:21.372712 7fa519705700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6801/21003546 pipe(0x10978000 sd=27 :52864 s=1 pgs=937 cs=1 l=0 c=0x1054d680).connect got RESETSESSION
    -7> 2016-05-10 03:02:21.392847 7fa51a00e700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6801/21003546 pipe(0x10978000 sd=27 :52864 s=2 pgs=941 cs=1 l=0 c=0x1054d680).fault, initiating reconnect
    -6> 2016-05-10 03:02:21.393394 7fa519705700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6801/21003546 pipe(0x10978000 sd=27 :52868 s=1 pgs=941 cs=2 l=0 c=0x1054d680).connect got RESETSESSION
    -5> 2016-05-10 03:02:21.393697 7fa51b11f700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::4]:6800/3315 pipe(0x104b9000 sd=32 :9540 s=1 pgs=627 cs=1 l=0 c=0x97de280).connect got RESETSESSION
    -4> 2016-05-10 03:02:24.368774 7fa519806700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6800/20003787 pipe(0x10a25000 sd=23 :42830 s=2 pgs=1848 cs=1 l=0 c=0x1054ee80).fault, initiating reconnect
    -3> 2016-05-10 03:02:24.368786 7fa5189f8700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6817/21003580 pipe(0x10845000 sd=22 :52680 s=2 pgs=1641 cs=1 l=0 c=0x1054d200).fault, initiating reconnect
    -2> 2016-05-10 03:02:24.369396 7fa518ffe700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::5]:6800/20003787 pipe(0x10a25000 sd=23 :42842 s=1 pgs=1848 cs=2 l=0 c=0x1054ee80).connect got RESETSESSION
    -1> 2016-05-10 03:02:24.369480 7fa519907700  0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::2]:6817/21003580 pipe(0x10845000 sd=22 :52692 s=1 pgs=1641 cs=2 l=0 c=0x1054d200).connect got RESETSESSION

After several minutes, osd.11 was marked OUT by mon.0, resulting in some data movement.

2016-05-10 03:07:01.826746 mon.0 [fd00:2380:0:20::c9]:6789/0 11317108 : cluster [INF] osd.11 out (down for 307.274008)
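
The roughly 307-second gap between "down" and "out" is consistent with a down-out interval of about 300 seconds. As a sketch, one way to confirm the value actually in effect (run on the host that holds mon.0's admin socket):

# query the monitor's admin socket for the effective setting
ceph daemon mon.0 config get mon_osd_down_out_interval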

The next thing we see is osd.11 terminating because of an assertion.
     0> 2016-05-10 03:16:30.718936 7fa5166dc700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fa5166dc700 time 2016-05-10 03:16:30.688044
common/Mutex.cc: 100: FAILED assert(r == 0)

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0xb34520]
 2: (Mutex::Lock(bool)+0x105) [0xae2395]
 3: (DispatchQueue::discard_queue(unsigned long)+0x37) [0xbeff67]
 4: (Pipe::fault(bool)+0x426) [0xc16256]
 5: (Pipe::reader()+0x3f2) [0xc1d752]
 6: (Pipe::Reader::entry()+0xd) [0xc2880d]
 7: (()+0x7474) [0x7fa54c70b474]
 8: (clone()+0x6d) [0x7fa54ac01acd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   0/ 0 ms
   1/ 5 mon
   0/ 0 monc
   1/ 5 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.11.log
--- end dump of recent events ---
2016-05-10 03:16:30.721888 7fa5166dc700 -1 *** Caught signal (Aborted) **
 in thread 7fa5166dc700

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-osd() [0xa36cfd]
 2: (()+0x10ef0) [0x7fa54c714ef0]
 3: (gsignal()+0x35) [0x7fa54ab4d295]
 4: (abort()+0x16a) [0x7fa54ab4e6da]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fa54b4663dd]
 6: (()+0x8f196) [0x7fa54b464196]
 7: (()+0x8f1e1) [0x7fa54b4641e1]
 8: (()+0x8f3f8) [0x7fa54b4643f8]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27e) [0xb3471e]
 10: (Mutex::Lock(bool)+0x105) [0xae2395]
 11: (DispatchQueue::discard_queue(unsigned long)+0x37) [0xbeff67]
 12: (Pipe::fault(bool)+0x426) [0xc16256]
 13: (Pipe::reader()+0x3f2) [0xc1d752]
 14: (Pipe::Reader::entry()+0xd) [0xc2880d]
 15: (()+0x7474) [0x7fa54c70b474]
 16: (clone()+0x6d) [0x7fa54ac01acd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2016-05-10 03:16:30.721888 7fa5166dc700 -1 *** Caught signal (Aborted) **
 in thread 7fa5166dc700

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-osd() [0xa36cfd]
 2: (()+0x10ef0) [0x7fa54c714ef0]
 3: (gsignal()+0x35) [0x7fa54ab4d295]
 4: (abort()+0x16a) [0x7fa54ab4e6da]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fa54b4663dd]
 6: (()+0x8f196) [0x7fa54b464196]
 7: (()+0x8f1e1) [0x7fa54b4641e1]
 8: (()+0x8f3f8) [0x7fa54b4643f8]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27e) [0xb3471e]
 10: (Mutex::Lock(bool)+0x105) [0xae2395]
 11: (DispatchQueue::discard_queue(unsigned long)+0x37) [0xbeff67]
 12: (Pipe::fault(bool)+0x426) [0xc16256]
 13: (Pipe::reader()+0x3f2) [0xc1d752]
 14: (Pipe::Reader::entry()+0xd) [0xc2880d]
 15: (()+0x7474) [0x7fa54c70b474]
 16: (clone()+0x6d) [0x7fa54ac01acd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   0/ 0 ms
   1/ 5 mon
   0/ 0 monc
   1/ 5 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.11.log
--- end dump of recent events ---

Since we have been having issues with the FD usage of some OSDs (although they never terminated), we have already set the following in the systemd service file for the ceph-osd daemons; a sketch of one way to apply these follows the listing:
LimitNOFILE=1048576
LimitNPROC=1048576
TasksMax=infinity
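
One way to apply these settings is a drop-in override for the OSD unit; the unit name and path below are assumptions, not taken from this report:

# /etc/systemd/system/ceph-osd@.service.d/limits.conf   (path and unit name assumed)
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
TasksMax=infinity

# reload systemd and restart the affected OSD afterwards, e.g.
# systemctl daemon-reload && systemctl restart ceph-osd@11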

Only one OSD was affected, which is why I immediately checked these limits; both were well below the configured maximum. There is also no issue with the SSD it runs on: SMART reports everything as healthy and the counters look fine. While all of this happened, the other OSDs on the same node kept running flawlessly, which makes me rule out a network problem.
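
For reference, a quick way to spot-check a running OSD's open file descriptors against its per-process limit; the pgrep pattern is an assumption, so adjust it to how your OSDs are started:

# count open FDs of osd.11 and compare with its limit
pid=$(pgrep -f 'ceph-osd -i 11' | head -n1)
ls /proc/$pid/fd | wc -l
grep 'Max open files' /proc/$pid/limits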

The OSD has been restarted and has been working flawlessly since then.

I would appreciate a hint as to what might be causing this. I am not very experienced in debugging these kinds of assertions myself, but I will provide anything I can.

Thanks,
Torsten

PS: Copy to ceph-users@


Files

ceph.conf.gz (714 Bytes) Torsten Urbas, 05/10/2016 11:09 AM
ceph-osd.11.log.gz (5.61 KB) Torsten Urbas, 05/10/2016 11:09 AM
monitor.0.log.gz (55.7 KB) Torsten Urbas, 05/10/2016 11:09 AM
#1

Updated by Torsten Urbas almost 8 years ago

I was unable to upload the gzipped objdump. If required, I will provide it as a download somewhere.
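
For reference, the dump requested by the crash note can be produced roughly like this; the binary path is taken from the backtrace, and addr2line only yields useful names if debug symbols are installed:

# full disassembly with source interleaved, as asked for in the crash note
objdump -rdS /usr/bin/ceph-osd | gzip > ceph-osd.objdump.gz

# alternatively, resolve a single frame address from the backtrace
addr2line -Cfe /usr/bin/ceph-osd 0xbeff67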

#2

Updated by Haomai Wang almost 8 years ago

Can this be reproduced? Setting debug_ms = 20/20 would make this easier to debug.
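
As a sketch of how that could be applied (the OSD id is taken from this report; the first form needs an OSD restart, the second does not):

# in ceph.conf on the affected host, followed by an OSD restart
[osd]
    debug ms = 20/20

# or injected into the running daemon at runtime
ceph tell osd.11 injectargs '--debug_ms 20/20'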

#3

Updated by Torsten Urbas almost 8 years ago

This happened for the first time tonight. There was nothing special being done to the cluster, other than the usual nightly scrubs plus some minor I/O.

I will set the debug option and see whether we can reproduce it, although it seemed very random.

#4

Updated by Torsten Urbas almost 8 years ago

Unfortunately, the logging output is far too voluminous; we cannot keep it enabled while we wait for this to happen again. As much as I would love to, I cannot do that in a production environment.

Is there anything else we can do to help locate this?

Thanks,

#5

Updated by Greg Farnum almost 7 years ago

  • Status changed from New to Can't reproduce

The crash here is about ephemeral state, so there's no risk to the data being stored. I think we've fixed a few races in Pipe "reaping" (when connections get shut down), and I'd guess it was something to do with one of those.
