Bug #4274
Status: Closed
osd: FAILED assert(osd_lock.is_locked())
Description
So I just got a notice from my test cluster that it was unhealthy.
I checked and 7 out of 12 OSDs died with the same backtrace:
0> 2013-02-26 20:13:15.728819 7f3fcd7c6700 -1 osd/OSD.cc: In function 'void OSD::do_waiters()' thread 7f3fcd7c6700 time 2013-02-26 20:13:15.705429
osd/OSD.cc: 3357: FAILED assert(osd_lock.is_locked())

 ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
 1: (OSD::do_waiters()+0x2c3) [0x629ea3]
 2: (OSD::ms_dispatch(Message*)+0x1c2) [0x6328c2]
 3: (DispatchQueue::entry()+0x349) [0x8c1199]
 4: (DispatchQueue::DispatchThread::entry()+0xd) [0x81a0bd]
 5: (()+0x7e9a) [0x7f3fd9dd8e9a]
 6: (clone()+0x6d) [0x7f3fd885ccbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Going through their logs, they all went down at about the same time, within a 5-minute window at most.
I saw this happen a couple of days ago as well. What stands out is that, of the 3 nodes, all the OSDs on one node survive:
# id    weight  type name       up/down reweight
-1      12      root default
-3      12              rack unknownrack
-2      4                       host ceph01
0       1                               osd.0   down    0
1       1                               osd.1   down    0
2       1                               osd.2   up      1
3       1                               osd.3   down    0
-4      4                       host ceph02
4       1                               osd.4   down    0
5       1                               osd.5   down    0
6       1                               osd.6   down    0
7       1                               osd.7   down    0
-5      4                       host ceph03
8       1                               osd.8   up      1
9       1                               osd.9   up      1
10      1                               osd.10  up      1
11      1                               osd.11  up      1
One hypervisor is connected to this test cluster, running a single virtual machine. This virtual machine mirrors an Ubuntu repository by running rsync inside the VM every hour.
They all seem pretty busy before it happens; I see lines like these in the logs:
filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 84956 > 104857600
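This line is the filestore throttle kicking in: the OSD blocks new ops when the queue exceeds its op-count or byte limits. Here the op count (51) exceeded the 50-op cap, while the queued bytes (84956) were well under the 104857600-byte (100 MiB) cap. If this proves to be a bottleneck, the limits can presumably be raised in ceph.conf; the values below are illustrative only, not a recommendation:

```ini
[osd]
; Limits visible in the log line above: 50 ops / 104857600 bytes (100 MiB).
; Raising them trades memory for fewer "waiting N > M ops" stalls.
filestore queue max ops = 500
filestore queue max bytes = 104857600
```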
And they seem to run into trouble connecting to each other:
-2985> 2013-02-26 20:13:21.227224 7f62c2a6b700  5 osd.0 1856 tick
-2984> 2013-02-26 20:13:21.227305 7f62c2a6b700 -1 osd.0 1856 heartbeat_check: no reply from osd.5 since 2013-02-26 20:12:57.761992 (cutoff 2013-02-26 20:13:01.227303)
-2983> 2013-02-26 20:13:21.227345 7f62c2a6b700 -1 osd.0 1856 heartbeat_check: no reply from osd.6 since 2013-02-26 20:12:57.761992 (cutoff 2013-02-26 20:13:01.227303)
-2982> 2013-02-26 20:13:21.227363 7f62c2a6b700 -1 osd.0 1856 heartbeat_check: no reply from osd.7 since 2013-02-26 20:12:57.761992 (cutoff 2013-02-26 20:13:01.227303)
-2981> 2013-02-26 20:13:21.227382 7f62c2a6b700 -1 osd.0 1856 heartbeat_check: no reply from osd.10 since 2013-02-26 20:12:57.761992 (cutoff 2013-02-26 20:13:01.227303)
-2980> 2013-02-26 20:13:21.227401 7f62c2a6b700 -1 osd.0 1856 heartbeat_check: no reply from osd.11 since 2013-02-26 20:12:57.761992 (cutoff 2013-02-26 20:13:01.227303)
Now, these systems are not too powerful (just AMD C60 CPUs with 8GB of memory), but my statistics show they were 96% idle when this happened.
When I start the OSDs again, everything works fine and the cluster recovers.
Like I said, this is about the third time I noticed it.
I also checked the logs of osd.10 (which didn't crash) and it shows:
2013-02-26 20:13:02.649481 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:42.649479)
2013-02-26 20:13:02.649545 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:42.649479)
2013-02-26 20:13:02.649571 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:42.649479)
2013-02-26 20:13:03.474255 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.474253)
2013-02-26 20:13:03.474280 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.474253)
2013-02-26 20:13:03.474298 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.474253)
2013-02-26 20:13:03.650087 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.650086)
2013-02-26 20:13:03.650129 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.650086)
2013-02-26 20:13:03.650148 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.650086)
2013-02-26 20:13:04.650557 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:44.650555)
2013-02-26 20:13:04.650599 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:44.650555)
2013-02-26 20:13:04.650617 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:44.650555)
2013-02-26 20:13:05.650832 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.650831)
2013-02-26 20:13:05.650873 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.650831)
2013-02-26 20:13:05.650893 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.650831)
2013-02-26 20:13:05.775670 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.775669)
2013-02-26 20:13:05.775772 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.775669)
2013-02-26 20:13:05.775794 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.775669)
2013-02-26 20:13:06.651178 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:46.651177)
2013-02-26 20:13:06.651219 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:46.651177)
2013-02-26 20:13:06.651238 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:46.651177)
2013-02-26 20:13:06.756887 7fd85a68b700 0 log [WRN] : map e1857 wrongly marked me down
2013-02-26 20:13:10.729363 7fd84fc58700 0 -- [2a00:f10:113:0:798a:9f8a:6141:cfd9]:6810/1258 >> [2a00:f10:113:0:a597:2eeb:1f35:7d2d]:6801/1119 pipe(0x54c0500 sd=38 :6810 s=0 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state connecting
2013-02-26 20:13:14.621363 7fd851470700 0 -- [2a00:f10:113:0:798a:9f8a:6141:cfd9]:6810/1258 >> [2a00:f10:113:0:d585:1138:64c6:be36]:6812/1116 pipe(0x7ce2a00 sd=32 :6810 s=2 pgs=41029 cs=1 l=0).fault with nothing to send, going to standby
2013-02-26 20:13:30.082278 7fd85367d700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:10.082276)
2013-02-26 20:13:30.662580 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:10.662576)
2013-02-26 20:13:31.663007 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:11.663004)
2013-02-26 20:13:32.663434 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:12.663431)
2013-02-26 20:13:33.663993 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:13.663990)
2013-02-26 20:13:34.664564 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:14.664561)
2013-02-26 20:13:35.665123 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:15.665120)
2013-02-26 20:13:35.982974 7fd85367d700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:15.982972)
So those OSDs suddenly seem to run into trouble, stop responding, and crash.
Updated by Sage Weil about 11 years ago
- Status changed from New to Duplicate
this is actually a dup of #3816.. reported by you! :) Cherry-picked the fix to bobtail, 8c6f52215240f48b5e4d5bb99a5f2f451e7ce70a