Bug #4274 (closed)

osd: FAILED assert(osd_lock.is_locked())

Added by Wido den Hollander about 11 years ago. Updated about 11 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

So I just got a notice from my test cluster that it was unhealthy.

I checked and 7 out of 12 OSDs died with the same backtrace:

     0> 2013-02-26 20:13:15.728819 7f3fcd7c6700 -1 osd/OSD.cc: In function 'void OSD::do_waiters()' thread 7f3fcd7c6700 time 2013-02-26 20:13:15.705429
osd/OSD.cc: 3357: FAILED assert(osd_lock.is_locked())

 ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
 1: (OSD::do_waiters()+0x2c3) [0x629ea3]
 2: (OSD::ms_dispatch(Message*)+0x1c2) [0x6328c2]
 3: (DispatchQueue::entry()+0x349) [0x8c1199]
 4: (DispatchQueue::DispatchThread::entry()+0xd) [0x81a0bd]
 5: (()+0x7e9a) [0x7f3fd9dd8e9a]
 6: (clone()+0x6d) [0x7f3fd885ccbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
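
The assertion itself is an invariant check: do_waiters() drains messages that were deferred (for example while waiting for a newer OSDMap) and expects the dispatching thread to already hold osd_lock, which ms_dispatch() in the backtrace is supposed to take first. A minimal sketch of that pattern, with hypothetical names and std::mutex standing in for Ceph's own Mutex (the real OSD.cc differs):

// Hypothetical sketch, not the actual Ceph source.
#include <cassert>
#include <list>
#include <mutex>

struct Message;

class OSDSketch {
  std::recursive_mutex osd_lock;            // stand-in for Ceph's Mutex osd_lock
  bool lock_held = false;                   // std::mutex has no is_locked(), so track it
  std::list<Message*> waiting_for_osdmap;   // messages deferred until the OSD is ready

  void do_waiters() {
    assert(lock_held);                      // analogue of assert(osd_lock.is_locked())
    while (!waiting_for_osdmap.empty()) {
      handle(waiting_for_osdmap.front());
      waiting_for_osdmap.pop_front();
    }
  }

public:
  // Messenger dispatch path seen in the backtrace: take the lock,
  // drain any waiters, then handle the new message.
  void ms_dispatch(Message* m) {
    std::lock_guard<std::recursive_mutex> l(osd_lock);
    lock_held = true;
    do_waiters();
    handle(m);
    lock_held = false;
  }

private:
  void handle(Message*) { /* process one message */ }
};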

Going through their logs, they all went down at about the same time, within a 5-minute window at most.

I saw this once before, a couple of days ago, and what stands out is that of the 3 nodes, all the OSDs on one of the nodes survive:

# id    weight    type name    up/down    reweight
-1    12    root default
-3    12        rack unknownrack
-2    4            host ceph01
0    1                osd.0    down    0    
1    1                osd.1    down    0    
2    1                osd.2    up    1    
3    1                osd.3    down    0    
-4    4            host ceph02
4    1                osd.4    down    0    
5    1                osd.5    down    0    
6    1                osd.6    down    0    
7    1                osd.7    down    0    
-5    4            host ceph03
8    1                osd.8    up    1    
9    1                osd.9    up    1    
10    1                osd.10    up    1    
11    1                osd.11    up    1

One hypervisor is connected to this test cluster, running a single virtual machine. That VM mirrors an Ubuntu repository by running rsync every hour.

So they all seem pretty busy before it happens; I see lines like this in the logs:

filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 84956 > 104857600
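
That line comes from the filestore's op queue throttle: a new op has to wait whenever the queued op count or queued bytes would exceed the configured limits (here 50 ops and 104857600 bytes, i.e. 100 MiB). A rough sketch of that check, using made-up names rather than the real FileStore members:

// Hypothetical sketch of the kind of queue throttle behind that log line.
// The limits match what the log implies (50 ops, 100 MiB); the names are illustrative.
#include <cstdint>
#include <cstdio>

struct QueueThrottle {
  uint64_t max_ops   = 50;
  uint64_t max_bytes = 104857600;   // 100 MiB
  uint64_t cur_ops   = 0;
  uint64_t cur_bytes = 0;

  // Returns true when a newly submitted op must wait, mirroring the
  // "waiting 51 > 50 ops || 84956 > 104857600" message format.
  bool must_wait(uint64_t op_bytes) const {
    uint64_t ops   = cur_ops + 1;
    uint64_t bytes = cur_bytes + op_bytes;
    bool wait = ops > max_ops || bytes > max_bytes;
    if (wait)
      std::printf("waiting %llu > %llu ops || %llu > %llu\n",
                  (unsigned long long)ops, (unsigned long long)max_ops,
                  (unsigned long long)bytes, (unsigned long long)max_bytes);
    return wait;
  }
};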

And they seem to run into trouble connecting to each other:

 -2985> 2013-02-26 20:13:21.227224 7f62c2a6b700  5 osd.0 1856 tick
 -2984> 2013-02-26 20:13:21.227305 7f62c2a6b700 -1 osd.0 1856 heartbeat_check: no reply from osd.5 since 2013-02-26 20:12:57.761992 (cutoff 2013-02-26 20:13:01.227303)
 -2983> 2013-02-26 20:13:21.227345 7f62c2a6b700 -1 osd.0 1856 heartbeat_check: no reply from osd.6 since 2013-02-26 20:12:57.761992 (cutoff 2013-02-26 20:13:01.227303)
 -2982> 2013-02-26 20:13:21.227363 7f62c2a6b700 -1 osd.0 1856 heartbeat_check: no reply from osd.7 since 2013-02-26 20:12:57.761992 (cutoff 2013-02-26 20:13:01.227303)
 -2981> 2013-02-26 20:13:21.227382 7f62c2a6b700 -1 osd.0 1856 heartbeat_check: no reply from osd.10 since 2013-02-26 20:12:57.761992 (cutoff 2013-02-26 20:13:01.227303)
 -2980> 2013-02-26 20:13:21.227401 7f62c2a6b700 -1 osd.0 1856 heartbeat_check: no reply from osd.11 since 2013-02-26 20:12:57.761992 (cutoff 2013-02-26 20:13:01.227303)
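
Those heartbeat_check lines are the failure detector at work: on every tick each OSD compares the last ping reply from each heartbeat peer against a cutoff of now minus the grace period (20 seconds here, matching the timestamps above), and reports peers whose last reply predates the cutoff. A rough sketch of that comparison, with illustrative names and types rather than Ceph's own:

// Hypothetical sketch of the heartbeat cutoff check.
#include <chrono>
#include <cstdio>
#include <map>

using Clock = std::chrono::system_clock;

struct HeartbeatCheck {
  std::chrono::seconds grace{20};                  // grace period, 20 s as seen in the log
  std::map<int, Clock::time_point> last_reply;     // osd id -> last ping reply received

  void check(Clock::time_point now) {
    const Clock::time_point cutoff = now - grace;
    for (const auto& [osd, when] : last_reply) {
      if (when < cutoff) {
        // analogue of "heartbeat_check: no reply from osd.N since ... (cutoff ...)"
        std::printf("heartbeat_check: no reply from osd.%d\n", osd);
        // the real OSD then reports the peer to the monitors as failed
      }
    }
  }
};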

Now, these systems are not too powerful, just AMD C60 CPUs with 8GB of memory, but my statistics show they were 96% idle when this happened.

When I start the OSDs again, everything works fine and the cluster recovers.

Like I said, this is about the third time I've noticed it.

I also checked the logs of osd.10 (which didn't crash) and it shows:

2013-02-26 20:13:02.649481 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:42.649479)
2013-02-26 20:13:02.649545 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:42.649479)
2013-02-26 20:13:02.649571 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:42.649479)
2013-02-26 20:13:03.474255 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.474253)
2013-02-26 20:13:03.474280 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.474253)
2013-02-26 20:13:03.474298 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.474253)
2013-02-26 20:13:03.650087 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.650086)
2013-02-26 20:13:03.650129 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.650086)
2013-02-26 20:13:03.650148 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:43.650086)
2013-02-26 20:13:04.650557 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:44.650555)
2013-02-26 20:13:04.650599 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:44.650555)
2013-02-26 20:13:04.650617 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:44.650555)
2013-02-26 20:13:05.650832 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.650831)
2013-02-26 20:13:05.650873 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.650831)
2013-02-26 20:13:05.650893 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.650831)
2013-02-26 20:13:05.775670 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.775669)
2013-02-26 20:13:05.775772 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.775669)
2013-02-26 20:13:05.775794 7fd85367d700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:45.775669)
2013-02-26 20:13:06.651178 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.0 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:46.651177)
2013-02-26 20:13:06.651219 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.1 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:46.651177)
2013-02-26 20:13:06.651238 7fd86269b700 -1 osd.10 1856 heartbeat_check: no reply from osd.3 since 2013-02-26 20:12:42.365716 (cutoff 2013-02-26 20:12:46.651177)
2013-02-26 20:13:06.756887 7fd85a68b700  0 log [WRN] : map e1857 wrongly marked me down
2013-02-26 20:13:10.729363 7fd84fc58700  0 -- [2a00:f10:113:0:798a:9f8a:6141:cfd9]:6810/1258 >> [2a00:f10:113:0:a597:2eeb:1f35:7d2d]:6801/1119 pipe(0x54c0500 sd=38 :6810 s=0 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state connecting
2013-02-26 20:13:14.621363 7fd851470700  0 -- [2a00:f10:113:0:798a:9f8a:6141:cfd9]:6810/1258 >> [2a00:f10:113:0:d585:1138:64c6:be36]:6812/1116 pipe(0x7ce2a00 sd=32 :6810 s=2 pgs=41029 cs=1 l=0).fault with nothing to send, going to standby
2013-02-26 20:13:30.082278 7fd85367d700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:10.082276)
2013-02-26 20:13:30.662580 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:10.662576)
2013-02-26 20:13:31.663007 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:11.663004)
2013-02-26 20:13:32.663434 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:12.663431)
2013-02-26 20:13:33.663993 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:13.663990)
2013-02-26 20:13:34.664564 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:14.664561)
2013-02-26 20:13:35.665123 7fd86269b700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:15.665120)
2013-02-26 20:13:35.982974 7fd85367d700 -1 osd.10 1862 heartbeat_check: no reply from osd.4 ever, first ping sent 2013-02-26 20:13:09.876085 (cutoff 2013-02-26 20:13:15.982972)

So those OSDs suddenly seem to get into trouble, stop responding, and crash.


Files

ceph-osd.0.log.gz (2.14 MB) ceph-osd.0.log.gz Wido den Hollander, 02/26/2013 01:06 PM
ceph-osd.3.log.gz (2.19 MB) ceph-osd.3.log.gz Wido den Hollander, 02/26/2013 01:06 PM
ceph-osd.4.log.gz (2.12 MB) ceph-osd.4.log.gz Wido den Hollander, 02/26/2013 01:06 PM
ceph-osd.7.log.gz (2.21 MB) ceph-osd.7.log.gz Wido den Hollander, 02/26/2013 01:06 PM
#1

Updated by Sage Weil about 11 years ago

  • Status changed from New to Duplicate

This is actually a dup of #3816... reported by you! :) Cherry-picked the fix to bobtail: 8c6f52215240f48b5e4d5bb99a5f2f451e7ce70a
