Bug #3770: OSD crashes on boot
Status: Closed
Description
One of my 0.56.1 OSDs crashed and couldn't boot: it was hitting tp_op heartbeat timeouts, and even after increasing the timeout I was getting nothing but:
2013-01-08 23:57:25.337731 7fc515c26700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.174:6818/8710 pipe(0x3cceb6c0 sd=56 :0 pgs=0 cs=0 l=0).fault with nothing to send, going to standby
2013-01-08 23:57:29.043846 7fc515b25700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.174:6845/4111 pipe(0x3cceb240 sd=57 :32953 pgs=0 cs=0 l=0).connect claims to be 10.64.0.174:6845/11414 not 10.64.0.174:6845/4111 - wrong node!
2013-01-08 23:57:29.043957 7fc515b25700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.174:6845/4111 pipe(0x3cceb240 sd=57 :32953 pgs=0 cs=0 l=0).fault with nothing to send, going to standby
2013-01-08 23:57:38.310206 7fc515a24700 0 -- 0.0.0.0:6805/31953 >> 10.64.0.173:6842/821 pipe(0x16bf0d80 sd=58 :0 pgs=0 cs=0 l=0).fault with nothing to send, going to standby
I waited a few hours and left the cluster to recover and become healthy again. Now it's HEALTH_OK and all pgs are active+clean.
However, when trying now to start the OSD in question, it immediately dies on boot on assert(_get_map_bl(epoch, bl)). Attached is the --debug_ms 20 --debug_osd 20 log and a full backtrace from gdb.
This is on ceph.com 0.56.1 packages in a Ubuntu 12.04 LTS platform.
Updated by Samuel Just over 11 years ago
- Status changed from New to Need More Info
From the backtrace:
pgid = {m_pool = 4, m_seed = 249, m_preferred = -1}
Based on the info attr, we try to load map 10705, which is about 20k maps behind the other PGs. This suggests that the attr may be invalid.
Can you attach a hex dump of the attributes on the current/4.f9_head collection on the crashed osd?
Updated by Faidon Liambotis over 11 years ago
root@ms-be1003:/var/lib/ceph/osd/ceph-27/current/4.f9_head# attr -lq $PWD | while read attr; do echo $attr; attr -q -g $attr $PWD | hd; echo; done
cephos.collection_version
00000000 03 00 00 00 |....|
00000004
cephos.phash.contents
00000000 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000010 00 |.|
00000011
ceph.ondisklog
00000000 05 03 1c 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000022
ceph.info
00000000 05 d1 29 00 00 |..)..|
00000005
For what it's worth, pool 4 is .rgw.gc.
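For reference, the ceph.info dump above decodes to the very epoch named in the backtrace. Assuming the leading 0x05 byte is an encoding-version prefix and the next four bytes are a little-endian 32-bit epoch (an assumption about the on-disk layout, not confirmed in this thread), a quick check:

```python
# Decode the ceph.info attr dump: 05 d1 29 00 00
# Assumption: 0x05 is a version prefix; the following four bytes
# hold the epoch as a little-endian unsigned 32-bit integer.
raw = bytes([0x05, 0xD1, 0x29, 0x00, 0x00])
epoch = int.from_bytes(raw[1:5], "little")
print(epoch)  # 10705 -- matches the map epoch the OSD tries to load
```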
Updated by Faidon Liambotis over 11 years ago
- File ceph-osd.27.meta.gz ceph-osd.27.meta.gz added
root@ms-be1003:/var/lib/ceph/osd/ceph-27# find current/meta/ | tee ~/ceph-osd.27.meta | wc -l
42992
Attached.
Updated by Faidon Liambotis over 11 years ago
sjust said that we're done collecting information and that I could rm the pg directory/log/info, which I did. Unfortunately, it keeps crashing on boot, so there are probably more PGs like that...
Updated by Mike Dawson over 11 years ago
I'm seeing this same assert failure when trying to startup 3 of my OSDs. Happy to provide feedback for the debugging effort if needed.
Updated by Samuel Just over 11 years ago
The fault is in OSD::handle_osd_map where we trim old maps. Prior to 0.50, the pgs would have processed up to the current OSD map by this point. Post 0.50, however, pgs may lag behind the OSD map epoch. In an extreme case, the OSD might trim past a map needed by a PG. This is what happened here. Working on patch now.
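The guard described above can be sketched as follows. This is a hypothetical simplification in Python, not Ceph's actual C++ code: before trimming old maps, clamp the trim target to the oldest map epoch any local PG has processed up to.

```python
def safe_trim_target(osdmap_epoch, pg_epochs, keep=500):
    """Return the highest map epoch that may safely be trimmed.

    osdmap_epoch: the OSD's current map epoch
    pg_epochs: the epoch each local PG has processed up to
    keep: how many recent maps to retain regardless (hypothetical knob)
    """
    # Post-0.50, a PG may lag behind the OSD map epoch; the maps it
    # still needs must not be trimmed out from under it.
    oldest_needed = min(pg_epochs, default=osdmap_epoch)
    return min(osdmap_epoch - keep, oldest_needed)

# With one PG stuck at epoch 10705 (as in this bug), trimming stops there:
print(safe_trim_target(30000, [10705, 29900]))  # 10705
```

Without the `oldest_needed` clamp, the trim would proceed to epoch 29500 and delete map 10705, producing exactly the assert(_get_map_bl(epoch, bl)) failure on the next boot.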
Updated by Samuel Just over 11 years ago
- Status changed from Fix Under Review to Resolved
66eb93b83648b4561b77ee6aab5b484e6dba4771
Updated by Faidon Liambotis over 11 years ago
So, my (very basic) understanding is that the fix ensures the trim won't happen in the first place.
How about the crash that I'm experiencing right now though? Would it be possible for the OSD to recover without the manual action of deleting PGs from the filesystem?
Updated by Samuel Just over 11 years ago
Yeah, I just pushed a work-around branch, wip-bobtail-load_pgs-workaround (which I haven't tested much, so ideally try it on a node you can afford to lose). There is a scenario in which this would be a problem, but if you have not been expanding the number of PGs in your pools, you won't hit it.