Backport #13512
If blkid hangs, ceph-osd appears to start but does not come up on mon, and gdb can't backtrace (aka "2 of 4 OSDs are up")
History
#1 Updated by John Spray over 8 years ago
#2 Updated by John Spray almost 8 years ago
Hadn't seen this one for a while, but here's an instance:
http://qa-proxy.ceph.com/teuthology/jspray-2016-05-25_19:00:36-fs-wip-jcsp-testing-20160526---basic-mira/215434/teuthology.log
(it's on a node that has a dead /dev/rbd node, as before)
#3 Updated by John Spray almost 8 years ago
- Assignee set to Joao Eduardo Luis
Joe: do you think libstoragemgmt is going to help us here or will we still need to handle cases like this somehow in ceph?
#4 Updated by John Spray almost 8 years ago
- Assignee deleted (Joao Eduardo Luis)
(Sorry Joao, clicked the wrong thing there)
#5 Updated by Ilya Dryomov about 7 years ago
- Status changed from New to In Progress
- Assignee set to Ilya Dryomov
#6 Updated by Ilya Dryomov about 7 years ago
< jcsp> huh, just ran into http://tracker.ceph.com/issues/13512 for the first time in ages
< jcsp> I wonder if we just went a very long time without any rbd tests leaving stale device nodes on the system
http://pulpito.ceph.com/jspray-2017-01-25_12:42:21-kcephfs-master-testing-basic-smithi/745895/
On smithi022:
$ rbd showmapped
id pool image                snap device
0  rbd  i2flayeringsmithi022 -    /dev/rbd0
rbd_fio task in jewel is missing a backport of https://github.com/ceph/ceph-qa-suite/pull/1158
#7 Updated by Ilya Dryomov about 7 years ago
- Status changed from In Progress to Fix Under Review
#8 Updated by Nathan Cutler about 7 years ago
- Tracker changed from Bug to Backport
- Description updated (diff)
- Status changed from Fix Under Review to Resolved
- Target version set to v10.2.6
Description
This is happening at startup in a small minority of test runs.
teuthology-2015-10-12_23:08:03-kcephfs-master-testing-basic-multi/1105038/
The ceph-osd daemons are starting, their logs are happily spinning away, but they're not getting as far as sending their boot messages to the mon.
I caught one in the act and tried to attach a debugger, but gdb hung. I then tried running a fresh OSD process under a debugger, and it hung on Ctrl-C.
I happened to notice that the host mira106 had some dead krbd volumes (presumably from some other test, see #13510).
It seems highly likely that the OSD process is hanging inside get_device_by_uuid. For some reason the heartbeat map doesn't care that this thread is hanging.
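To illustrate the failure mode: a device lookup that blocks in the kernel takes the calling thread down with it, so one defensive pattern is to run the probe in a child process with a deadline. The following is only a minimal sketch of that pattern in Python, not what ceph-osd does and not the fix applied for this ticket. The helper name probe_with_timeout is hypothetical.

```python
import subprocess


def probe_with_timeout(argv, timeout_s):
    """Run argv in a child process and return its stdout.

    Returns None if the child exits non-zero or does not finish within
    timeout_s seconds. Hypothetical helper for illustration only.
    """
    child = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = child.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Send SIGKILL but deliberately do not wait for the child: a process
        # stuck in uninterruptible I/O (for example on a dead block device)
        # may never exit, and waiting for it would hang the caller as well.
        child.kill()
        return None
    return out if child.returncode == 0 else None
```

As an assumed usage, probe_with_timeout(["blkid", "-o", "device", "-t", "PARTUUID=<uuid>"], 5.0) would return the matching device path, or None if blkid fails or is still blocked after five seconds. The caller can then log the timeout instead of stalling silently.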
#9 Updated by Nathan Cutler about 7 years ago
- Release set to jewel