If blkid hangs, ceph-osd appears to start but does not come up on mon, and gdb can't backtrace (aka "2 of 4 OSDs are up")
#1 Updated by John Spray over 1 year ago
#2 Updated by John Spray 11 months ago
Hadn't seen this one for a while, but here's an instance:
(it's on a node that has a dead /dev/rbd node, as before)
#6 Updated by Ilya Dryomov 3 months ago
< jcsp> huh, just ran into http://tracker.ceph.com/issues/13512 for the first time in ages
< jcsp> I wonder if we just went a very long time without any rbd tests leaving stale device nodes on the system
$ rbd showmapped
id pool image                snap device
0  rbd  i2flayeringsmithi022 -    /dev/rbd0
rbd_fio task in jewel is missing a backport of https://github.com/ceph/ceph-qa-suite/pull/1158
#8 Updated by Nathan Cutler 3 months ago
- Tracker changed from Bug to Backport
- Description updated (diff)
- Status changed from Need Review to Resolved
- Target version set to v10.2.6
This is happening at startup in a small minority of test runs.
The ceph-osd daemons are starting, their logs are happily spinning away, but they're not getting as far as sending their boot messages to the mon.
I caught one in the act and tried to attach a debugger, but gdb hung. I then tried running a fresh osd process under a debugger, and it hung on ctrl-c.
I happened to notice that the host mira106 had some dead krbd volumes (presumably from some other test, see #13510).
It seems highly likely that the OSD process is hanging inside get_device_by_uuid. For some reason the heartbeat map doesn't care that this thread is hanging.
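For illustration only, here is a minimal Python sketch of the general idea of bounding a probe that might hang on a dead device. It is not the ceph-osd code path (get_device_by_uuid is C++), and the helper name probe_with_timeout and the timeout values are made up for this example. The probe runs in a child process and the caller gives up after a deadline instead of blocking forever.

```python
import subprocess
import sys


def probe_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it did not finish in time.

    Hypothetical helper for illustration, not part of Ceph.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Request a kill, but do not wait for the child afterwards:
        # a process stuck in uninterruptible I/O may not exit until
        # the I/O returns, and waiting here would hang the caller too.
        proc.kill()
        return None


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere; on an affected host
    # the command would be whatever device probe is suspected of hanging.
    print(probe_with_timeout([sys.executable, "-c", "pass"], 10.0))
    print(probe_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
```

The sketch deliberately uses Popen.wait rather than subprocess.run with a timeout, because run waits for the child to exit after killing it, which could block on a child that cannot be killed. A None result only says the probe did not return in time; it does not identify which device is at fault.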