Backport #13512

If blkid hangs, ceph-osd appears to start but does not come up on mon, and gdb can't backtrace (aka "2 of 4 OSDs are up")

Added by John Spray over 1 year ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Release: jewel

History

#2 Updated by John Spray 10 months ago

Hadn't seen this one for a while, but here's an instance:
http://qa-proxy.ceph.com/teuthology/jspray-2016-05-25_19:00:36-fs-wip-jcsp-testing-20160526---basic-mira/215434/teuthology.log

(it's on a node that has a dead /dev/rbd node, as before)

#3 Updated by John Spray 10 months ago

  • Assignee set to Joao Luis

Joe: do you think libstoragemgmt is going to help us here or will we still need to handle cases like this somehow in ceph?

#4 Updated by John Spray 10 months ago

  • Assignee deleted (Joao Luis)

(Sorry Joao, clicked the wrong thing there)

#5 Updated by Ilya Dryomov about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to Ilya Dryomov

#6 Updated by Ilya Dryomov about 2 months ago

< jcsp> huh, just ran into http://tracker.ceph.com/issues/13512 for the first time in ages
< jcsp> I wonder if we just went a very long time without any rbd tests leaving stale device nodes on the system

http://pulpito.ceph.com/jspray-2017-01-25_12:42:21-kcephfs-master-testing-basic-smithi/745895/

On smithi022:

$ rbd showmapped
id pool image                snap device    
0  rbd  i2flayeringsmithi022 -    /dev/rbd0

from http://qa-proxy.ceph.com/teuthology/teuthology-2017-01-25_10:15:02-krbd-jewel-testing-basic-smithi/745512

rbd_fio task in jewel is missing a backport of https://github.com/ceph/ceph-qa-suite/pull/1158
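
To check a node for leftover mappings like the one above, the table printed by `rbd showmapped` can be turned into records. This is only a rough sketch: the helper name `parse_showmapped` is made up for illustration, and it assumes the plain-text layout shown above (one header row, whitespace-separated columns, no empty fields, with "-" standing in for a missing snap).

```python
def parse_showmapped(text):
    """Parse the plain-text table from `rbd showmapped` into a list of dicts.

    Assumes whitespace-separated columns and a single header row, as in the
    output pasted in this ticket. Hypothetical helper, not part of Ceph.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Pair each row's fields with the header names (id, pool, image, ...).
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


sample = """\
id pool image                snap device
0  rbd  i2flayeringsmithi022 -    /dev/rbd0
"""
mapped = parse_showmapped(sample)
```

Each `device` value found this way is a candidate for `rbd unmap <device>` before the next job runs on the node, assuming nothing is still using it.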

#7 Updated by Ilya Dryomov about 2 months ago

  • Status changed from In Progress to Need Review

#8 Updated by Nathan Cutler about 2 months ago

  • Tracker changed from Bug to Backport
  • Description updated (diff)
  • Status changed from Need Review to Resolved
  • Target version set to v10.2.6

description

This is happening at startup in a small minority of test runs.

teuthology-2015-10-12_23:08:03-kcephfs-master-testing-basic-multi/1105038/

The ceph-osd daemons are starting, their logs are happily spinning away, but they're not getting as far as sending their boot messages to the mon.

I caught one in the act and tried to attach a debugger, but gdb hung. I then tried running a fresh OSD process under a debugger, and it hung on Ctrl-C.

I happened to notice that the host mira106 had some dead krbd volumes (presumably from some other test, see #13510).

It seems highly likely that the OSD process is hanging inside get_device_by_uuid. For some reason the heartbeat map doesn't care that this thread is hanging.
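
One way to test this theory on an affected host, without hanging the calling process as well, is to run the block-device probe as a child process with a deadline. The following is a minimal sketch, not what ceph-osd does: `probe_completes` is a hypothetical helper, and the timeout value is arbitrary. It deliberately does not wait for the child after killing it, because a process blocked in uninterruptible I/O may never exit.

```python
import subprocess
import sys


def probe_completes(cmd, timeout_s=5.0):
    """Return True if cmd exits within timeout_s seconds, False otherwise.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Best-effort kill. Do not wait() again: a child stuck in
        # uninterruptible I/O (D state) may never be reaped.
        proc.kill()
        print(f"probe timed out: {cmd}", file=sys.stderr)
        return False


# Example (assumes blkid is installed and /dev/rbd0 is the suspect device):
# probe_completes(["blkid", "-p", "/dev/rbd0"], timeout_s=10)
```

A False result for a dead krbd device, alongside True for healthy devices on the same host, would support the idea that the OSD is stuck in the same probe.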

#9 Updated by Nathan Cutler about 2 months ago

  • Release jewel added
