Bug #5070

rbd map failed and stalled in "D"

Added by Ivan Kudryavtsev almost 11 years ago. Updated over 10 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

It happened with one device while the others mapped fine.

[6765922.875713] ------------[ cut here ]------------
[6765922.875773] WARNING: at fs/sysfs/group.c:142 blk_unregister_queue+0x60/0x6f()
[6765922.875845] Hardware name: X9DRW
[6765922.875884] sysfs group ffffffff8162d8c0 not found for kobject 'rbd83'
[6765922.875941] Modules linked in: flashcache(OF) isofs ebt_ip ebtable_filter ebtables x_tables tun cn fuse cbc ceph nfsv3 nfs_acl nfs fscache dns_resolver lockd sunrpc bridge bonding rbd libceph libcrc32c 8021q garp stp llc ib_ipath ib_uverbs ib_addr ib_umad ib_ipoib ib_cm ib_sa loop coretemp kvm_intel acpi_cpufreq tpm_tis tpm kvm snd_pcm sg snd_timer snd soundcore snd_page_alloc joydev crc32c_intel evdev hid_generic tpm_bios mperf sr_mod pcspkr cdrom processor thermal_sys psmouse lpc_ich mfd_core ib_mthca microcode wmi i2c_i801 i2c_core ib_mad serio_raw shpchp pci_hotplug ib_core ioatdma container button ext4 mbcache jbd2 crc16 dm_mod usb_storage raid1 md_mod usbhid hid sd_mod crc_t10dif ahci libahci isci libsas libata scsi_transport_sas ehci_hcd usbcore scsi_mod usb_common igb dca
[6765922.876973] Pid: 13555, comm: rbd Tainted: GF D W O 3.7.2-ceph #1
[6765922.877019] Call Trace:
[6765922.877060] [<ffffffff8103efdc>] ? warn_slowpath_common+0x78/0x8c
[6765922.877105] [<ffffffff8103f08f>] ? warn_slowpath_fmt+0x45/0x4a
[6765922.877152] [<ffffffff81162fc8>] ? sysfs_remove_group+0x26/0xbc
[6765922.877198] [<ffffffff811a87db>] ? blk_unregister_queue+0x60/0x6f
[6765922.877244] [<ffffffff811ae16e>] ? del_gendisk+0x113/0x1c0
[6765922.877292] [<ffffffffa02cc9e5>] ? rbd_free_disk+0x1b/0x37 [rbd]
[6765922.877337] [<ffffffffa02cd03d>] ? rbd_dev_release+0xd4/0x14a [rbd]
[6765922.877384] [<ffffffff81260c7a>] ? device_release+0x4e/0x83
[6765922.877430] [<ffffffff811bd8b8>] ? kobject_release+0x48/0x5e
[6765922.877474] [<ffffffffa02cc18f>] ? rbd_remove+0x13c/0x156 [rbd]
[6765922.877526] [<ffffffff811607ec>] ? sysfs_write_file+0xe0/0x11c
[6765922.877573] [<ffffffff8110c0e5>] ? vfs_write+0xa4/0xff
[6765922.877622] [<ffffffff8110c231>] ? sys_write+0x58/0x92
[6765922.877673] [<ffffffff8136cbe9>] ? system_call_fastpath+0x16/0x1b
[6765922.877719] ---[ end trace c657e437db292283 ]---

History

#1 Updated by Ivan Kudryavtsev almost 11 years ago

Probably connected with #4522.

#2 Updated by Sage Weil almost 11 years ago

  • Project changed from Ceph to rbd

#3 Updated by Alex Elder almost 11 years ago

This is somewhat old code, and there are a few bugs
that have since been fixed that could be contributing
to this. There have also been some recent changes,
not bug fixes, that might have made the underlying
problem disappear.

That having been said...

You were evidently unmapping rbd device /dev/rbd/rbd/rbd83.
This caused the last reference to the device to be dropped,
which triggered a cascade of cleanup activity. One of the
actions was to remove a file from sysfs (something under /sys).
And in doing that, there was a problem because the entry in
a sysfs directory that was expected was not found, so
it couldn't be removed.

I don't know why it's basically hung at that point, but
this points at the problem, so I guess a hang is one
possible outcome.

I have a few questions:
- Can you describe, specifically, what version of the
kernel you are working with? It shows as "3.7.2-ceph",
and it also shows the kernel is tainted, so it's not
just a stock 3.7.2 kernel.
- If possible, can you look in the System.map file associated
with your kernel for the symbol blk_trace_attr_group? I.e.,
something like:
grep blk_trace_attr_group /boot/System.map-3.7.2-ceph
- Were there any other unusual messages logged to the
console prior to this hang?
- Were you doing anything with blktrace?

#4 Updated by Ivan Kudryavtsev almost 11 years ago

I use vanilla 3.7.2, built with the Debian make-kpkg environment.

root@tsk-vps-node-04:/sys# grep blk_trace_attr_group /boot/System.map-3.7.2-ceph
ffffffff8162d8c0 D blk_trace_attr_group

I'm not using blktrace, and I don't understand the question "Were you doing anything with blktrace?"

I hit the situation again, on "rbd map" immediately after an "rbd create". This RBD had never even been mapped before. There were no messages in the kernel logs, but rbd map is stuck in the "D" state.
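A process stuck in "D" (uninterruptible sleep) can generally only be diagnosed, not killed. A generic sketch for listing such processes and then dumping a hung task's kernel stack (the procfs path is standard; the pid is illustrative, and /proc/<pid>/stack needs root):

```shell
# List processes whose state column begins with "D" (uninterruptible sleep),
# keeping the header line for readability.
ps axo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'

# For a specific hung pid, the kernel stack shows where it is blocked:
#   cat /proc/<pid>/stack
```

The kernel stack of a hung "rbd map" process is usually the most useful thing to attach to a report like this one.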

#5 Updated by Ivan Kudryavtsev almost 11 years ago

I added one more RBD device and mapped it, and it mapped OK; after that I tried to map the previous one again, and it succeeded. But all the previously hung commands still persist:

 2734 ?        D      0:00 rbd map lun-000000031
 4658 ?        D      0:00 rbd --pool rbd map lun-legacy-vm1
 5501 ?        D      0:00 rbd map lun-legacy-vm-bw-web01
 9709 ?        D      0:00 rbd map lun-000000031
11261 ?        D      0:00 rbd --pool rbd map lun-legacy-vm1
13729 ?        D      0:00 rbd --pool rbd map lun-legacy-vm1
14467 ?        D      0:00 rbd map lun-legacy-vm-bw-web01
32588 ?        D      0:00 rbd --pool rbd map lun-legacy-bw-sirius

#6 Updated by Alex Elder almost 11 years ago

Quick summary: I don't think this is an rbd problem.

The reason I asked about "blktrace" is that the problem
you are hitting seems to be related to it.  The message
that got printed in your first report said:
sysfs group ffffffff8162d8c0 not found for kobject 'rbd83'

And you confirmed my suspicion that the hex value shown
there is in fact:
    ffffffff8162d8c0 D blk_trace_attr_group

Here is a summary of how we got to that spot in the code:
rbd_remove+0x13c/0x156 [rbd]
  kobject_release+0x48/0x5e
    device_release+0x4e/0x83
      rbd_dev_release+0xd4/0x14a [rbd]
    rbd_free_disk+0x1b/0x37 [rbd]
      del_gendisk+0x113/0x1c0
        blk_unregister_queue+0x60/0x6f
          sysfs_remove_group+0x26/0xbc
              <problems here>

Here's some more detail:

rbd_remove+0x13c/0x156 [rbd]
    This is due to the "rbd unmap" command.  It ends up calling
    rbd_bus_dev_del(), which calls device_unregister(), which calls
    put_device(), which calls kobject_put(), which has here been
    called to drop the last release to the device, so it calls...
kobject_release+0x48/0x5e
    This calls kobject_cleanup(), which ultimately calls...
device_release+0x4e/0x83
    And this calls the release function we supplied, which is...
rbd_dev_release+0xd4/0x14a [rbd]
    This calls...
rbd_free_disk+0x1b/0x37 [rbd]
    Which calls...
del_gendisk+0x113/0x1c0
    Which calls...
blk_unregister_queue+0x60/0x6f
    And that calls blk_trace_remove_sysfs(), which is simply
    this:
    sysfs_remove_group(&dev->kobj, &blk_trace_attr_group);
sysfs_remove_group+0x26/0xbc
    This tries to get the sysfs entry for "trace" in the sysfs
    directory associated with the mapped device you're removing,
    and it doesn't find one.

I don't know why it's not found there.  When the device initially
got mapped, the add_disk() call should have set that sysfs entry up
in blk_trace_init_sysfs().
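One way to see whether that attribute group actually exists for a mapped device is to look under /sys/block directly. A sketch, where "rbd83" is the device from the report and the file names are the attributes the blk_trace code creates (enable, act_mask, pid, start_lba, end_lba):

```shell
# The blk_trace attribute group for a block device lives at
# /sys/block/<dev>/trace/.  If it is present, list its attributes;
# otherwise report that the group is missing (the failure mode this
# bug's warning describes).
dev=rbd83   # device name from the report; substitute your own
if [ -d "/sys/block/$dev/trace" ]; then
    ls "/sys/block/$dev/trace"
else
    echo "trace group missing for $dev"
fi
```

If the directory is already missing while the device is still mapped, the later sysfs_remove_group() warning at unmap time would follow naturally.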

I suspect this isn't an rbd problem.  I did see some old
conversation about someone hitting something like this, with
followups from Greg Kroah-Hartman and Andrew Morton complaining
about the sysfs code...

#7 Updated by Sage Weil over 10 years ago

  • Status changed from New to Can't reproduce
