Bug #3925

krbd: sysfs write lockdep warnings

Added by Alex Elder about 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While testing with the new rbd request code, I'm seeing lockdep
warnings like the following, I think while the "rbd.xfstests" task
is setting up.  I only started seeing this when I switched to a
different cluster of machines, so perhaps there is a slight
difference in timing behavior.  This may also be the only task
or workunit that sets up multiple rbd image mappings, which may
offer some explanation as well.

It involves the sysfs files (/sys/bus/rbd/*), and that code has
not really changed, so it's possible this is an old problem that's
being exercised differently now.

[  525.498895] 
[  525.516878] ======================================================
[  525.541046] [ INFO: possible circular locking dependency detected ]
[  525.565642] 3.6.0-ceph-00222-g505c658 #1 Not tainted
[  525.588899] -------------------------------------------------------
[  525.614355] tee/17751 is trying to acquire lock:
[  525.638159]  (ctl_mutex/1){+.+.+.}, at: [<ffffffffa03af727>] rbd_dev_refresh+0x47/0x2f0 [rbd]
[  525.687269] 
[  525.687269] but task is already holding lock:
[  525.733113]  (s_active#92){++++.+}, at: [<ffffffff811ec66d>] sysfs_write_file+0xcd/0x170
[  525.784400] 
[  525.784400] which lock already depends on the new lock.
[  525.784400] 
[  525.854564] 
[  525.854564] the existing dependency chain (in reverse order) is:
[  525.903568] 
-> #1 (s_active#92){++++.+}:
[  525.951843]        [<ffffffff810b2e22>] lock_acquire+0xa2/0x140
[  525.981549]        [<ffffffff811ed896>] sysfs_deactivate+0x116/0x160
[  526.011722]        [<ffffffff811ee37b>] sysfs_addrm_finish+0x3b/0x70
[  526.041742]        [<ffffffff811ec3cb>] sysfs_hash_and_remove+0x5b/0xb0
[  526.072621]        [<ffffffff811f01c1>] remove_files.isra.1+0x31/0x40
[  526.103685]        [<ffffffff811f05bd>] sysfs_remove_group+0x4d/0x100
[  526.135077]        [<ffffffff813f67eb>] device_remove_groups+0x3b/0x60
[  526.167361]        [<ffffffff813f6ae4>] device_remove_attrs+0x44/0x80
[  526.199721]        [<ffffffff813f73d5>] device_del+0x125/0x1c0
[  526.231522]        [<ffffffff813f7492>] device_unregister+0x22/0x60
[  526.263840]        [<ffffffffa03adb4f>] rbd_remove+0x1bf/0x1d0 [rbd]
[  526.296184]        [<ffffffff813f8b97>] bus_attr_store+0x27/0x30
[  526.327702]        [<ffffffff811ec686>] sysfs_write_file+0xe6/0x170
[  526.359090]        [<ffffffff8117bee3>] vfs_write+0xb3/0x180
[  526.389392]        [<ffffffff8117c20a>] sys_write+0x4a/0x90
[  526.418770]        [<ffffffff8163e1e9>] system_call_fastpath+0x16/0x1b
[  526.448865] 
-> #0 (ctl_mutex/1){+.+.+.}:
[  526.497322]        [<ffffffff810b2788>] __lock_acquire+0x1ac8/0x1b90
[  526.526997]        [<ffffffff810b2e22>] lock_acquire+0xa2/0x140
[  526.556295]        [<ffffffff8163253b>] mutex_lock_nested+0x4b/0x320
[  526.585443]        [<ffffffffa03af727>] rbd_dev_refresh+0x47/0x2f0 [rbd]
[  526.614405]        [<ffffffffa03af9ef>] rbd_image_refresh+0x1f/0x40 [rbd]
[  526.643916]        [<ffffffff813f6128>] dev_attr_store+0x18/0x30
[  526.672582]        [<ffffffff811ec686>] sysfs_write_file+0xe6/0x170
[  526.701641]        [<ffffffff8117bee3>] vfs_write+0xb3/0x180
[  526.730530]        [<ffffffff8117c20a>] sys_write+0x4a/0x90
[  526.758747]        [<ffffffff8163e1e9>] system_call_fastpath+0x16/0x1b
[  526.787895] 
[  526.787895] other info that might help us debug this:
[  526.787895] 
[  526.861473]  Possible unsafe locking scenario:
[  526.861473] 
[  526.911036]        CPU0                    CPU1
[  526.937266]        ----                    ----
[  526.963037]   lock(s_active#92);
[  526.987106]                                lock(ctl_mutex/1);
[  527.014601]                                lock(s_active#92);
[  527.041835]   lock(ctl_mutex/1);
[  527.066248] 
[  527.066248]  *** DEADLOCK ***
[  527.066248] 
[  527.132154] 2 locks held by tee/17751:
[  527.155697]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff811ec5e4>] sysfs_write_file+0x44/0x170
[  527.207512]  #1:  (s_active#92){++++.+}, at: [<ffffffff811ec66d>] sysfs_write_file+0xcd/0x170
[  527.261343] 
[  527.261343] stack backtrace:
[  527.308416] Pid: 17751, comm: tee Not tainted 3.6.0-ceph-00222-g505c658 #1
[  527.337963] Call Trace:
[  527.362447]  [<ffffffff8162b274>] print_circular_bug+0x1fb/0x20c
[  527.391626]  [<ffffffff81085fec>] ? ttwu_stat+0x4c/0x140
[  527.419942]  [<ffffffff810b2788>] __lock_acquire+0x1ac8/0x1b90
[  527.448916]  [<ffffffff810b393d>] ? trace_hardirqs_on+0xd/0x10
[  527.477852]  [<ffffffffa03af727>] ? rbd_dev_refresh+0x47/0x2f0 [rbd]
[  527.507874]  [<ffffffff810b2e22>] lock_acquire+0xa2/0x140
[  527.536802]  [<ffffffffa03af727>] ? rbd_dev_refresh+0x47/0x2f0 [rbd]
[  527.566606]  [<ffffffff8163253b>] mutex_lock_nested+0x4b/0x320
[  527.595791]  [<ffffffffa03af727>] ? rbd_dev_refresh+0x47/0x2f0 [rbd]
[  527.625661]  [<ffffffff810b28f0>] ? lock_release_non_nested+0xa0/0x310
[  527.655945]  [<ffffffffa03af727>] rbd_dev_refresh+0x47/0x2f0 [rbd]
[  527.686315]  [<ffffffffa03af9ef>] rbd_image_refresh+0x1f/0x40 [rbd]
[  527.717062]  [<ffffffff811ec66d>] ? sysfs_write_file+0xcd/0x170
[  527.748060]  [<ffffffff813f6128>] dev_attr_store+0x18/0x30
[  527.778749]  [<ffffffff811ec686>] sysfs_write_file+0xe6/0x170
[  527.809654]  [<ffffffff8117bee3>] vfs_write+0xb3/0x180
[  527.840222]  [<ffffffff8117c20a>] sys_write+0x4a/0x90
[  527.870203]  [<ffffffff8163e1e9>] system_call_fastpath+0x16/0x1b
[  532.309485] rbd: rbd3: obj_request read result -6
[  532.309485] 
[  532.368026] end_request: I/O error, dev rbd3, sector 0
[  532.400245] quiet_error: 48 callbacks suppressed
[  532.431396] Buffer I/O error on device rbd3, logical block 0
[  532.463331] Buffer I/O error on device rbd3, logical block 1
[  532.493904] Buffer I/O error on device rbd3, logical block 2
[  532.523110] Buffer I/O error on device rbd3, logical block 3
[  532.552207] rbd: rbd3: obj_request read result -6
[  532.552207] 
[  532.604603] end_request: I/O error, dev rbd3, sector 0
[  532.632642] Buffer I/O error on device rbd3, logical block 0

History

#1 Updated by Alex Elder almost 11 years ago

I was working on trying to argue that this problem goes
away when the sysfs files for snapshots go away:
http://tracker.ceph.com/issues/4774
rbd: set up devices only for mapped images

But in the process I think I may have found the problem,
which would remain even with the above patch in place.

I believe the lockdep problem is:

- When unmapping an rbd image, the rbd CLI writes the device id
  into /sys/bus/rbd/remove:
  - that first acquires ctl_mutex
  - then later, sysfs acquires its s_active lock for each
    entry that needs to be removed.

- Meanwhile, it is possible to force rbd to refresh its header
  information by writing to /sys/bus/rbd/devices/<dev_id>/refresh:
  - sysfs acquires the s_active lock to write to the file
  - writing to the "refresh" entry causes rbd_dev_refresh()
    to be called, and that acquires ctl_mutex.

So the two paths take the same locks in opposite orders, which
can lead to deadlock, as the sketch below illustrates.
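
Here is a minimal userspace sketch of that inversion, using pthread
mutexes as stand-ins for ctl_mutex and the sysfs s_active reference
(the names and timing are illustrative assumptions, not the kernel's
actual locking primitives):

    /*
     * Minimal userspace analogue of the inversion above -- illustrative
     * only, not the kernel's code.  "ctl" stands in for rbd's ctl_mutex
     * and "s_active" for the sysfs active reference.  With unlucky
     * timing the two threads deadlock, which is exactly what lockdep
     * is warning about.  Build with: cc -pthread demo.c
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t ctl = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t s_active = PTHREAD_MUTEX_INITIALIZER;

    /* Path 1: write to /sys/bus/rbd/remove -- ctl first, then s_active. */
    static void *remove_path(void *arg)
    {
            pthread_mutex_lock(&ctl);       /* rbd_remove() takes ctl_mutex */
            usleep(1000);                   /* widen the race window */
            pthread_mutex_lock(&s_active);  /* sysfs teardown needs s_active */
            pthread_mutex_unlock(&s_active);
            pthread_mutex_unlock(&ctl);
            return NULL;
    }

    /* Path 2: write to .../refresh -- s_active first, then ctl. */
    static void *refresh_path(void *arg)
    {
            pthread_mutex_lock(&s_active);  /* sysfs_write_file() holds it */
            usleep(1000);
            pthread_mutex_lock(&ctl);       /* rbd_dev_refresh() takes ctl_mutex */
            pthread_mutex_unlock(&ctl);
            pthread_mutex_unlock(&s_active);
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, remove_path, NULL);
            pthread_create(&b, NULL, refresh_path, NULL);
            pthread_join(a, NULL);          /* may never return */
            pthread_join(b, NULL);
            puts("no deadlock this run");

            return 0;
    }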

The "rbd/kernel.sh" workunit uses the "refresh" sysfs file,
and that is probably the test that's leading to this report.
In addition (and probably worse), an osd event due to a watch
on an rbd header object will also cause rbd_dev_refresh()
to be called, so we have the same problem but due to an
event from the osd. This might cause an actual deadlock
if a device was getting unmapped right after deleting a
snapshot.

If this is right, it should be easy to reproduce this with a
lockdep-enabled kernel. Simply using the "refresh" entry for
a mapped rbd image is probably sufficient.

Thinking about a solution... Refreshing the image header
info is done under protection of the rbd device's header
semaphore. We don't want the device to go away while a
refresh is underway, and once a refresh has been requested
we need to respond to it.

It may be possible to leverage the mechanism we use to
mutually exclude unmaps of in-use mappings and opens of
being-deleted mappings. In doing so we could avoid using
ctl_mutex on the refresh path.

That is, a refresh would effectively show up as an open of
the device that ends when the refresh is complete.
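
A rough sketch of that idea, assuming the open/release bookkeeping
already in rbd.c (rbd_dev->lock, open_count, RBD_DEV_FLAG_REMOVING);
the refresh-side helpers below are hypothetical, modeled on
rbd_open() and rbd_release():

    /*
     * Hypothetical helpers, not actual rbd code: a refresh "opens" the
     * device for its duration so an unmap can't tear it down underneath
     * us, and no ctl_mutex is needed on the refresh path.
     */
    static int rbd_refresh_get(struct rbd_device *rbd_dev)
    {
            int ret = 0;

            spin_lock_irq(&rbd_dev->lock);
            if (test_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags))
                    ret = -ENOENT;          /* being unmapped; skip refresh */
            else
                    rbd_dev->open_count++;  /* holds off a concurrent unmap */
            spin_unlock_irq(&rbd_dev->lock);

            return ret;
    }

    static void rbd_refresh_put(struct rbd_device *rbd_dev)
    {
            spin_lock_irq(&rbd_dev->lock);
            rbd_dev->open_count--;
            spin_unlock_irq(&rbd_dev->lock);
    }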

#2 Updated by Ian Colle almost 11 years ago

  • Assignee set to Alex Elder

#3 Updated by Alex Elder almost 11 years ago

I have tried to reproduce this a bunch of times, both
manually (as I described, using the refresh sysfs file)
and using the "kernel.sh" workunit, and the lockdep
problem has not appeared.

I am seeing a problem at the end of running the
"kernel.sh" script (during cleanup). I'm not sure what
that's about; I'll look a little closer.

#4 Updated by Alex Elder almost 11 years ago

Oh, now I know what's happening. The "kernel.sh" script
was looking at the snapshot sysfs files, which are no
longer present. I'll fix that up...

#5 Updated by Alex Elder almost 11 years ago

I just committed the following change to the
rbd/kernel.sh workunit in the "master" branch
of the ceph git repository.

    rbd/kernel.sh: quit looking for snapshot sysfs entries

    The sysfs entries for snapshots went away a while ago, and this
    script used them to verify sizes matched what was expected.

    Instead, look at the mapped size of the snapshot in the places
    that used to look for the image's snapshot sysfs files.

    Also, switch over to using "udevadm settle" rather than a delay to
    wait for udev to do its thing.  Insert them at more appropriate
    places--right after "rbd map" commands and before and after the
    "rbd unmap" calls.

    Stop doing the manual refresh calls as well.  The osd will trigger
    refreshes whenever the image size or snapshot context changes.

    Finally, the cleanup routine is called initially, when there really
    isn't expected to be anything to clean up.  Change the rbd commands
    to run there conditionally, only if the target of the command
    already exists.

    Signed-off-by: Alex Elder <elder@inktank.com>

#6 Updated by Alex Elder almost 11 years ago

  • Status changed from New to In Progress

Since I've been unable to reproduce this problem with
current code, I'm going to try reproducing it using
code that preceded the removal of the sysfs code for
snapshots.

#7 Updated by Alex Elder almost 11 years ago

Well that was fun. I reproduced the problem immediately with:

rbd create image1 --size 1024
rbd snap create image1@snap1
rbd map image1@snap1
echo 1 > /sys/bus/rbd/devices/1/refresh

Now I'm going to move one patch forward in the history, to
the one that deletes the sysfs files for snapshots, and
will try again. If that doesn't reproduce quickly I'm
going to call this bug fixed.

#8 Updated by Alex Elder almost 11 years ago

  • Status changed from In Progress to Resolved

I have my answer. The problem does not show up
now that the snapshot sysfs files are gone.

I'm marking this bug resolved (maybe that's not
exactly right, but the problem has gone away).

#9 Updated by Alex Elder almost 11 years ago

  • Status changed from Resolved to In Progress

Well shit.

I unmapped my image and I got a lockdep error.
I'll look some more tomorrow.

[  372.012153] ======================================================
[  372.018437] [ INFO: possible circular locking dependency detected ]
[  372.024753] 3.9.0-ceph-00191-g3e83b65 #1 Not tainted
[  372.029765] -------------------------------------------------------
[  372.036082] rbd/2363 is trying to acquire lock:
[  372.040661]  (s_active#65){++++.+}, at: [<ffffffff811fdaab>] sysfs_addrm_finish+0x3b/0x70
[  372.049200] 
[  372.049200] but task is already holding lock:
[  372.055090]  (ctl_mutex/1){+.+.+.}, at: [<ffffffffa030f07e>] rbd_remove+0x5e/0x210 [rbd]
[  372.063491] 
[  372.063491] which lock already depends on the new lock.
[  372.063491] 
[  372.071739] 
[  372.071739] the existing dependency chain (in reverse order) is:
[  372.079287] 
-> #1 (ctl_mutex/1){+.+.+.}:
[  372.083704]        [<ffffffff810b8162>] lock_acquire+0xa2/0x140
[  372.089739]        [<ffffffff8166e73b>] mutex_lock_nested+0x4b/0x320
[  372.096194]        [<ffffffffa0311627>] rbd_dev_refresh+0x47/0x2a0 [rbd]
[  372.102994]        [<ffffffffa031189f>] rbd_image_refresh+0x1f/0x40 [rbd]
[  372.109890]        [<ffffffff81421af8>] dev_attr_store+0x18/0x30
[  372.116039]        [<ffffffff811fbd16>] sysfs_write_file+0xe6/0x170
[  372.122405]        [<ffffffff811868f3>] vfs_write+0xb3/0x180
[  372.128174]        [<ffffffff81186d95>] sys_write+0x55/0xa0
[  372.133837]        [<ffffffff8167b559>] system_call_fastpath+0x16/0x1b
[  372.140466] 
-> #0 (s_active#65){++++.+}:
[  372.144875]        [<ffffffff810b7b8e>] __lock_acquire+0x1d8e/0x1d90
[  372.151371]        [<ffffffff810b8162>] lock_acquire+0xa2/0x140
[  372.157381]        [<ffffffff811fcf96>] sysfs_deactivate+0x116/0x160
[  372.163827]        [<ffffffff811fdaab>] sysfs_addrm_finish+0x3b/0x70
[  372.170272]        [<ffffffff811fba5b>] sysfs_hash_and_remove+0x5b/0xb0
[  372.177011]        [<ffffffff811ff951>] remove_files.isra.1+0x31/0x40
[  372.183551]        [<ffffffff811ffdcd>] sysfs_remove_group+0x4d/0x100
[  372.190135]        [<ffffffff8142245b>] device_remove_groups+0x3b/0x60
[  372.196805]        [<ffffffff81422794>] device_remove_attrs+0x44/0x80
[  372.203352]        [<ffffffff81422edd>] device_del+0x11d/0x1b0
[  372.209285]        [<ffffffff81422f92>] device_unregister+0x22/0x60
[  372.215673]        [<ffffffffa030f1fc>] rbd_remove+0x1dc/0x210 [rbd]
[  372.222124]        [<ffffffff81424677>] bus_attr_store+0x27/0x30
[  372.228229]        [<ffffffff811fbd16>] sysfs_write_file+0xe6/0x170
[  372.234588]        [<ffffffff811868f3>] vfs_write+0xb3/0x180
[  372.240363]        [<ffffffff81186d95>] sys_write+0x55/0xa0
[  372.246028]        [<ffffffff8167b559>] system_call_fastpath+0x16/0x1b
[  372.252657] 
[  372.252657] other info that might help us debug this:
[  372.252657] 
[  372.260756]  Possible unsafe locking scenario:
[  372.260756] 
[  372.266757]        CPU0                    CPU1
[  372.271332]        ----                    ----
[  372.275905]   lock(ctl_mutex/1);
[  372.279329]                                lock(s_active#65);
[  372.285279]                                lock(ctl_mutex/1);
[  372.291266]   lock(s_active#65);
[  372.294696] 
[  372.294696]  *** DEADLOCK ***
[  372.294696] 
[  372.300692] 3 locks held by rbd/2363:
[  372.304403]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff811fbc74>] sysfs_write_file+0x44/0x170
[  372.313496]  #1:  (s_active#67){.+.+.+}, at: [<ffffffff811fbcfd>] sysfs_write_file+0xcd/0x170
[  372.322426]  #2:  (ctl_mutex/1){+.+.+.}, at: [<ffffffffa030f07e>] rbd_remove+0x5e/0x210 [rbd]
[  372.331400] 
[  372.331400] stack backtrace:
[  372.335818] Pid: 2363, comm: rbd Not tainted 3.9.0-ceph-00191-g3e83b65 #1
[  372.342653] Call Trace:
[  372.345150]  [<ffffffff816680d5>] print_circular_bug+0x1fb/0x20c
[  372.351201]  [<ffffffff810b7b8e>] __lock_acquire+0x1d8e/0x1d90
[  372.357084]  [<ffffffff810b3fd3>] ? lockdep_init_map+0xa3/0x560
[  372.363051]  [<ffffffff811fdaab>] ? sysfs_addrm_finish+0x3b/0x70
[  372.369113]  [<ffffffff810b8162>] lock_acquire+0xa2/0x140
[  372.374560]  [<ffffffff811fdaab>] ? sysfs_addrm_finish+0x3b/0x70
[  372.380627]  [<ffffffff811fcf96>] sysfs_deactivate+0x116/0x160
[  372.386506]  [<ffffffff811fdaab>] ? sysfs_addrm_finish+0x3b/0x70
[  372.392562]  [<ffffffff811fdaab>] sysfs_addrm_finish+0x3b/0x70
[  372.398445]  [<ffffffff811fba5b>] sysfs_hash_and_remove+0x5b/0xb0
[  372.404585]  [<ffffffff811ff951>] remove_files.isra.1+0x31/0x40
[  372.410561]  [<ffffffff811ffdcd>] sysfs_remove_group+0x4d/0x100
[  372.416531]  [<ffffffff8142245b>] device_remove_groups+0x3b/0x60
[  372.422585]  [<ffffffff81422794>] device_remove_attrs+0x44/0x80
[  372.428562]  [<ffffffff81422edd>] device_del+0x11d/0x1b0
[  372.433923]  [<ffffffff81422f92>] device_unregister+0x22/0x60
[  372.439737]  [<ffffffffa030f1fc>] rbd_remove+0x1dc/0x210 [rbd]
[  372.445617]  [<ffffffff81424677>] bus_attr_store+0x27/0x30
[  372.451150]  [<ffffffff811fbd16>] sysfs_write_file+0xe6/0x170
[  372.456947]  [<ffffffff811868f3>] vfs_write+0xb3/0x180
[  372.462134]  [<ffffffff81186d95>] sys_write+0x55/0xa0
[  372.467236]  [<ffffffff8167b559>] system_call_fastpath+0x16/0x1b

#10 Updated by Alex Elder almost 11 years ago

This sequence reproduces the problem, even in the latest
version of the "testing" branch. (Not all of it may be
required, but it is enough.)

Namely:

rbd create image1 --size 1024
rbd snap create image1@snap1
rbd map image1@snap1
echo 1 > /sys/bus/rbd/devices/1/refresh
rbd unmap /dev/rbd/rbd/image1@snap1

The lockdep splat happens on unmap.

#11 Updated by Alex Elder almost 11 years ago

  • Status changed from In Progress to Fix Under Review

I found that avoiding taking ctl_mutex when getting or
putting device references got rid of the problem. I went
further than that, though, and cleaned up (actually, mostly
eliminated) the use of the control mutex.

These patches have been posted for review:

[PATCH 1/5] rbd: set removing flag while holding list lock
[PATCH 2/5] rbd: protect against concurrent unmaps
[PATCH 3/5] rbd: don't hold ctl_mutex to get/put device
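
Going by the patch titles alone, the first change presumably looks
something like this sketch: decide removability under a spinlock
instead of ctl_mutex (rbd_dev_list_lock and the field names are taken
from rbd.c, but the actual commits may differ):

    /*
     * Sketch based on the patch titles, not the actual commits: mark
     * the device as being removed while holding the device list lock,
     * refusing if it is still open (e.g. by an in-flight refresh).
     */
    static int rbd_dev_mark_removing(struct rbd_device *rbd_dev)
    {
            int ret = 0;

            spin_lock(&rbd_dev_list_lock);
            if (rbd_dev->open_count)
                    ret = -EBUSY;   /* mapped and in use */
            else if (test_and_set_bit(RBD_DEV_FLAG_REMOVING,
                                      &rbd_dev->flags))
                    ret = -EBUSY;   /* a concurrent unmap got here first */
            spin_unlock(&rbd_dev_list_lock);

            return ret;             /* 0 means safe to tear down sysfs */
    }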

#12 Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved
