Bug #2242
rbd: spinlock on wrong cpu
Status: Closed
% Done: 0%
Description
2012-04-04T01:17:25.100598-07:00 plana34 kernel: [ 9681.094759] BUG: spinlock wrong CPU on CPU#3, rbd/27814
2012-04-04T01:17:25.100614-07:00 plana34 kernel: [ 9681.100064] lock: ffffffffa031e900, .magic: dead4ead, .owner: rbd/27814, .owner_cpu: 5
2012-04-04T01:17:25.115582-07:00 plana34 kernel: [ 9681.108087] Pid: 27814, comm: rbd Not tainted 3.3.0-ceph-00066-g02615af #1
2012-04-04T01:17:25.115594-07:00 plana34 kernel: [ 9681.115026] Call Trace:
2012-04-04T01:17:25.123173-07:00 plana34 kernel: [ 9681.117488] [<ffffffff81323c28>] spin_dump+0x78/0xc0
2012-04-04T01:17:25.123185-07:00 plana34 kernel: [ 9681.122604] [<ffffffff81323c9b>] spin_bug+0x2b/0x40
2012-04-04T01:17:25.134065-07:00 plana34 kernel: [ 9681.127576] [<ffffffff81323d38>] do_raw_spin_unlock+0x88/0xb0
2012-04-04T01:17:25.134078-07:00 plana34 kernel: [ 9681.133477] [<ffffffff81615e6b>] _raw_spin_unlock+0x2b/0x40
2012-04-04T01:17:25.139794-07:00 plana34 kernel: [ 9681.139205] [<ffffffffa031ab92>] rbd_put_client+0x42/0x60 [rbd]
2012-04-04T01:17:25.152088-07:00 plana34 kernel: [ 9681.145220] [<ffffffffa031bb36>] rbd_dev_release+0xe6/0x170 [rbd]
2012-04-04T01:17:25.152101-07:00 plana34 kernel: [ 9681.151468] [<ffffffff813e95d7>] device_release+0x27/0xa0
2012-04-04T01:17:25.163323-07:00 plana34 kernel: [ 9681.156961] [<ffffffff81312ffd>] kobject_release+0x8d/0x1d0
2012-04-04T01:17:25.163336-07:00 plana34 kernel: [ 9681.162683] [<ffffffff81312e7c>] kobject_put+0x2c/0x60
2012-04-04T01:17:25.173775-07:00 plana34 kernel: [ 9681.167917] [<ffffffff813e9197>] put_device+0x17/0x20
2012-04-04T01:17:25.173788-07:00 plana34 kernel: [ 9681.173117] [<ffffffff813ea1ea>] device_unregister+0x2a/0x60
2012-04-04T01:17:25.179586-07:00 plana34 kernel: [ 9681.178928] [<ffffffffa031a22b>] rbd_remove+0x13b/0x170 [rbd]
2012-04-04T01:17:25.191007-07:00 plana34 kernel: [ 9681.184768] [<ffffffff813eb507>] bus_attr_store+0x27/0x30
2012-04-04T01:17:25.191020-07:00 plana34 kernel: [ 9681.190321] [<ffffffff811e9be6>] sysfs_write_file+0xe6/0x170
2012-04-04T01:17:25.201985-07:00 plana34 kernel: [ 9681.196077] [<ffffffff8117b0d8>] vfs_write+0xc8/0x190
2012-04-04T01:17:25.201998-07:00 plana34 kernel: [ 9681.201281] [<ffffffff8117b291>] sys_write+0x51/0x90
2012-04-04T01:17:25.213132-07:00 plana34 kernel: [ 9681.206342] [<ffffffff8161e1a9>] system_call_fastpath+0x16/0x1b
ubuntu@teuthology:/a/nightly_coverage_2012-04-04-a/4363
Updated by Alex Elder about 12 years ago
OK, I think this problem arises because of the switch to a spinlock to
protect the client list. Doing so was the right idea in principle; however,
rbd_client_release() calls ceph_destroy_client(), which calls ceph_msgr_flush(),
which calls flush_workqueue(), which can sleep. We must not be holding a
spinlock in that case.
I think the fix is to move the spinlock deeper, within the rbd_client_release()
call, and make it surround just the list deletion where it's really needed.
This conclusion was reached after a pretty quick look so I plan to look a bit
more closely later.
Updated by Alex Elder about 12 years ago
- Status changed from New to Resolved
- Assignee set to Alex Elder
This was fixed a couple of weeks ago, and the result has been committed
both to the testing and master branches of the ceph-client tree. It should
also go to Linus in the next pull request (for 3.4).
commit cd9d9f5df6098c50726200d4185e9e8da32785b3
Author: Alex Elder <elder@dreamhost.com>
Date: Wed Apr 4 13:35:44 2012 -0500
rbd: don't hold spinlock during messenger flush