Bug #16630
rbd map vs rbd_watch_errcb deadlock
Status: Closed
Description
On Fri, Jul 8, 2016 at 3:28 AM, Patrick McLean <patrickm@gaikai.com> wrote:
> This is on linus git master as of 2016/07/01
>
> [...]
>
> # grep -C 40 rbd /proc/*/stack
> /proc/14109/stack-[<ffffffff8111c7c4>] __queue_work+0x144/0x420
> /proc/14109/stack-[<ffffffff8112c1c0>] default_wake_function+0x0/0x10
> /proc/14109/stack:[<ffffffffa01561ee>] rbd_obj_method_sync.constprop.48+0x1be/0x290 [rbd]
> /proc/14109/stack:[<ffffffffa0156aea>] rbd_dev_header_info+0x15a/0x930 [rbd]
> /proc/14109/stack:[<ffffffffa0157450>] rbd_watch_cb+0x0/0xa0 [rbd]
> /proc/14109/stack:[<ffffffffa0157586>] rbd_dev_image_probe.part.42+0x96/0x910 [rbd]
> /proc/14109/stack:[<ffffffffa015640e>] rbd_dev_image_id+0x14e/0x1b0 [rbd]
> /proc/14109/stack:[<ffffffffa015828f>] do_rbd_add.isra.43+0x48f/0xbb0 [rbd]
> /proc/14109/stack-[<ffffffff8123ba27>] __kmalloc+0x27/0x170
> /proc/14109/stack-[<ffffffff812b7f3a>] kernfs_fop_write+0x10a/0x190
> /proc/14109/stack-[<ffffffff8124dc63>] __vfs_write+0x23/0x120
> /proc/14109/stack-[<ffffffff8124e8f3>] vfs_write+0xb3/0x1a0
> /proc/14109/stack-[<ffffffff8124fbd2>] SyS_write+0x42/0xa0
> /proc/14109/stack-[<ffffffff81816f72>] entry_SYSCALL_64_fastpath+0x1a/0xa4
> /proc/14109/stack-[<ffffffffffffffff>] 0xffffffffffffffff
> --
> /proc/29744/stack-[<ffffffff813c7c63>] call_rwsem_down_write_failed+0x13/0x20
> /proc/29744/stack:[<ffffffffa01572dd>] rbd_dev_refresh+0x1d/0xf0 [rbd]
> /proc/29744/stack:[<ffffffffa0157413>] rbd_watch_errcb+0x33/0x70 [rbd]
> /proc/29744/stack-[<ffffffffa0126a2e>] do_watch_error+0x2e/0x40 [libceph]
> /proc/29744/stack-[<ffffffff8111d935>] process_one_work+0x145/0x3c0
> /proc/29744/stack-[<ffffffff8111dbfa>] worker_thread+0x4a/0x470
> /proc/29744/stack-[<ffffffff8111dbb0>] worker_thread+0x0/0x470
> /proc/29744/stack-[<ffffffff81122e4d>] kthread+0xbd/0xe0
> /proc/29744/stack-[<ffffffff8181717f>] ret_from_fork+0x1f/0x40
> /proc/29744/stack-[<ffffffff81122d90>] kthread+0x0/0xe0
> /proc/29744/stack-[<ffffffffffffffff>] 0xffffffffffffffff
> --
> /proc/3426/stack-[<ffffffff8115ea31>] try_to_del_timer_sync+0x41/0x60
> /proc/3426/stack-[<ffffffff8115ea94>] del_timer_sync+0x44/0x50
> /proc/3426/stack-[<ffffffff8112c1c0>] default_wake_function+0x0/0x10
> /proc/3426/stack-[<ffffffff8111b35f>] flush_workqueue+0x12f/0x540
> /proc/3426/stack:[<ffffffffa015376b>] do_rbd_remove.isra.25+0xfb/0x190 [rbd]
> /proc/3426/stack-[<ffffffff812b7f3a>] kernfs_fop_write+0x10a/0x190
> /proc/3426/stack-[<ffffffff8124dc63>] __vfs_write+0x23/0x120
> /proc/3426/stack-[<ffffffff8124e8f3>] vfs_write+0xb3/0x1a0
> /proc/3426/stack-[<ffffffff8124fbd2>] SyS_write+0x42/0xa0
> /proc/3426/stack-[<ffffffff81816f72>] entry_SYSCALL_64_fastpath+0x1a/0xa4
> /proc/3426/stack-[<ffffffffffffffff>] 0xffffffffffffffff
>
> # ps aux | egrep '(14109|29744|3426)'
> root      3426  0.0  0.0 181256 10488 ?  Dl  Jul06  0:00 rbd unmap /dev/rbd0
> root     14109  0.0  0.0 246704 10228 ?  Sl  Jul03  0:01 rbd map --pool XXXX XXXXXXX
> root     29744  0.0  0.0      0     0 ?  D   Jul05  0:00 [kworker/u16:2]
Updated by Ilya Dryomov almost 8 years ago
- Status changed from New to Need More Info
ceph-devel thread: http://www.spinics.net/lists/ceph-devel/msg31257.html
On Wed, Jul 13, 2016 at 9:38 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Wed, Jul 13, 2016 at 2:38 AM, Patrick McLean <patrickm@gaikai.com> wrote:
>> On Sat, Jul 9, 2016 at 7:30 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
>>>
>>> [...]
>>>
>>> Can you set "debug ms = 1" on all OSDs and try to reproduce?
>>> I'd also like to see the content of /sys/kernel/debug/ceph/*/osdc when
>>> it happens again.
>>
>> Here is the contents of the osdc debug file on a machine that had been
>> in that state for 3 days.
>>
>> # cat /sys/kernel/debug/ceph/*/osdc
>> REQUESTS 1 homeless 0
>> 7 osd38 4.27a8d388 [38,16,47]/38 [38,16,47]/38 rbd_header.bea4a16ebd6a9a 0x400011 1 0'0 call
>> LINGER REQUESTS
>> 1 osd38 4.27a8d388 [38,16,47]/38 [38,16,47]/38 rbd_header.bea4a16ebd6a9a 0x24 0 WC/0
>
> OK, so it's waiting for a reply. Do you have a debug ms = 1 log for
> osd38?
Updated by Victor Payno almost 8 years ago
- File dmesg-T.txt dmesg-T.txt added
- File dmesg-T-sysrq_t.txt dmesg-T-sysrq_t.txt added
- File dmesg-T-sysrq_w.txt dmesg-T-sysrq_w.txt added
- cat /sys/kernel/debug/ceph/*/osdc
REQUESTS 1 homeless 0
322452 osd2 37.a819967 [2]/2 [2]/2 rbd_header.d960431b30e2f 0x400019 3 0'0 call
LINGER REQUESTS
167 osd2 37.a819967 [2]/2 [2]/2 rbd_header.d960431b30e2f 0x24 2 WC/0
Updated by Victor Payno almost 8 years ago
- File ceph-osd.2.log.trimed.txt-aa ceph-osd.2.log.trimed.txt-aa added
- File ceph-osd.2.log.trimed.txt-ab ceph-osd.2.log.trimed.txt-ab added
- File ceph-osd.2.log.trimed.txt-ac ceph-osd.2.log.trimed.txt-ac added
- File ceph-osd.2.log.trimed.txt-ad ceph-osd.2.log.trimed.txt-ad added
- File ceph-osd.2.log.trimed.txt-ae ceph-osd.2.log.trimed.txt-ae added
- File ceph-osd.2.log.trimed.txt-af ceph-osd.2.log.trimed.txt-af added
- File ceph-osd.2.log.trimed.txt-ag ceph-osd.2.log.trimed.txt-ag added
- File ceph-osd.2.log.trimed.txt-ah ceph-osd.2.log.trimed.txt-ah added
- File ceph-osd.2.log.trimed.txt-ai ceph-osd.2.log.trimed.txt-ai added
Updated by Victor Payno almost 8 years ago
- File ceph-osd.2.log.trimed.txt-aj ceph-osd.2.log.trimed.txt-aj added
- File ceph-osd.2.log.trimed.txt-ak ceph-osd.2.log.trimed.txt-ak added
- File ceph-osd.2.log.trimed.txt-al ceph-osd.2.log.trimed.txt-al added
- File ceph-osd.2.log.trimed.txt-am ceph-osd.2.log.trimed.txt-am added
- File ceph-osd.2.log.trimed.txt-an ceph-osd.2.log.trimed.txt-an added
- File ceph-osd.2.log.trimed.txt-ao ceph-osd.2.log.trimed.txt-ao added
- File ceph-osd.2.log.trimed.txt-ap ceph-osd.2.log.trimed.txt-ap added
- File ceph-osd.2.log.trimed.txt-aq ceph-osd.2.log.trimed.txt-aq added
- File ceph-osd.2.log.trimed.txt-ar ceph-osd.2.log.trimed.txt-ar added
- File ceph-osd.2.log.trimed.txt-as ceph-osd.2.log.trimed.txt-as added
Updated by Victor Payno almost 8 years ago
- File ceph-osd.2.log.trimed.txt-at ceph-osd.2.log.trimed.txt-at added
- File ceph-osd.2.log.trimed.txt-au ceph-osd.2.log.trimed.txt-au added
- File ceph-osd.2.log.trimed.txt-av ceph-osd.2.log.trimed.txt-av added
- File ceph-osd.2.log.trimed.txt-aw ceph-osd.2.log.trimed.txt-aw added
- File ceph-osd.2.log.trimed.txt-ax ceph-osd.2.log.trimed.txt-ax added
- File ceph-osd.2.log.trimed.txt-ay ceph-osd.2.log.trimed.txt-ay added
- File ceph-osd.2.log.trimed.txt-az ceph-osd.2.log.trimed.txt-az added
- File ceph-osd.2.log.trimed.txt-ba ceph-osd.2.log.trimed.txt-ba added
- File ceph-osd.2.log.trimed.txt-bb ceph-osd.2.log.trimed.txt-bb added
- File ceph-osd.2.log.trimed.txt-bc ceph-osd.2.log.trimed.txt-bc added
Updated by Victor Payno almost 8 years ago
- File ceph-osd.2.log.trimed.txt-bd ceph-osd.2.log.trimed.txt-bd added
- File ceph-osd.2.log.trimed.txt-be ceph-osd.2.log.trimed.txt-be added
- File ceph-osd.2.log.trimed.txt-bf ceph-osd.2.log.trimed.txt-bf added
- File ceph-osd.2.log.trimed.txt-bg ceph-osd.2.log.trimed.txt-bg added
- File ceph-osd.2.log.trimed.txt-bh ceph-osd.2.log.trimed.txt-bh added
- File ceph-osd.2.log.trimed.txt-bi ceph-osd.2.log.trimed.txt-bi added
- File ceph-osd.2.log.trimed.txt-bj ceph-osd.2.log.trimed.txt-bj added
- File ceph-osd.2.log.trimed.txt-bk ceph-osd.2.log.trimed.txt-bk added
- File ceph-osd.2.log.trimed.txt-bl ceph-osd.2.log.trimed.txt-bl added
- File ceph-osd.2.log.trimed.txt-bm ceph-osd.2.log.trimed.txt-bm added
Updated by Victor Payno almost 8 years ago
- File ceph-osd.2.log.trimed.txt-bn ceph-osd.2.log.trimed.txt-bn added
- File ceph-osd.2.log.trimed.txt-bo ceph-osd.2.log.trimed.txt-bo added
Apologies for the 41 log files.
We were unable to upload the single file compressed with xz, gz or bzip2 (1.1MB, 1.9MB, 1.4MB).
The uncompressed file was also too large at 40MB.
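For reference, chunked attachments like the ones above can be produced and stitched back together with plain coreutils; a minimal sketch (the file name matches the attachments, but the sample contents and 1 MB chunk size are illustrative):

```shell
# Generate a stand-in log file (the real one was ~40MB uncompressed).
printf 'line-%d\n' $(seq 1 100000) > ceph-osd.2.log.trimed.txt

# Split into 1MB pieces named ...-aa, ...-ab, ... (split's default suffixes).
split -b 1M ceph-osd.2.log.trimed.txt ceph-osd.2.log.trimed.txt-

# Reassemble: shell glob expansion sorts the suffixes back into order.
cat ceph-osd.2.log.trimed.txt-* > reassembled.log
cmp ceph-osd.2.log.trimed.txt reassembled.log && echo OK
```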
Updated by Patrick McLean almost 8 years ago
I will also note here that enabling lockdep in the kernel on the client side makes the problem go away for some reason. No logs or reports from lockdep either.
Updated by Victor Payno almost 8 years ago
- File dmesg-full.txt-aa dmesg-full.txt-aa added
- File dmesg-full.txt-ab dmesg-full.txt-ab added
- File dmesg-full.txt-ac dmesg-full.txt-ac added
- File dmesg-full.txt-ad dmesg-full.txt-ad added
- File dmesg-full.txt-ae dmesg-full.txt-ae added
- File dmesg-full.txt-af dmesg-full.txt-af added
Same problem again. Couldn't upload a 5MB file. It reads "(Maximum size: 73.4 MB)" on the form.
Updated by Ilya Dryomov almost 8 years ago
Sorry about that - I've raised the upload issue internally. For now, just email the logs directly to me.
You've trimmed too much - I need the context for tid 322452, so I'm looking for "client.1246958" ":322452" or so. The easiest thing would be to compress and send the whole log from 2016-07-26 23:06:46 to 2016-07-27 00:56:59.
Also, just to validate the log - that image has hundreds of snapshots, right?
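For reference, the requested window can be cut out of the log before compressing it. A sketch using awk string comparison on the timestamp prefix; the timestamps come from the comment above, but the sample log lines are illustrative stand-ins for the real OSD log:

```shell
# Illustrative stand-in for the real ceph-osd.2.log.
printf '%s\n' \
  '2016-07-26 22:00:00.000000 entry before the window' \
  '2016-07-26 23:30:00.000000 entry inside the window' \
  '2016-07-27 01:00:00.000000 entry after the window' > ceph-osd.2.log

# Lexicographic comparison works because the timestamp prefix is fixed-width.
awk '($1 " " $2) >= "2016-07-26 23:06:46" && ($1 " " $2) <= "2016-07-27 00:56:59"' \
  ceph-osd.2.log > ceph-osd.2.log.window

gzip -f ceph-osd.2.log.window   # produces ceph-osd.2.log.window.gz
```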
Updated by Victor Payno over 7 years ago
I'll attach the wider log to an email.
Yes, it has 697 snapshots. The overall average snapshot count per rbd is 355; the lowest is 3 and the highest is 708. Because of our load testing, the number of snapshots could very well make it to 10k per rbd.
Interestingly enough, we can't map that rbd anymore.
$ rbd info test-rbd/3f370dbabff91bbb7ff23ae7a96e5cb414cac3408013cefed6d4b627b5eed9c7-willnotmap
rbd image '3f370dbabff91bbb7ff23ae7a96e5cb414cac3408013cefed6d4b627b5eed9c7-willnotmap':
size 9536 MB in 2385 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.11e601379f78e7
format: 2
features: layering, striping
flags:
stripe unit: 4096 kB
stripe count: 1
When we try to map it, it hangs here:
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
open("/sys/bus/rbd/add_single_major", O_WRONLY) = 4
write(4, "192.168.2.63:6789,192.168.3.63:6789,"..., 147
Other rbds map fine.
The cluster health is HEALTH_OK.
From Cliff:
Sifting through osd2's log for "d960431b30e2f", I see this:
2016-07-27 00:47:47.199252 7f1ac7dbb700 1 -- 10.9.228.101:6814/25892 <== client.1252427 10.9.72.23:0/273648291 13 ==== osd_op(client.1252427.0:19 rbd_header.d960431b30e2f [call lock.lock] 37.a819967 ondisk+write+known_if_redirected e107453) v6 ==== 210+0+51 (4053684278 0 659001106) 0x55c9ec362000 con 0x55c9f0101600
2016-07-27 00:47:47.199515 7f1ad3539700 1 -- 10.9.228.101:6814/25892 --> 10.9.72.23:0/273648291 -- osd_op_reply(19 rbd_header.d960431b30e2f [call lock.lock] v0'0 uv0 ondisk = -16 ((16) Device or resource busy)) v6 - ?+0 0x55c9ee4b8b00 con 0x55c9f0101600
Updated by Victor Payno over 7 years ago
Because of our load testing the number of snapshots could very well make it to 10k per rbd.
Updated by Ilya Dryomov over 7 years ago
That's exactly the problem.
For obscure implementation reasons, the kernel client currently supports only 510 snapshots per image. Mapping an image with more than that hangs
[ 4986.527680] libceph: get_reply osd2 tid 322452 data 5332 > preallocated 4092, skipping
because the reply message is simply dropped on the floor and "rbd map" needs to be Ctrl-C'ed. I guess you just didn't notice it among all those ext4 warnings.
Lifting this limit is on the TODO list, but having this many snapshots is going to affect your bandwidth, among other things - snapshot contexts are transmitted with every write.
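For reference, the numbers line up with the "preallocated 4092" log line. This is a back-of-envelope sketch only: the 8-byte snapshot id is standard, but the fixed 12-byte overhead is an assumption chosen to match the log, not a value taken from the krbd source.

```python
SNAP_ID_BYTES = 8     # snapshot ids are 64-bit
OVERHEAD = 12         # assumed fixed overhead in the reply (not from krbd source)
PREALLOCATED = 4092   # from "get_reply ... preallocated 4092, skipping"

def reply_size(num_snaps):
    """Rough size of a header reply carrying num_snaps snapshot ids."""
    return OVERHEAD + num_snaps * SNAP_ID_BYTES

max_snaps = (PREALLOCATED - OVERHEAD) // SNAP_ID_BYTES
print(max_snaps)         # 510 -- the mapping limit
print(reply_size(697))   # 5588 -- this image's reply overflows the buffer
```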
If you aren't just stress-testing and have a real-world use case, please share it.
Updated by Ilya Dryomov over 7 years ago
- Related to Feature #12874: krbd: get rid of RBD_MAX_SNAP_COUNT added
Updated by Victor Payno over 7 years ago
Thanks, I did see those messages, but with so much noise from all the activity it was hard to spot them as the problem, especially since they look harmless.
For our use case we're going to limit it to 300ish snapshots for now, and we may end up going down to 50ish snapshots (which we do need for various reasons) after we do some benchmarking.
As for the ext4 errors, we have code to fsck the file systems as part of the mounting process; it's just disabled so we could see what kinds of corruption occur. Most of them seem to come from all the hangs (which forced us to reset systems to unmap the devices) damaging the file systems.
Updated by Ilya Dryomov over 7 years ago
- Status changed from Need More Info to Closed