Bug #20653
bluestore: aios don't complete on very large writes on xenial
Description
cpach:ceph-deploy (master) 03:06 PM $ dpkg -l linux-image-generic
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
ii  linux-image-ge 4.4.0.83.89  amd64        Generic Linux kernel image
cpach:ceph-deploy (master) 03:07 PM $ uname -a
Linux cpach 4.4.0-65-generic #86-Ubuntu SMP Thu Feb 23 17:49:58 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
cpach:ceph-deploy (master) 03:07 PM $ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.2 LTS
Release:        16.04
Codename:       xenial
40x ~12MB osdmap writes, resulting in ~9000 (?) aios (at the 64 KiB per aio seen in the log below, 40 x 12 MB works out to roughly 7,700, so the order of magnitude fits), but we stop draining them about halfway through.
(worked around by reducing osd_map_message_max to 5.)
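For reference, the workaround is a one-line config change. osd_map_message_max caps how many osdmaps get bundled into a single MOSDMap message (the default in this era was 40, which is where the 40 map writes above come from), so lowering it shrinks each resulting bluestore write batch from roughly 7,700 aios to under 1,000:

    [global]
    # send at most 5 osdmaps per MOSDMap message instead of the default 40,
    # so each resulting bluestore write queues far fewer concurrent aios
    osd_map_message_max = 5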
History
#1 Updated by Sage Weil over 6 years ago
2017-07-16 14:57:18.653415 7f846a967700 10 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_thread finished aio 0x55defc6c8810 r 65536 ioc 0x55dedbc2ee00 with 4183 aios left
2017-07-16 14:57:18.653417 7f846a967700 20 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_log_finish 1 0x6827bda000~10000
2017-07-16 14:57:18.653419 7f846a967700 10 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_thread finished aio 0x55defc6c8950 r 65536 ioc 0x55dedbc2ee00 with 4182 aios left
2017-07-16 14:57:18.653421 7f846a967700 20 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_log_finish 1 0x6827bea000~10000
2017-07-16 14:57:18.653423 7f846a967700 10 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_thread finished aio 0x55defc6c8a90 r 65536 ioc 0x55dedbc2ee00 with 4181 aios left
2017-07-16 14:57:18.653425 7f846a967700 20 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_log_finish 1 0x6827bfa000~10000
2017-07-16 14:57:18.653427 7f846a967700 10 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_thread finished aio 0x55defc6c8bd0 r 65536 ioc 0x55dedbc2ee00 with 4180 aios left
2017-07-16 14:57:18.653429 7f846a967700 20 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_log_finish 1 0x6827c0a000~10000
2017-07-16 14:57:18.653430 7f846a967700 10 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_thread finished aio 0x55defc6c8d10 r 65536 ioc 0x55dedbc2ee00 with 4179 aios left
2017-07-16 14:57:18.653432 7f846a967700 20 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_log_finish 1 0x6827c1a000~10000
2017-07-16 14:57:18.653434 7f846a967700 10 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_thread finished aio 0x55defc6c8e50 r 65536 ioc 0x55dedbc2ee00 with 4178 aios left
2017-07-16 14:57:18.653436 7f846a967700 20 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_log_finish 1 0x6827c2a000~10000
2017-07-16 14:57:18.653438 7f846a967700 10 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_thread finished aio 0x55defc6c8f90 r 65536 ioc 0x55dedbc2ee00 with 4177 aios left
2017-07-16 14:57:18.653440 7f846a967700 20 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_log_finish 1 0x6827c3a000~10000
2017-07-16 14:57:18.653442 7f846a967700 10 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_thread finished aio 0x55defc6c90d0 r 65536 ioc 0x55dedbc2ee00 with 4176 aios left
2017-07-16 14:57:18.653443 7f846a967700 20 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_log_finish 1 0x6827c4a000~10000
2017-07-16 14:57:18.653445 7f846a967700 10 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_thread finished aio 0x55defc6c9210 r 65536 ioc 0x55dedbc2ee00 with 4175 aios left
2017-07-16 14:57:18.653447 7f846a967700 20 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_log_finish 1 0x6827c5a000~10000
2017-07-16 14:57:18.653448 7f846a967700 10 bdev(0x55dec129a900 /var/lib/ceph/osd/ceph-11/block) _aio_thread finished aio 0x55defc6c9350 r 65536 ioc 0x55dedbc2ee00 with 4174 aios left
then nothing from that thread.
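To make the log above easier to read: _aio_thread is bluestore's reaper thread, which blocks in io_getevents(2), and each "finished aio ... with N aios left" line is one completion being counted down. Below is a minimal, self-contained libaio sketch of that submit/reap pattern; it is not the actual KernelDevice code, and the file name, queue depth, and batch size are invented for illustration (build with gcc -laio, run on a filesystem that supports O_DIRECT):

    /* build: gcc -O2 -o aio_sketch aio_sketch.c -laio */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    enum { NAIO = 64, BLK = 65536 };   /* 64 aios of 64 KiB, like "r 65536" above */

    int main(void)
    {
        /* O_DIRECT so the kernel really queues aios instead of the page cache */
        int fd = open("aio-sketch.bin", O_RDWR | O_CREAT | O_DIRECT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        io_context_t ctx = 0;
        if (io_setup(NAIO, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, BLK)) return 1;   /* O_DIRECT needs alignment */
        memset(buf, 0x5a, BLK);

        struct iocb iocbs[NAIO], *ptrs[NAIO];
        for (int i = 0; i < NAIO; i++) {
            io_prep_pwrite(&iocbs[i], fd, buf, BLK, (long long)i * BLK);
            ptrs[i] = &iocbs[i];
        }

        /* submit the whole batch at once, analogous to one big bluestore write */
        int inflight = io_submit(ctx, NAIO, ptrs);
        if (inflight < 0) { fprintf(stderr, "io_submit: %d\n", inflight); return 1; }

        /* the reap loop: block in io_getevents and count completions down,
         * which is what produces the "with N aios left" lines above */
        struct io_event evs[16];
        while (inflight > 0) {
            int n = io_getevents(ctx, 1, 16, evs, NULL);
            if (n < 0) { fprintf(stderr, "io_getevents: %d\n", n); return 1; }
            for (int i = 0; i < n; i++)
                printf("finished aio r %ld with %d aios left\n",
                       (long)evs[i].res, --inflight);
        }

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }

The failure mode reported here corresponds to that loop going quiet with ~4174 aios still outstanding: io_getevents simply never returns the remainder of a very large batch.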
#2 Updated by Josh Durgin over 6 years ago
- Priority changed from Urgent to Immediate
Seeing the same thing in many jobs in these runs, but not just on xenial. The first one I looked at was trusty - osd.0 log saved at:
A bunch of bluestore jobs hit this in these runs:
http://pulpito.ceph.com/joshd-2017-07-27_01:19:46-rados-wip-dup-ops-debug-distro-basic-smithi/
http://pulpito.ceph.com/joshd-2017-07-26_22:34:59-rados-wip-dup-ops-debug-distro-basic-smithi/
#3 Updated by Josh Durgin over 6 years ago
This may be a different bug, but it appears to be bluestore causing a rados aio test to time out (with full logs saved):
http://pulpito.ceph.com/joshd-2017-07-27_01:19:46-rados-wip-dup-ops-debug-distro-basic-smithi/1449564/
http://pulpito.ceph.com/joshd-2017-07-27_01:19:46-rados-wip-dup-ops-debug-distro-basic-smithi/1449648/
#4 Updated by Sage Weil over 6 years ago
- Priority changed from Immediate to Urgent
Those last two failures are due to #20771, fixed by dfab9d9b5d75d0f87053b1a3727f62da72af6c91.
I haven't been able to reproduce the large hangs with the ceph_test_objectstore tool on my xenial box (the same one where I saw the original hang). :(
#5 Updated by Sage Weil over 6 years ago
- Status changed from 12 to Need More Info
#6 Updated by Sage Weil over 6 years ago
- Priority changed from Urgent to High
#7 Updated by Sage Weil over 6 years ago
- Status changed from Need More Info to Can't reproduce
I'm going to assume this was #21171.