Bug #20379
[closed] bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0))
Description
There's already a bug (with lots of dups) that seems to match what I'm seeing in a vstart.sh cluster: http://tracker.ceph.com/issues/19511. Since that bug is already closed, I've decided to open a new one.
The recipe is simple:
1. Start a cluster with -b (I'm using -b -X -n --mon_num 1 --osd_num 3 --mds_num 1).
2. Start a client (I'm using an SP3 kernel), mount the cephfs, and run fio with a very simple script:
[random-writers]
rw=randrw
size=32m
numjobs=8
Running this script a few times eventually kills the OSDs. The client kernel starts logging the messages below, after which the cluster status changes to HEALTH_WARN:
[ 74.536976] libceph: osd1 192.168.155.1:6804 socket closed (con state OPEN)
[ 74.538087] libceph: osd1 192.168.155.1:6804 socket error on write
[ 74.567434] libceph: osd2 192.168.155.1:6808 socket closed (con state OPEN)
[ 74.568229] libceph: osd2 192.168.155.1:6808 socket error on write
[ 74.907989] libceph: osd1 down
[ 74.908322] libceph: osd2 down
[ 82.912261] libceph: osd0 192.168.155.1:6800 socket closed (con state OPEN)
[ 82.914071] libceph: osd0 192.168.155.1:6800 socket closed (con state CONNECTING)
[ 84.037899] libceph: osd0 192.168.155.1:6800 socket error on write
[ 85.033905] libceph: osd0 192.168.155.1:6800 socket error on write
[ 87.037925] libceph: osd0 192.168.155.1:6800 socket error on write
[ 91.045943] libceph: osd0 192.168.155.1:6800 socket closed (con state CONNECTING)
[ 99.045865] libceph: osd0 192.168.155.1:6800 socket error on write
[ 115.077906] libceph: osd0 192.168.155.1:6800 socket error on write
[ 147.141919] libceph: osd0 192.168.155.1:6800 socket error on write
Looking at the (dead) OSD logs, I see:
-2> 2017-06-21 11:03:31.509411 7f531a1fd700 -1 bdev(0x558c1d4dcb40 /home/miguel/dev/ceph/ceph/build/dev/osd0/block) aio_submit retries 16
-1> 2017-06-21 11:03:31.509435 7f531a1fd700 -1 bdev(0x558c1d4dcb40 /home/miguel/dev/ceph/ceph/build/dev/osd0/block) aio submit got (11) Resource temporarily unavailable
0> 2017-06-21 11:03:31.512526 7f531a1fd700 -1 /home/miguel/dev/ceph/ceph/src/os/bluestore/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7f531a1fd700 time 2017-06-21 11:03:31.509457
/home/miguel/dev/ceph/ceph/src/os/bluestore/KernelDevice.cc: 529: FAILED assert(r == 0)
ceph version 12.0.3-1919-g782b63ae9c (782b63ae9c1eba1d0eb61a1bed1a8874329944ca) luminous (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x558c13f0f6a5]
2: (KernelDevice::aio_submit(IOContext*)+0xb10) [0x558c13eaae30]
3: (BlueStore::_deferred_submit(BlueStore::OpSequencer*)+0x713) [0x558c13d6da03]
4: (BlueStore::_deferred_try_submit()+0x1c6) [0x558c13d6e356]
5: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x9c7) [0x558c13d82df7]
6: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0xba) [0x558c13d9366a]
7: (BlueStore::_kv_finalize_thread()+0xa0c) [0x558c13d951ec]
8: (BlueStore::KVFinalizeThread::entry()+0xd) [0x558c13dea66d]
9: (()+0x74e7) [0x7f532a78d4e7]
10: (clone()+0x3f) [0x7f5329800a2f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
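This is with the current master branch.
My understanding is that the IO queue is simply being pushed too hard: according to the OSD log above, aio_submit still gets (11) Resource temporarily unavailable after 16 retries, and the assert then fires. The solution probably requires some sort of throttling mechanism.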
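To illustrate the kind of throttling meant here, below is a minimal Python sketch. It is not Ceph code and not the eventual fix. FakeAioQueue, ThrottledSubmitter, run_demo and the queue depth are all hypothetical names and values invented for this illustration. The idea is to cap the number of in-flight I/Os so that a submitter defers work instead of hitting EAGAIN on a full queue.

```python
import errno
import threading


class FakeAioQueue:
    """Hypothetical stand-in for an AIO context with a fixed queue depth."""

    def __init__(self, depth):
        self.depth = depth
        self.in_flight = 0

    def submit(self):
        # Mimic a submission failing with EAGAIN when the queue is full.
        if self.in_flight >= self.depth:
            raise BlockingIOError(errno.EAGAIN, "Resource temporarily unavailable")
        self.in_flight += 1

    def complete(self):
        # Mimic reaping one completion, which frees a queue slot.
        assert self.in_flight > 0
        self.in_flight -= 1


class ThrottledSubmitter:
    """Caps in-flight I/Os so that submit() never reaches a full queue."""

    def __init__(self, queue, max_in_flight):
        self.queue = queue
        self.slots = threading.BoundedSemaphore(max_in_flight)

    def try_submit(self):
        # Non-blocking acquire: when no slot is free, the caller defers the
        # I/O and retries after a completion instead of asserting.
        if not self.slots.acquire(blocking=False):
            return False
        try:
            self.queue.submit()
        except BlockingIOError:
            self.slots.release()
            return False
        return True

    def on_complete(self):
        # Free the queue slot first, then hand the throttle slot back.
        self.queue.complete()
        self.slots.release()


def run_demo(depth=4, requests=10):
    """Submit `requests` I/Os with no completions; count submitted vs deferred."""
    queue = FakeAioQueue(depth)
    throttle = ThrottledSubmitter(queue, max_in_flight=depth)
    submitted = deferred = 0
    for _ in range(requests):
        if throttle.try_submit():
            submitted += 1
        else:
            deferred += 1
    return submitted, deferred
```

In this sketch a deferred I/O would be retried once on_complete() has released a slot. Whether such a throttle can safely sit in the deferred-write path (where the submitting thread may also be the one that reaps completions) is a separate question that this sketch does not address.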
Updated by Nathan Cutler almost 7 years ago
- Has duplicate Bug #20381: bluestore: deferred aio submission can deadlock with completion added
Updated by Nathan Cutler almost 7 years ago
- Priority changed from Normal to Urgent
Looks like the integration tests are hitting this as well.
Updated by John Spray almost 7 years ago
- Subject changed from cephfs fio test kills bluestore vstart.sh cluster to bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0))
Updated the title to make it clear that this isn't specific to vstart.
Updated by John Spray almost 7 years ago
- Status changed from New to Duplicate
This ticket was opened first, but let's close it in favour of 20381 because that one has the integration test logs.
Updated by John Spray almost 7 years ago
- Has duplicate deleted (Bug #20381: bluestore: deferred aio submission can deadlock with completion)
Updated by John Spray almost 7 years ago
- Is duplicate of Bug #20381: bluestore: deferred aio submission can deadlock with completion added