Bug #20379


bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0))

Added by Luis Henriques almost 7 years ago. Updated almost 7 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There's already a bug (with lots of duplicates) that seems to match what I'm seeing in a vstart.sh cluster. Since that bug is already closed (http://tracker.ceph.com/issues/19511), I've decided to open a new one.
The recipe is simple:
1. Start the cluster with -b (I'm using -b -X -n --mon_num 1 --osd_num 3 --mds_num 1)
2. Start a client (I'm using an SP3 kernel), mount the cephfs, and run fio with a very simple job file:

[random-writers]
rw=randrw
size=32m
numjobs=8

Running this job a few times will eventually kill the OSDs, changing the cluster status to HEALTH_WARN, at which point I start seeing kernel messages on the client:

[ 74.536976] libceph: osd1 192.168.155.1:6804 socket closed (con state OPEN)
[ 74.538087] libceph: osd1 192.168.155.1:6804 socket error on write
[ 74.567434] libceph: osd2 192.168.155.1:6808 socket closed (con state OPEN)
[ 74.568229] libceph: osd2 192.168.155.1:6808 socket error on write
[ 74.907989] libceph: osd1 down
[ 74.908322] libceph: osd2 down
[ 82.912261] libceph: osd0 192.168.155.1:6800 socket closed (con state OPEN)
[ 82.914071] libceph: osd0 192.168.155.1:6800 socket closed (con state CONNECTING)
[ 84.037899] libceph: osd0 192.168.155.1:6800 socket error on write
[ 85.033905] libceph: osd0 192.168.155.1:6800 socket error on write
[ 87.037925] libceph: osd0 192.168.155.1:6800 socket error on write
[ 91.045943] libceph: osd0 192.168.155.1:6800 socket closed (con state CONNECTING)
[ 99.045865] libceph: osd0 192.168.155.1:6800 socket error on write
[ 115.077906] libceph: osd0 192.168.155.1:6800 socket error on write
[ 147.141919] libceph: osd0 192.168.155.1:6800 socket error on write

Looking at the (dead) OSD logs, I see:

-2> 2017-06-21 11:03:31.509411 7f531a1fd700 -1 bdev(0x558c1d4dcb40 /home/miguel/dev/ceph/ceph/build/dev/osd0/block) aio_submit retries 16
-1> 2017-06-21 11:03:31.509435 7f531a1fd700 -1 bdev(0x558c1d4dcb40 /home/miguel/dev/ceph/ceph/build/dev/osd0/block) aio submit got (11) Resource temporarily unavailable
0> 2017-06-21 11:03:31.512526 7f531a1fd700 -1 /home/miguel/dev/ceph/ceph/src/os/bluestore/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7f531a1fd700 time 2017-06-21 11:03:31.509457
/home/miguel/dev/ceph/ceph/src/os/bluestore/KernelDevice.cc: 529: FAILED assert(r == 0)
ceph version 12.0.3-1919-g782b63ae9c (782b63ae9c1eba1d0eb61a1bed1a8874329944ca) luminous (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x558c13f0f6a5]
2: (KernelDevice::aio_submit(IOContext*)+0xb10) [0x558c13eaae30]
3: (BlueStore::_deferred_submit(BlueStore::OpSequencer*)+0x713) [0x558c13d6da03]
4: (BlueStore::_deferred_try_submit()+0x1c6) [0x558c13d6e356]
5: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x9c7) [0x558c13d82df7]
6: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0xba) [0x558c13d9366a]
7: (BlueStore::_kv_finalize_thread()+0xa0c) [0x558c13d951ec]
8: (BlueStore::KVFinalizeThread::entry()+0xd) [0x558c13dea66d]
9: (()+0x74e7) [0x7f532a78d4e7]
10: (clone()+0x3f) [0x7f5329800a2f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This is with the current master branch.

My understanding is that this is just the aio queue being pushed a bit too hard: io_submit() keeps returning EAGAIN (errno 11, "Resource temporarily unavailable") until the retries are exhausted, and the error then trips the assert. The solution probably requires some sort of throttling mechanism above the submission path.
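
For illustration, here's a minimal sketch of the failing pattern (hypothetical, simplified code, not the actual KernelDevice/aio implementation; it only assumes the libaio io_submit() call and the retry count visible in the log above):

#include <libaio.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>

// io_submit() returns -EAGAIN when the kernel aio context is full.
// Retry a fixed number of times, then give up and return the error.
static int submit_with_retries(io_context_t ctx, struct iocb *cb)
{
  int attempts = 16;  // matches the "aio_submit retries 16" log line
  while (true) {
    int r = io_submit(ctx, 1, &cb);
    if (r == 1)
      return 0;  // one iocb submitted successfully
    if (r == -EAGAIN && attempts-- > 0) {
      usleep(500);  // queue full: back off briefly and retry
      continue;
    }
    return r;  // still failing: propagate -EAGAIN to the caller
  }
}

static void aio_submit_like(io_context_t ctx, struct iocb *cb)
{
  int r = submit_with_retries(ctx, cb);
  // With no throttling above this layer, a persistently full queue
  // surfaces here as "FAILED assert(r == 0)".
  assert(r == 0);
}

If the queue stays full for longer than the retry window, the error reaches the assert instead of being absorbed by back-pressure, which is consistent with the backtrace above.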


Related issues: 1 (0 open, 1 closed)

Is duplicate of: RADOS - Bug #20381: bluestore: deferred aio submission can deadlock with completion (Resolved, Sage Weil, 06/22/2017)

#1

Updated by Nathan Cutler almost 7 years ago

  • Project changed from Ceph to RADOS
#2

Updated by Nathan Cutler almost 7 years ago

  • Has duplicate Bug #20381: bluestore: deferred aio submission can deadlock with completion added
#3

Updated by Nathan Cutler almost 7 years ago

  • Priority changed from Normal to Urgent

Looks like the integration tests are hitting this as well.

#4

Updated by John Spray almost 7 years ago

  • Subject changed from "cephfs fio test kills bluestore vstart.sh cluster" to "bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0))"

Updated title to make it clear that this isn't specific to vstart.

#5

Updated by John Spray almost 7 years ago

  • Status changed from New to Duplicate

This ticket was opened first, but let's close it in favour of 20381 because that one has the integration test logs.

#6

Updated by John Spray almost 7 years ago

  • Has duplicate deleted (Bug #20381: bluestore: deferred aio submission can deadlock with completion)
#7

Updated by John Spray almost 7 years ago

  • Is duplicate of Bug #20381: bluestore: deferred aio submission can deadlock with completion added