Bug #20381

closed

bluestore: deferred aio submission can deadlock with completion

Added by John Spray almost 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
BlueStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-06-21T19:57:57.268 INFO:tasks.ceph.osd.1.smithi161.stderr:2017-06-21 19:57:57.260862 7fe21bf73700 -1 bdev(0x7fe239a43e00 /var/lib/ceph/osd/ceph-1/block) aio_submit retries 10
2017-06-21T19:57:57.272 INFO:tasks.ceph.osd.1.smithi161.stderr:2017-06-21 19:57:57.261862 7fe21bf73700 -1 bdev(0x7fe239a43e00 /var/lib/ceph/osd/ceph-1/block) aio_submit retries 1
2017-06-21T19:57:57.310 INFO:tasks.ceph.osd.1.smithi161.stderr:2017-06-21 19:57:57.304575 7fe21bf73700 -1 bdev(0x7fe239a43e00 /var/lib/ceph/osd/ceph-1/block) aio_submit retries 8
2017-06-21T19:57:59.365 INFO:tasks.ceph.osd.1.smithi161.stderr:2017-06-21 19:57:59.359316 7fe21bf73700 -1 bdev(0x7fe239a43e00 /var/lib/ceph/osd/ceph-1/block) aio_submit retries 14
2017-06-21T19:57:59.598 INFO:tasks.ceph.osd.1.smithi161.stderr:2017-06-21 19:57:59.590665 7fe21bf73700 -1 bdev(0x7fe239a43e00 /var/lib/ceph/osd/ceph-1/block) aio_submit retries 8
2017-06-21T19:58:00.113 INFO:tasks.ceph.osd.1.smithi161.stderr:2017-06-21 19:58:00.103374 7fe21bf73700 -1 bdev(0x7fe239a43e00 /var/lib/ceph/osd/ceph-1/block) aio_submit retries 12
2017-06-21T19:58:03.961 INFO:tasks.ceph.osd.1.smithi161.stderr:2017-06-21 19:58:03.943611 7fe220634700 -1 bdev(0x7fe239a43e00 /var/lib/ceph/osd/ceph-1/block) aio_submit retries 7
2017-06-21T19:58:08.308 INFO:tasks.ceph.osd.1.smithi161.stderr:2017-06-21 19:58:08.297819 7fe21bf73700 -1 bdev(0x7fe239a43e00 /var/lib/ceph/osd/ceph-1/block) aio_submit retries 16
2017-06-21T19:58:08.308 INFO:tasks.ceph.osd.1.smithi161.stderr:2017-06-21 19:58:08.297822 7fe21bf73700 -1 bdev(0x7fe239a43e00 /var/lib/ceph/osd/ceph-1/block)  aio submit got (11) Resource temporarily unavailable
2017-06-21T19:58:09.211 INFO:tasks.ceph.osd.1.smithi161.stderr:/build/ceph-12.0.3-2007-g12a1512/src/os/bluestore/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7fe21bf73700 time 2017-06-21 19:58:09.205751
2017-06-21T19:58:09.211 INFO:tasks.ceph.osd.1.smithi161.stderr:/build/ceph-12.0.3-2007-g12a1512/src/os/bluestore/KernelDevice.cc: 529: FAILED assert(r == 0)

Assertion: /build/ceph-12.0.3-2007-g12a1512/src/os/bluestore/KernelDevice.cc: 529: FAILED assert(r == 0)
ceph version 12.0.3-2007-g12a1512 (12a15124517d574a84a552ee2354738a066f45e4) luminous (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x7fe22e6bffde]
 2: (KernelDevice::aio_submit(IOContext*)+0x5dd) [0x7fe22e6643dd]
 3: (BlueStore::_deferred_submit(BlueStore::OpSequencer*)+0x5d3) [0x7fe22e53f7b3]
 4: (BlueStore::_deferred_try_submit()+0x1cf) [0x7fe22e53ff8f]
 5: (BlueStore::_kv_finalize_thread()+0x815) [0x7fe22e569715]
 6: (BlueStore::KVFinalizeThread::entry()+0xd) [0x7fe22e5bd02d]
 7: (()+0x8184) [0x7fe22c1dc184]
 8: (clone()+0x6d) [0x7fe22b2cc37d]

http://pulpito.ceph.com/jspray-2017-06-21_17:52:06-fs-wip-jcsp-testing-20170621b-distro-basic-smithi/1312172

http://pulpito.ceph.com/jspray-2017-06-21_17:52:06-fs-wip-jcsp-testing-20170621b-distro-basic-smithi/1312322


Related issues 1 (0 open, 1 closed)

Has duplicate: RADOS - Bug #20379: bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0)) - Duplicate - 06/22/2017

Actions #1

Updated by Nathan Cutler almost 7 years ago

  • Is duplicate of Bug #20379: bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0)) added
Actions #2

Updated by Nathan Cutler almost 7 years ago

The backtrace looks exactly like the one in #20379 - duplicate?

Actions #3

Updated by John Spray almost 7 years ago

  • Status changed from New to Duplicate

This ticket was opened first, but let's close it in favour of 20381 because that one has the integration test logs.

Actions #4

Updated by John Spray almost 7 years ago

  • Is duplicate of deleted (Bug #20379: bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0)))
Actions #5

Updated by John Spray almost 7 years ago

  • Has duplicate Bug #20379: bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0)) added
Actions #6

Updated by John Spray almost 7 years ago

  • Status changed from Duplicate to New

Turns out when something is marked as a duplicate in redmine, it automatically closes this one when I close the other one! Reopening.

Actions #7

Updated by Sage Weil almost 7 years ago

  • Description updated (diff)
  • Status changed from New to 12
Actions #8

Updated by Sage Weil almost 7 years ago

  • Assignee set to Sage Weil

aio completion thread blocking on deferred_lock:

void BlueStore::_deferred_aio_finish(OpSequencer *osr)
{
  dout(10) << __func__ << " osr " << osr << dendl;
  assert(osr->deferred_running);
  DeferredBatch *b = osr->deferred_running;

  {
    std::lock_guard<std::mutex> l(deferred_lock);
    assert(osr->deferred_running == b);
    osr->deferred_running = nullptr;
    if (!osr->deferred_pending) {
      auto q = deferred_queue.iterator_to(*osr);
      deferred_queue.erase(q);
    } else if (deferred_aggressive) {
      _deferred_submit(osr);
    }
  }

while another thread is holding that lock and trying to submit deferred aio in _deferred_try_submit() -> _deferred_submit(osr) -> bdev->aio_submit.
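For anyone trying to picture the cycle, here is a minimal sketch of it, not the BlueStore code. Everything in it (BoundedAioQueue, Outcome, simulate, the 50 ms timeout) is a hypothetical stand-in, and it assumes a queue slot is only freed once the completion path has taken the lock. The submitter holds the lock while retrying against a full queue, so the completion side never gets in and no slot ever frees up.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a fixed-depth aio queue (not a Ceph type).
struct BoundedAioQueue {
  std::size_t capacity;
  std::size_t in_flight;

  // Returns false when the queue is full (models EAGAIN from the kernel).
  bool try_submit() {
    if (in_flight >= capacity) return false;
    ++in_flight;
    return true;
  }

  // Retires one aio, freeing a slot.
  void complete_one() {
    assert(in_flight > 0);
    --in_flight;
  }
};

struct Outcome {
  bool submitted;       // did the submitter manage to queue its aio?
  bool completion_ran;  // did the completion path ever get the lock?
};

// Models the cycle: the submitter keeps the lock while retrying, and the
// completion path needs that same lock before it can free a slot.
Outcome simulate(std::size_t capacity, std::size_t already_in_flight,
                 int max_retries) {
  BoundedAioQueue q{capacity, already_in_flight};
  std::timed_mutex deferred_lock;  // plays the role of deferred_lock
  Outcome out{false, false};

  // Submitter side: take the lock first, as the submit path does.
  std::unique_lock<std::timed_mutex> held(deferred_lock);

  // Completion side: a timed lock attempt stands in for blocking forever,
  // so the sketch terminates instead of actually hanging.
  std::thread completion([&] {
    std::unique_lock<std::timed_mutex> l(deferred_lock,
                                         std::chrono::milliseconds(50));
    if (l.owns_lock()) {
      q.complete_one();
      out.completion_ran = true;
    }
  });

  // Bounded retries while still holding the lock.
  for (int i = 0; i < max_retries && !out.submitted; ++i) {
    out.submitted = q.try_submit();
    if (!out.submitted)
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }

  completion.join();  // the lock is still held here, so completion timed out
  return out;
}
```

Calling simulate(2, 2, 16) models a full queue: the submit never succeeds and the completion side never runs. Build with thread support enabled (for example -pthread with GCC or Clang).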

Actions #9

Updated by Sage Weil almost 7 years ago

  • Subject changed from bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0)) to bluestore: deferred aio submission can deadlock with completion
Actions #10

Updated by Sage Weil almost 7 years ago

Easy workaround is to make the aio queue really big.

The harder fix is to do some complicated lock juggling. I worry about making the code even more complex, though. For now I'm just going to increase the aio queue (drastically).
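A rough sketch of why a much deeper queue sidesteps the problem, under the assumption that nothing completes while the submitter is retrying (which is the situation when the completion thread is stuck on the lock). SubmitResult and submit_with_retries are hypothetical names, and the depths used below are illustrative, not values taken from this ticket.

```cpp
#include <cerrno>
#include <cstddef>

struct SubmitResult {
  int rc;       // 0 on success, -EAGAIN once retries are exhausted
  int retries;  // number of retries performed before giving up or succeeding
};

// Hypothetical model: `depth` queue slots, `in_flight` of them occupied,
// and no completions arrive during the retry loop.
SubmitResult submit_with_retries(std::size_t depth, std::size_t in_flight,
                                 int max_retries) {
  SubmitResult res{0, 0};
  for (;;) {
    if (in_flight < depth) {  // a free slot exists: submit succeeds
      res.rc = 0;
      return res;
    }
    if (res.retries >= max_retries) {  // give up, as in the log above
      res.rc = -EAGAIN;
      return res;
    }
    ++res.retries;  // a real submitter would sleep or back off here
  }
}
```

With a shallow queue that is already full, submit_with_retries(32, 32, 16) burns all 16 retries and returns -EAGAIN. With a much deeper queue, submit_with_retries(1024, 32, 16) succeeds immediately. The lock ordering problem is still there; the deeper queue only makes it far less likely that the submitter ever has to wait on a completion.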

Actions #11

Updated by Sage Weil almost 7 years ago

  • Status changed from 12 to 7
Actions #12

Updated by Sage Weil almost 7 years ago

  • Status changed from 7 to Resolved