Bug #14780

librbd: TaskFinisher lifetime no longer matches ImageWatcher

Added by Josh Durgin about 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
Start date:
02/17/2016
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Since notify handling was made asynchronous from the librados threads in commit d898995b0e3ea301b1325f68a0532d57afa3c816, tests can crash during image close when exclusive locking is enabled.

This occurs because flushing the watches no longer guarantees that all notifies have been completely handled. Since these notifies run from the TaskFinisher attached to the CephContext, notifies added to the TaskFinisher can run after the ImageCtx they refer to has been destroyed. The notify for exclusive lock release hits exactly this case.

Looking into this also made me notice that sharing a single TaskFinisher is currently not safe, since ImageWatcher::unregister_watch() cancels all events, not just those scheduled by that image.

Example crash backtrace from test_rbd.py:

#0  ceph::log::SubsystemMap::should_gather (this=0x90, sub=15, level=10) at ./log/SubsystemMap.h:62
#1  0x00007fa1c523e2b8 in librados::IoCtxImpl::notify (this=0x2c3bec0, oid=..., bl=..., timeout_ms=<optimized out>, preply_bl=<optimized out>, preply_buf=<optimized out>, preply_buf_len=0x0)
    at librados/IoCtxImpl.cc:1332
#2  0x00007fa1c51f8447 in librados::IoCtx::notify2 (this=0x2c68360, oid=..., bl=..., timeout_ms=timeout_ms@entry=5000, preplybl=preplybl@entry=0x0) at librados/librados.cc:1827
#3  0x00007fa1ba2d0d6f in librbd::ImageWatcher::execute_released_lock (this=0x7fa16c010170) at librbd/ImageWatcher.cc:324
#4  0x00007fa1ba2d959a in operator() (a0=<optimized out>, this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
#5  FunctionContext::finish (this=<optimized out>, r=<optimized out>) at ./include/Context.h:460
#6  0x00007fa1ba296ae7 in Context::complete (this=0x7fa16c003990, r=0) at ./include/Context.h:64
#7  0x00007fa1ba296ae7 in Context::complete (this=0x7fa16c0096c0, r=0) at ./include/Context.h:64
#8  0x00007fa1ba3a70a6 in Finisher::finisher_thread_entry (this=0x7fa16c0026c0) at common/Finisher.cc:68
#9  0x00000031a7a07f33 in start_thread (arg=0x7fa18dffb700) at pthread_create.c:309
#10 0x00000031a76f4ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Associated revisions

Revision a2c90b51 (diff)
Added by Josh Durgin about 3 years ago

Revert "librbd: use task finisher per CephContext"

Since notify handling was made async from the librados threads in
d898995b0e3ea301b1325f68a0532d57afa3c816 tests can crash during
image close when exclusive locking is enabled.

This occurs because flushing the watches no longer guarantees that all
notifies have been completely handled, and since these are run from
the TaskFinisher attached to the CephContext, notifies added to the
TaskFinisher run after the ImageCtx they refer to has been
destroyed. The notify for exclusive lock release runs into this in
this case.

Looking into this also made me notice that sharing a single
TaskFinisher is not safe currently since all events are cancelled by
ImageWatcher::unregister_watch(), not just those scheduled by that
image.

Example crash backtrace from test_rbd.py:

at librados/IoCtxImpl.cc:1332

This reverts commit 96563c15159d1ba0e0978e76b8df6a8ab311e5d2.

Fixes: #14780
Signed-off-by: Josh Durgin <>

History

#1 Updated by Josh Durgin about 3 years ago

  • Status changed from New to Need Review
  • Assignee set to Josh Durgin

#2 Updated by Josh Durgin about 3 years ago

  • Status changed from Need Review to Resolved
