Bug #14780

librbd: TaskFinisher lifetime no longer matches ImageWatcher

Added by Josh Durgin about 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
Start date:
02/17/2016
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Since notify handling was made asynchronous from the librados threads in commit d898995b0e3ea301b1325f68a0532d57afa3c816, tests can crash during image close when exclusive locking is enabled.

This occurs because flushing the watches no longer guarantees that all notifies have been completely handled. Since these notifies run from the TaskFinisher attached to the CephContext, notifies added to the TaskFinisher can run after the ImageCtx they refer to has been destroyed. The notify for exclusive lock release hits exactly this case.

Looking into this also made me notice that sharing a single TaskFinisher is currently not safe, since ImageWatcher::unregister_watch() cancels all events, not just those scheduled by that image.

Example crash backtrace from test_rbd.py:

#0  ceph::log::SubsystemMap::should_gather (this=0x90, sub=15, level=10) at ./log/SubsystemMap.h:62
#1  0x00007fa1c523e2b8 in librados::IoCtxImpl::notify (this=0x2c3bec0, oid=..., bl=..., timeout_ms=<optimized out>, preply_bl=<optimized out>, preply_buf=<optimized out>, preply_buf_len=0x0)
    at librados/IoCtxImpl.cc:1332
#2  0x00007fa1c51f8447 in librados::IoCtx::notify2 (this=0x2c68360, oid=..., bl=..., timeout_ms=timeout_ms@entry=5000, preplybl=preplybl@entry=0x0) at librados/librados.cc:1827
#3  0x00007fa1ba2d0d6f in librbd::ImageWatcher::execute_released_lock (this=0x7fa16c010170) at librbd/ImageWatcher.cc:324
#4  0x00007fa1ba2d959a in operator() (a0=<optimized out>, this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
#5  FunctionContext::finish (this=<optimized out>, r=<optimized out>) at ./include/Context.h:460
#6  0x00007fa1ba296ae7 in Context::complete (this=0x7fa16c003990, r=0) at ./include/Context.h:64
#7  0x00007fa1ba296ae7 in Context::complete (this=0x7fa16c0096c0, r=0) at ./include/Context.h:64
#8  0x00007fa1ba3a70a6 in Finisher::finisher_thread_entry (this=0x7fa16c0026c0) at common/Finisher.cc:68
#9  0x00000031a7a07f33 in start_thread (arg=0x7fa18dffb700) at pthread_create.c:309
#10 0x00000031a76f4ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Associated revisions

Revision a2c90b51 (diff)
Added by Josh Durgin about 3 years ago

Revert "librbd: use task finisher per CephContext"

Since notify handling was made async from the librados threads in
d898995b0e3ea301b1325f68a0532d57afa3c816 tests can crash during
image close when exclusive locking is enabled.

This occurs because flushing the watches no longer guarantees that all
notifies have been completely handled, and since these are run from
the TaskFinisher attached to the CephContext, notifies added to the
TaskFinisher run after the ImageCtx they refer to has been
destroyed. The notify for exclusive lock release runs into this in
this case.

Looking into this also made me notice that sharing a single
TaskFinisher is not safe currently since all events are cancelled by
ImageWatcher::unregister_watch(), not just those scheduled by that
image.

Example crash backtrace from test_rbd.py:

at librados/IoCtxImpl.cc:1332

This reverts commit 96563c15159d1ba0e0978e76b8df6a8ab311e5d2.

Fixes: #14780
Signed-off-by: Josh Durgin <>

History

#1 Updated by Josh Durgin about 3 years ago

  • Status changed from New to Need Review
  • Assignee set to Josh Durgin

#2 Updated by Josh Durgin about 3 years ago

  • Status changed from Need Review to Resolved
