Bug #59224


Rbd-mirror lockdep issue when adding latency

Added by Christopher Hoffman about 1 year ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ran into a lockdep issue while investigating the effects of added latency.

   -4> 2023-03-29T18:00:00.418+0000 7f8087168640  0 btw, i am holding these locks:
    -3> 2023-03-29T18:00:00.418+0000 7f8087168640  0   librbd::ManagedLock<I>::m_lock (0x5621acea8258) (115)
    -2> 2023-03-29T18:00:00.418+0000 7f8087168640  0

    -1> 2023-03-29T18:00:00.446+0000 7f8087168640 -1 /ceph/src/common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int, bool, bool)' thread 7f8087168640 time 2023-03-29T18:00:00.418549+0000
/ceph/src/common/lockdep.cc: 337: ceph_abort_msg("abort() called")

 ceph version 18.0.0-3158-gf5598a8c9e9 (f5598a8c9e9841e359960dfad1d323e47b40ede0) reef (dev)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc8) [0x7f80993a9620]
 2: (lockdep_will_lock(char const*, int, bool, bool)+0x726) [0x7f80994340b8]
 3: (ceph::mutex_debug_detail::mutex_debugging_base::_will_lock(bool)+0x17) [0x7f8099437209]
 4: (ceph::shared_mutex_debug::lock_shared()+0x25) [0x7f8099437ce7]
 5: (librbd::image::RefreshRequest<librbd::ImageCtx>::send_v2_get_mutable_metadata()+0x46) [0x5621a5bcfb70]
 6: (librbd::image::RefreshRequest<librbd::ImageCtx>::send()+0x1f) [0x5621a5bcfe1f]
 7: (librbd::ImageState<librbd::ImageCtx>::send_refresh_unlock()+0x9a) [0x5621a5adb68e]
 8: (librbd::ImageState<librbd::ImageCtx>::execute_next_action_unlock()+0x82) [0x5621a5adbbe0]
 9: (librbd::ImageState<librbd::ImageCtx>::complete_action_unlock(librbd::ImageState<librbd::ImageCtx>::State, int)+0x186) [0x5621a5adbf38]
 10: (librbd::ImageState<librbd::ImageCtx>::handle_prepare_lock_complete()+0x115) [0x5621a5adc9dd]
 11: (librbd::ExclusiveLock<librbd::ImageCtx>::post_acquire_lock_handler(int, Context*)+0x171) [0x5621a5abda45]
 12: (librbd::ManagedLock<librbd::ImageCtx>::handle_acquire_lock(int)::{lambda(int)#5}::operator()(int) const+0x26) [0x5621a5b0a9aa]
 13: (LambdaContext<librbd::ManagedLock<librbd::ImageCtx>::handle_acquire_lock(int)::{lambda(int)#5}>::finish(int)+0xd) [0x5621a5b0a9bd]
 14: (Context::complete(int)+0x9) [0x5621a59aa02f]
 15: (librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}::operator()() const+0x11) [0x5621a59aa54b]
 16: (boost::asio::detail::completion_handler<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x5d) [0x5621a59aecd9]
 17: (boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x66) [0x5621a5ab840e]
 18: /ceph/build/lib/librados.so.2(+0xa2705) [0x7f809a074705]
 19: /ceph/build/lib/librados.so.2(+0xa2de1) [0x7f809a074de1]
 20: /ceph/build/lib/librados.so.2(+0xa50ec) [0x7f809a0770ec]
 21: /ceph/build/lib/librados.so.2(+0xa513e) [0x7f809a07713e]
 22: /ceph/build/lib/librados.so.2(+0xa5157) [0x7f809a077157]
 23: /lib64/libstdc++.so.6(+0xdbad3) [0x7f80982dbad3]
 24: /lib64/libc.so.6(+0x8cded) [0x7f8097e8cded]
 25: /lib64/libc.so.6(+0x112370) [0x7f8097f12370]

Was able to hit this by:
1. Configured east/west sites with Open vSwitch (Example 2 in
https://www.redhat.com/sysadmin/net-namespaces). The key part is that the two sites live in different network namespaces.

Step 1 can be done by running the provided netns.sh.
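A minimal sketch of what netns.sh sets up (the namespace/device names "east"/"west" and the 10.0.0.0/24 addresses are assumptions here; the attached script, or the OVS bridge variant from the linked article, is the actual reproducer). Requires root:

```shell
# Two namespaces joined by a veth pair stand in for the two sites.
ip netns add east
ip netns add west
ip link add east type veth peer name west
ip link set east netns east
ip link set west netns west
ip netns exec east ip addr add 10.0.0.1/24 dev east
ip netns exec west ip addr add 10.0.0.2/24 dev west
ip netns exec east ip link set east up
ip netns exec west ip link set west up
ip netns exec east ping -c 1 10.0.0.2   # sanity check: east can reach west
```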

2. Started mstart for site-a on east and a separate mstart for site-b on west.
3. On east, created a single image with snapshot-based mirroring.
4. Set up a 1m snapshot interval for the image.

Steps 2-4 can be done using "m-namespace.sh start".
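Steps 3-4 correspond roughly to the rbd commands below (a sketch: the pool/image names match the `device map` call in step 5, the 5G size matches the fio job, and peer bootstrap between the sites is omitted; the exact invocations in m-namespace.sh may differ):

```shell
# On site-a (east): enable per-image mirroring on the pool, create the
# image, enable snapshot-based mirroring on it, and add a 1m schedule.
./bin/rbd --cluster site-a mirror pool enable pool1 image
./bin/rbd --cluster site-a create --size 5G pool1/test-demote-sb
./bin/rbd --cluster site-a mirror image enable pool1/test-demote-sb snapshot
./bin/rbd --cluster site-a mirror snapshot schedule add \
    --pool pool1 --image test-demote-sb 1m
```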

5. On east, mapped the rbd image to a block device, formatted it with XFS, and mounted it.

./bin/rbd --cluster site-a device map pool1/test-demote-sb
mkfs.xfs /dev/rbd0   # block device mapped above
mount /dev/rbd0 /mnt/latency1

6. On east, ran a continuous fio workload of random reads/writes (40 read IOPS, 10 write IOPS, 4K block size):

cat > /mnt/latency1/smallIO_test1 <<'EOF'
[global]
refill_buffers
time_based=1
size=5g
direct=1
group_reporting
ioengine=libaio

[workload]
rw=randrw
rate_iops=40,10
blocksize=4KB
#norandommap
iodepth=4
numjobs=1
runtime=2d
EOF
cd /mnt/latency1
fio smallIO_test1

7. "Injected" latency in the east (primary) network namespace using:

tc qdisc add dev east root netem delay 30ms
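The injected delay can be verified (and later removed) before waiting for a mirror interval. This is a sketch; the "east" namespace/device name and the 10.0.0.2 peer address are assumptions carried over from the namespace setup, and root is required:

```shell
ip netns exec east tc qdisc show dev east       # should list "netem delay 30ms"
ip netns exec east ping -c 3 10.0.0.2           # RTT should grow by ~30ms
ip netns exec east tc qdisc del dev east root netem   # remove delay when done
```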

8. After one or two intervals, observe the above stack trace.
It should appear in the site-b rbd-mirror log:

cat run/site-b/out/client.admin.<Pid of site-b rbd mirror>.log


Files

m-namespace.sh (2.53 KB) m-namespace.sh Christopher Hoffman, 04/05/2023 08:14 PM
netns.sh (1.53 KB) netns.sh Christopher Hoffman, 04/05/2023 08:17 PM

Updated by Christopher Hoffman about 1 year ago

  • Subject changed from Rbd-mirror lockdep issue to Rbd-mirror lockdep issue when adding latency

Updated by Ilya Dryomov about 1 year ago

Hi Chris,

Would you be able to make the reproducer script available? Having to stitch it together from https://www.redhat.com/sysadmin/net-namespaces, src/mstart.sh commands, formatting XFS (what is the size of the image?), a fio command for "R:40 IOPS, W:10 IOPS, 4K block size", etc. is suboptimal.

Updated by Christopher Hoffman about 1 year ago
