Bug #59224


Rbd-mirror lockdep issue when adding latency

Added by Christopher Hoffman about 1 year ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ran into a lockdep issue while investigating the effects of added latency.

   -4> 2023-03-29T18:00:00.418+0000 7f8087168640  0 btw, i am holding these locks:
    -3> 2023-03-29T18:00:00.418+0000 7f8087168640  0   librbd::ManagedLock<I>::m_lock (0x5621acea8258) (115)
    -2> 2023-03-29T18:00:00.418+0000 7f8087168640  0

    -1> 2023-03-29T18:00:00.446+0000 7f8087168640 -1 /ceph/src/common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int, bool, bool)' thread 7f8087168640 time 2023-03-29T18:00:00.418549+0000
/ceph/src/common/lockdep.cc: 337: ceph_abort_msg("abort() called")

 ceph version 18.0.0-3158-gf5598a8c9e9 (f5598a8c9e9841e359960dfad1d323e47b40ede0) reef (dev)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc8) [0x7f80993a9620]
 2: (lockdep_will_lock(char const*, int, bool, bool)+0x726) [0x7f80994340b8]
 3: (ceph::mutex_debug_detail::mutex_debugging_base::_will_lock(bool)+0x17) [0x7f8099437209]
 4: (ceph::shared_mutex_debug::lock_shared()+0x25) [0x7f8099437ce7]
 5: (librbd::image::RefreshRequest<librbd::ImageCtx>::send_v2_get_mutable_metadata()+0x46) [0x5621a5bcfb70]
 6: (librbd::image::RefreshRequest<librbd::ImageCtx>::send()+0x1f) [0x5621a5bcfe1f]
 7: (librbd::ImageState<librbd::ImageCtx>::send_refresh_unlock()+0x9a) [0x5621a5adb68e]
 8: (librbd::ImageState<librbd::ImageCtx>::execute_next_action_unlock()+0x82) [0x5621a5adbbe0]
 9: (librbd::ImageState<librbd::ImageCtx>::complete_action_unlock(librbd::ImageState<librbd::ImageCtx>::State, int)+0x186) [0x5621a5adbf38]
 10: (librbd::ImageState<librbd::ImageCtx>::handle_prepare_lock_complete()+0x115) [0x5621a5adc9dd]
 11: (librbd::ExclusiveLock<librbd::ImageCtx>::post_acquire_lock_handler(int, Context*)+0x171) [0x5621a5abda45]
 12: (librbd::ManagedLock<librbd::ImageCtx>::handle_acquire_lock(int)::{lambda(int)#5}::operator()(int) const+0x26) [0x5621a5b0a9aa]
 13: (LambdaContext<librbd::ManagedLock<librbd::ImageCtx>::handle_acquire_lock(int)::{lambda(int)#5}>::finish(int)+0xd) [0x5621a5b0a9bd]
 14: (Context::complete(int)+0x9) [0x5621a59aa02f]
 15: (librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}::operator()() const+0x11) [0x5621a59aa54b]
 16: (boost::asio::detail::completion_handler<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x5d) [0x5621a59aecd9]
 17: (boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x66) [0x5621a5ab840e]
 18: /ceph/build/lib/librados.so.2(+0xa2705) [0x7f809a074705]
 19: /ceph/build/lib/librados.so.2(+0xa2de1) [0x7f809a074de1]
 20: /ceph/build/lib/librados.so.2(+0xa50ec) [0x7f809a0770ec]
 21: /ceph/build/lib/librados.so.2(+0xa513e) [0x7f809a07713e]
 22: /ceph/build/lib/librados.so.2(+0xa5157) [0x7f809a077157]
 23: /lib64/libstdc++.so.6(+0xdbad3) [0x7f80982dbad3]
 24: /lib64/libc.so.6(+0x8cded) [0x7f8097e8cded]
 25: /lib64/libc.so.6(+0x112370) [0x7f8097f12370]

Was able to hit this by:
1. Configured east/west sites with Open vSwitch (Example 2 in
https://www.redhat.com/sysadmin/net-namespaces). The key part is that the two sites live in different network namespaces.

Step 1 can be done by running the provided netns.sh.
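A minimal sketch of what netns.sh sets up (the namespace/device names "east"/"west" and the 10.0.0.0/24 addresses are assumptions here; the attached script, or the OVS bridge variant from the linked article, is the actual reproducer). Requires root:

```shell
# Two namespaces joined by a veth pair stand in for the two sites.
ip netns add east
ip netns add west
ip link add east type veth peer name west
ip link set east netns east
ip link set west netns west
ip netns exec east ip addr add 10.0.0.1/24 dev east
ip netns exec west ip addr add 10.0.0.2/24 dev west
ip netns exec east ip link set east up
ip netns exec west ip link set west up
ip netns exec east ping -c 1 10.0.0.2   # sanity check: east can reach west
```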

2. Started mstart for site-a on east and a separate mstart for site-b on west.
3. On east, created a single image with snapshot-based mirroring.
4. Set up a 1m snapshot interval for the image.

Steps 2-4 can be done using "m-namespace.sh start".
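Steps 3-4 correspond roughly to the rbd commands below (a sketch: the pool/image names match the `device map` call in step 5, the 5G size matches the fio job, and peer bootstrap between the sites is omitted; the exact invocations in m-namespace.sh may differ):

```shell
# On site-a (east): enable per-image mirroring on the pool, create the
# image, enable snapshot-based mirroring on it, and add a 1m schedule.
./bin/rbd --cluster site-a mirror pool enable pool1 image
./bin/rbd --cluster site-a create --size 5G pool1/test-demote-sb
./bin/rbd --cluster site-a mirror image enable pool1/test-demote-sb snapshot
./bin/rbd --cluster site-a mirror snapshot schedule add \
    --pool pool1 --image test-demote-sb 1m
```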

5. On east, mapped the rbd image to a block device, formatted it with XFS, and mounted it.

./bin/rbd --cluster site-a device map pool1/test-demote-sb
mkfs.xfs /dev/rbd0   # block device mapped above
mount /dev/rbd0 /mnt/latency1

6. On east, ran a continuous fio workload of random reads/writes (40 read IOPS, 10 write IOPS, 4K block size):

cat > /mnt/latency1/smallIO_test1 <<'EOF'
[global]
refill_buffers
time_based=1
size=5g
direct=1
group_reporting
ioengine=libaio

[workload]
rw=randrw
rate_iops=40,10
blocksize=4KB
#norandommap
iodepth=4
numjobs=1
runtime=2d
EOF
cd /mnt/latency1
fio smallIO_test1

7. "Injected" latency in the east (primary) network namespace using:

tc qdisc add dev east root netem delay 30ms
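The injected delay can be verified (and later removed) before waiting for a mirror interval. This is a sketch; the "east" namespace/device name and the 10.0.0.2 peer address are assumptions carried over from the namespace setup, and root is required:

```shell
ip netns exec east tc qdisc show dev east       # should list "netem delay 30ms"
ip netns exec east ping -c 3 10.0.0.2           # RTT should grow by ~30ms
ip netns exec east tc qdisc del dev east root netem   # remove delay when done
```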

8. After one or two intervals, observe the above stack trace.
It should appear in the site-b rbd-mirror log:

cat run/site-b/out/client.admin.<Pid of site-b rbd mirror>.log


Files

m-namespace.sh (2.53 KB) m-namespace.sh Christopher Hoffman, 04/05/2023 08:14 PM
netns.sh (1.53 KB) netns.sh Christopher Hoffman, 04/05/2023 08:17 PM

Updated by Christopher Hoffman about 1 year ago

  • Subject changed from Rbd-mirror lockdep issue to Rbd-mirror lockdep issue when adding latency

Updated by Ilya Dryomov about 1 year ago

Hi Chris,

Would you be able to make the reproducer script available? Having to stitch it together from https://www.redhat.com/sysadmin/net-namespaces, src/mstart.sh commands, formatting XFS (what is the size of the image?), a fio command for "R:40 IOPS, W:10 IOPS, 4K block size", etc. is suboptimal.

Updated by Christopher Hoffman about 1 year ago
