Bug #59224
Rbd-mirror lockdep issue when adding latency
Status: Open
Description
Ran into a lockdep issue while investigating the effects of added latency.
-4> 2023-03-29T18:00:00.418+0000 7f8087168640 0 btw, i am holding these locks:
-3> 2023-03-29T18:00:00.418+0000 7f8087168640 0 librbd::ManagedLock<I>::m_lock (0x5621acea8258) (115)
-2> 2023-03-29T18:00:00.418+0000 7f8087168640 0
-1> 2023-03-29T18:00:00.446+0000 7f8087168640 -1 /ceph/src/common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int, bool, bool)' thread 7f8087168640 time 2023-03-29T18:00:00.418549+0000
/ceph/src/common/lockdep.cc: 337: ceph_abort_msg("abort() called")

ceph version 18.0.0-3158-gf5598a8c9e9 (f5598a8c9e9841e359960dfad1d323e47b40ede0) reef (dev)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc8) [0x7f80993a9620]
2: (lockdep_will_lock(char const*, int, bool, bool)+0x726) [0x7f80994340b8]
3: (ceph::mutex_debug_detail::mutex_debugging_base::_will_lock(bool)+0x17) [0x7f8099437209]
4: (ceph::shared_mutex_debug::lock_shared()+0x25) [0x7f8099437ce7]
5: (librbd::image::RefreshRequest<librbd::ImageCtx>::send_v2_get_mutable_metadata()+0x46) [0x5621a5bcfb70]
6: (librbd::image::RefreshRequest<librbd::ImageCtx>::send()+0x1f) [0x5621a5bcfe1f]
7: (librbd::ImageState<librbd::ImageCtx>::send_refresh_unlock()+0x9a) [0x5621a5adb68e]
8: (librbd::ImageState<librbd::ImageCtx>::execute_next_action_unlock()+0x82) [0x5621a5adbbe0]
9: (librbd::ImageState<librbd::ImageCtx>::complete_action_unlock(librbd::ImageState<librbd::ImageCtx>::State, int)+0x186) [0x5621a5adbf38]
10: (librbd::ImageState<librbd::ImageCtx>::handle_prepare_lock_complete()+0x115) [0x5621a5adc9dd]
11: (librbd::ExclusiveLock<librbd::ImageCtx>::post_acquire_lock_handler(int, Context*)+0x171) [0x5621a5abda45]
12: (librbd::ManagedLock<librbd::ImageCtx>::handle_acquire_lock(int)::{lambda(int)#5}::operator()(int) const+0x26) [0x5621a5b0a9aa]
13: (LambdaContext<librbd::ManagedLock<librbd::ImageCtx>::handle_acquire_lock(int)::{lambda(int)#5}>::finish(int)+0xd) [0x5621a5b0a9bd]
14: (Context::complete(int)+0x9) [0x5621a59aa02f]
15: (librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}::operator()() const+0x11) [0x5621a59aa54b]
16: (boost::asio::detail::completion_handler<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x5d) [0x5621a59aecd9]
17: (boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x66) [0x5621a5ab840e]
18: /ceph/build/lib/librados.so.2(+0xa2705) [0x7f809a074705]
19: /ceph/build/lib/librados.so.2(+0xa2de1) [0x7f809a074de1]
20: /ceph/build/lib/librados.so.2(+0xa50ec) [0x7f809a0770ec]
21: /ceph/build/lib/librados.so.2(+0xa513e) [0x7f809a07713e]
22: /ceph/build/lib/librados.so.2(+0xa5157) [0x7f809a077157]
23: /lib64/libstdc++.so.6(+0xdbad3) [0x7f80982dbad3]
24: /lib64/libc.so.6(+0x8cded) [0x7f8097e8cded]
25: /lib64/libc.so.6(+0x112370) [0x7f8097f12370]
Was able to hit by:
1. Configured East/West sites with Open vSwitch, following Example 2 of
https://www.redhat.com/sysadmin/net-namespaces. The key part is putting the two sites in different network namespaces.
Step 1 can be done by running the provided netns.sh.
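For orientation, the two-namespace topology from the linked article can be sketched roughly as below. This is a hypothetical illustration only: the interface names (east/west, matching the tc command in step 7) and the addresses are assumptions, and the attached netns.sh is the authoritative version.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the two-namespace setup (Example 2 of the
# linked article). Names and addresses are illustrative; see the
# attached netns.sh for the real script. Requires root to run.
set -eu

setup_namespaces() {
    # one network namespace per "site"
    ip netns add east
    ip netns add west

    # a veth pair connecting the two namespaces
    ip link add east type veth peer name west
    ip link set east netns east
    ip link set west netns west

    # assumed addressing; adjust to match your environment
    ip netns exec east ip addr add 10.0.0.1/24 dev east
    ip netns exec west ip addr add 10.0.0.2/24 dev west
    ip netns exec east ip link set east up
    ip netns exec west ip link set west up
}

# setup_namespaces   # uncomment to run (needs root)
```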
2. Started mstart for site-a in the east namespace and a separate mstart for site-b in the west namespace.
3. On east, created a single image with snapshot-based mirroring enabled.
4. Set up a 1m mirror-snapshot schedule for the image.
Steps 2-4 can be done using "m-namespace.sh start".
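Steps 3-4 can also be sketched with stock rbd commands. This is a hedged approximation, not the contents of m-namespace.sh: the pool/image names are taken from step 5, the 5G image size is only an assumption (chosen to match the fio job's size=5g), and peer bootstrapping between the clusters is omitted.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of steps 3-4; the attached m-namespace.sh is the
# authoritative reproducer. Assumes both clusters are up and already
# peered for mirroring.
set -eu

setup_snapshot_mirroring() {
    local pool=pool1 image=test-demote-sb

    # per-image mirroring mode for the pool on the primary (east) site
    rbd --cluster site-a mirror pool enable "${pool}" image

    # image size is an assumption (fio job below uses size=5g)
    rbd --cluster site-a create --size 5G "${pool}/${image}"

    # enable snapshot-based mirroring for the image
    rbd --cluster site-a mirror image enable "${pool}/${image}" snapshot

    # take a mirror snapshot every minute
    rbd --cluster site-a mirror snapshot schedule add \
        --pool "${pool}" --image "${image}" 1m
}

# setup_snapshot_mirroring   # uncomment to run against live clusters
```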
5. On east, mapped the rbd image to a block device, formatted it with XFS, and mounted it.
./bin/rbd --cluster site-a device map pool1/test-demote-sb
mkfs.xfs /dev/rbd0   # the block device mapped above
mount /dev/rbd0 /mnt/latency1
6. On east, ran a continuous fio workload: random reads/writes at 40 read IOPS and 10 write IOPS, 4K block size.
cat > /mnt/latency1/smallIO_test1 <<'EOF'
[global]
refill_buffers
time_based=1
size=5g
direct=1
group_reporting
ioengine=libaio
[workload]
rw=randrw
rate_iops=40,10
blocksize=4KB
#norandommap
iodepth=4
numjobs=1
runtime=2d
EOF
cd /mnt/latency1
fio smallIO_test1
7. Injected latency into the east (primary) network namespace using
tc qdisc add dev east root netem delay 30ms
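For convenience, the netem step can be wrapped in small helpers for adding, inspecting, and removing the delay. This is an illustrative sketch: it assumes the primary site's interface is named "east" (as in the tc command above) and that the commands run inside, or are prefixed with, the east network namespace.

```shell
#!/usr/bin/env bash
# Hypothetical helpers around the tc netem command from step 7.
# Assumes the interface is named "east" and that these run inside
# the east network namespace (e.g. via "ip netns exec east ...").
set -eu

add_latency()    { tc qdisc add dev east root netem delay 30ms; }
show_latency()   { tc qdisc show dev east; }
remove_latency() { tc qdisc del dev east root netem; }

# add_latency   # uncomment to run (needs root)
```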
8. After one or two snapshot intervals, observe the above stack trace. It should appear in the site-b rbd-mirror log:
cat run/site-b/out/client.admin.<Pid of site-b rbd mirror>.log
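Since the daemon PID in the log filename differs per run, a glob makes step 8 easier to script. This helper is a hypothetical convenience, not part of the attached scripts; it assumes it is run from the build directory and simply greps for the lockdep messages shown in the trace above.

```shell
#!/usr/bin/env bash
# Hypothetical helper for step 8: scan the site-b rbd-mirror log for
# lockdep output. Globs over the PID since it differs per run.
set -eu

find_lockdep_abort() {
    grep -n 'lockdep' run/site-b/out/client.admin.*.log ||
        echo "no lockdep messages yet"
}

# find_lockdep_abort   # uncomment to run from the ceph build directory
```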
Updated by Christopher Hoffman about 1 year ago
- Subject changed from Rbd-mirror lockdep issue to Rbd-mirror lockdep issue when adding latency
Updated by Ilya Dryomov about 1 year ago
Hi Chris,
Would you be able to make the reproducer script available? Having to stitch it together from https://www.redhat.com/sysadmin/net-namespaces, src/mstart.sh commands, formatting XFS (what is the size of the image?), a fio command for "R:40 IOPS, W:10 IOPS, 4K block size", etc. is suboptimal.
Updated by Christopher Hoffman about 1 year ago
- File m-namespace.sh m-namespace.sh added
- File netns.sh netns.sh added
- Description updated (diff)