Bug #59393 (closed)

rbd-mirror cannot grab lock in time with latency

Added by Christopher Hoffman about 1 year ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: pacific,quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In the case below, 30 ms of latency is added on site-A. An image is being written on site-A and copied to site-B via rbd-mirror.

When rbd-mirror finishes transferring a snapshot, it cleans up the old snapshot on the source image via snap-remove.

If there is an active writer via a krbd client, a proxied op is sent to that client to perform the action.
In this case, krbd doesn't have snap-remove implemented, so the lock owner cannot handle the operation and it must be performed locally by the requester.

You can see this interaction below:

2023-04-10T20:36:07.742+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 remote snap_remove request: .mirror.primary.46e80e27-6a14-4f31-a122-10a419b7f0e5.2f8744fa-070c-4981-8b90-0e494956f465^M
2023-04-10T20:36:07.742+0000 7f9375614640 20 librbd::ExclusiveLock: 0x560db12ff680 accept_request=0 (request_type=0)^M
2023-04-10T20:36:07.742+0000 7f9375614640 10 librbd::Watcher::C_NotifyAck 0x560db21915e0 finish: r=0^M
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::watcher::Notifier: 0x560db130a540 handle_notify: r=0^M
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::watcher::Notifier: 0x560db130a540 handle_notify: pending=0^M
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::image_watcher::NotifyLockOwner: 0x560db220fa80 handle_notify: r=0^M
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::image_watcher::NotifyLockOwner: 0x560db220fa80 handle_notify client responded with r=-95^M
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::ImageWatcher: remove_async_request: [4159,94616807323264,43]^M
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::Operations: handle_remote_request: r=-95^M
2023-04-10T20:36:07.774+0000 7f9375e15640  5 librbd::Operations: snap remove not supported by current lock owner^M
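
(For reference, r=-95 is -EOPNOTSUPP: the krbd lock owner rejects the proxied snap_remove request because it does not support that operation.)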

Next, rbd-mirror attempts to acquire the lock so that it can perform the operation locally.

The lock request (which fails):

2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ManagedLock: 0x560db12ff698 acquire_lock: ^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ManagedLock: 0x560db12ff698 send_acquire_lock: ^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ExclusiveLock: 0x560db12ff680 pre_acquire_lock_handler^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 send_prepare_lock: ^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ImageState: 0x560db13120b0 prepare_lock^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ImageState: 0x560db13120b0 0x560db13120b0 send_prepare_lock_unlock^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 handle_prepare_lock: r=0^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 send_flush_notifies: ^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 handle_flush_notifies: ^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ManagedLock: 0x560db12ff698 handle_pre_acquire_lock: r=0^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_get_locker: ^M
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 send_get_lockers: ^M
2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 handle_get_lockers: r=0^M
2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 handle_get_lockers: retrieved exclusive locker: client.4162@192.168.195.10:0/2159339200^M
2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 finish: r=0^M
2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_get_locker: r=0^M
2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_lock: entity=client.4159, cookie=auto 94616807323264^M
2023-04-10T20:36:07.840+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_lock: r=-16^M
2023-04-10T20:36:07.840+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_break_lock: ^M
2023-04-10T20:36:07.840+0000 7f9375e15640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 send_get_watchers: ^M
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: r=0^M
2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: watcher=[addr=192.168.195.11:0/3825279451, entity=client.4159]^M
2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: watcher=[addr=192.168.195.10:0/2159339200, entity=client.4162]^M
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: lock owner is still alive^M
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 finish: r=-11^M
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_break_lock: r=-11^M
2023-04-10T20:36:07.872+0000 7f9375614640  5 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_break_lock: lock owner is still alive^M
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::ManagedLock: 0x560db12ff698 handle_acquire_lock: r=-11^M
2023-04-10T20:36:07.872+0000 7f9375614640  5 librbd::ManagedLock: 0x560db12ff698 handle_acquire_lock: unable to acquire exclusive lock^M
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::ExclusiveLock: 0x560db12ff680 post_acquire_lock_handler: r=-11^M
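
(r=-16 is -EBUSY, returned because the lock is still held by another client, and r=-11 is -EAGAIN, returned because the lock cannot be broken while the owner's watch is still alive.)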

Then a notify is sent to the lock owner asking it to release the lock, and the lock is released:

2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 notify request lock^M
2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::image_watcher::NotifyLockOwner: 0x560db2211600 send_notify^M
2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::watcher::Notifier: 0x560db130a540 notify: pending=1^M
2023-04-10T20:36:07.904+0000 7f9375e15640  5 librbd::Watcher: 0x560db130a480 notifications_blocked: blocked=0^M
2023-04-10T20:36:07.904+0000 7f9375e15640 10 librbd::Watcher::C_NotifyAck 0x560db156d860 C_NotifyAck: id=292057776636, handle=94616807323264^M
2023-04-10T20:36:07.904+0000 7f9375e15640 10 librbd::ImageWatcher: 0x560db130a480 exclusive lock requested^M
2023-04-10T20:36:07.904+0000 7f9375e15640 10 librbd::Watcher::C_NotifyAck 0x560db156d860 finish: r=0^M
2023-04-10T20:36:07.933+0000 7f9375614640  5 librbd::Watcher: 0x560db130a480 notifications_blocked: blocked=0^M
2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::Watcher::C_NotifyAck 0x560db15e5590 C_NotifyAck: id=292057776637, handle=94616807323264^M
2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 exclusive lock released^M
2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [0,0]^M
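
(A current lock owner of [0,0] means that no client holds the exclusive lock at this point.)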

Again, a lock request is issued by the rbd-mirror client:

2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ManagedLock: 0x560db12ff698 send_acquire_lock: ^M
2023-04-10T20:36:07.935+0000 7f9375614640 15 librbd::ImageWatcher: 0x560db130a480 requesting exclusive lock^M

It then sends a get_locker request, and by the time the response arrives the writer has already retaken the lock on the image, so the process starts over:

2023-04-10T20:36:07.935+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 send_get_lockers: ^M
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 handle_get_lockers: r=0^M
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 handle_get_lockers: retrieved exclusive locker: client.4162@192.168.195.10:0/2159339200^M
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 finish: r=0^M
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_get_locker: r=0^M
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_lock: entity=client.4159, cookie=auto 94616807323264^M
2023-04-10T20:36:07.976+0000 7f9375614640  5 librbd::Watcher: 0x560db130a480 notifications_blocked: blocked=0^M
2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::Watcher::C_NotifyAck 0x560db15e5590 C_NotifyAck: id=292057776638, handle=94616807323264^M
2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 image exclusively locked announcement^M
2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [4162,18446462598732840961]^M

Notice that between

2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [0,0]^M

and the send_lock/handle_lock exchange below, there is a 30+ ms delta, which in this case is plenty of time to lose the race for the lock on the image:

2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_lock: entity=client.4159, cookie=auto 94616807323264^M
2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 image exclusively locked announcement^M
2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [4162,18446462598732840961]^M
2023-04-10T20:36:08.004+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_lock: r=-16^M
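
To put numbers on it: the lock was observed free at 20:36:07.933, rbd-mirror's lock call went out at 20:36:07.970 (about 37 ms later), the krbd client announced it had retaken the lock at 20:36:07.976, and the lock call finally completed at 20:36:08.004 with r=-16 (-EBUSY).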

Files

old version (1.53 KB), Christopher Hoffman, 04/10/2023 09:49 PM
m-namespace.sh (2.53 KB), Christopher Hoffman, 04/10/2023 09:49 PM
log_snippet_59393.log (12.8 KB), Christopher Hoffman, 04/10/2023 09:57 PM
netns.sh (1.53 KB), Christopher Hoffman, 04/12/2023 02:45 PM

Related issues: 4 (1 open, 3 closed)

Related to rbd - Bug #59728: [test] single node "ip netns"-based latency injection for rbd-mirror (In Progress, Ramana Raja)

Related to rbd - Bug #61225: TestClsRbd.mirror_snapshot failure (Rejected, Ilya Dryomov)

Copied to rbd - Backport #59709: quincy: rbd-mirror cannot grab lock in time with latency (Resolved, Ilya Dryomov)
Copied to rbd - Backport #59710: pacific: rbd-mirror cannot grab lock in time with latency (Resolved, Christopher Hoffman)

Updated by Christopher Hoffman about 1 year ago

Steps to reproduce:
1. Configured East/West sites with Open vSwitch, following Example 2 in https://www.redhat.com/sysadmin/net-namespaces. The key part was using different network namespaces.

Step 1 can be done by running the provided netns.sh (see the sketch below).
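
For illustration, a minimal sketch of the kind of setup netns.sh automates, here using a plain veth pair instead of Open vSwitch (interface names and the address are made up for the example; the attached netns.sh is the authoritative version):

ip netns add east                                  # namespace for the "east" cluster
ip link add veth-east type veth peer name veth-east-host
ip link set veth-east netns east                   # move one end into the namespace
ip netns exec east ip addr add 192.168.195.11/24 dev veth-east   # illustrative address
ip netns exec east ip link set veth-east up
ip link set veth-east-host up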

2. Started mstart for site-a on east and a separate mstart for site-b on west.
3. On east, created a single image with snapshot-based mirroring enabled.
4. Set up a 1m mirror snapshot interval for the image.

Steps 2-4 can be done using "m-namespace.sh start" (a rough equivalent of steps 3-4 is sketched below).
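
For reference, the image-related part of steps 3-4 would look roughly like the following plain rbd commands (the image size is an arbitrary example value and the pool/image names are taken from step 5; m-namespace.sh remains the authoritative version, and the cluster/peer setup of step 2 is omitted here):

./bin/rbd --cluster site-a create --size 10G pool1/test-demote-sb
./bin/rbd --cluster site-a mirror image enable pool1/test-demote-sb snapshot
./bin/rbd --cluster site-a mirror snapshot schedule add --pool pool1 --image test-demote-sb 1m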

5. On east, mapped the rbd image to a block device, formatted it with XFS, and mounted it:

./bin/rbd --cluster site-a device map pool1/test-demote-sb
mkfs.xfs /dev/rbd0   # block device that was mapped above
mount /dev/rbd0 /mnt/latency1

6. On east, ran a continuous fio workload of random reads/writes (reads: 40 IOPS, writes: 10 IOPS, 4K block size):

echo """

[global]
refill_buffers
time_based=1
size=5g
direct=1
group_reporting
ioengine=libaio

[workload]
rw=randrw
rate_iops=40,10
blocksize=4KB
#norandommap
iodepth=4
numjobs=1
runtime=2d
""" >> /mnt/latency1/smallIO_test1
cd /mnt/latency1
fio smallIO_test1

7. "injected" latency on the east(primary) network namespace using

tc qdisc add dev east root netem delay 30ms
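
The delay can be checked and later removed with standard tc commands:

tc qdisc show dev east            # confirm the netem delay is active
tc qdisc del dev east root netem  # remove the injected latency when done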

#3

Updated by Christopher Hoffman about 1 year ago

#4

Updated by Christopher Hoffman about 1 year ago

  • Status changed from New to In Progress
#5

Updated by Christopher Hoffman about 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 51166
#6

Updated by Ilya Dryomov about 1 year ago

  • Assignee set to Christopher Hoffman
  • Backport set to pacific,quincy
#7

Updated by Ilya Dryomov 12 months ago

  • Status changed from Fix Under Review to Pending Backport
#8

Updated by Backport Bot 12 months ago

  • Copied to Backport #59709: quincy: rbd-mirror cannot grab lock in time with latency added
#9

Updated by Backport Bot 12 months ago

  • Copied to Backport #59710: pacific: rbd-mirror cannot grab lock in time with latency added
#10

Updated by Backport Bot 12 months ago

  • Tags set to backport_processed
#11

Updated by Ilya Dryomov 12 months ago

  • Related to Bug #59728: [test] single node "ip netns"-based latency injection for rbd-mirror added
#12

Updated by Ilya Dryomov 12 months ago

  • Related to Bug #61225: TestClsRbd.mirror_snapshot failure added
#13

Updated by Backport Bot 10 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
