Bug #59393
rbd-mirror cannot grab lock in time with latency
Status: Closed
Description
In the case below, 30 ms of latency is added on site-A. An image is being written on site-A and copied to site-B via rbd-mirror.
When rbd-mirror finishes transferring a snapshot, it cleans up the old snapshot on the source image via snap-remove.
If there is an active writer via a krbd client, a proxied op is sent to that client to perform the action.
In this case, krbd does not implement snap-remove, so it cannot handle the operation locally.
You can see this interaction below:
2023-04-10T20:36:07.742+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 remote snap_remove request: .mirror.primary.46e80e27-6a14-4f31-a122-10a419b7f0e5.2f8744fa-070c-4981-8b90-0e494956f465
2023-04-10T20:36:07.742+0000 7f9375614640 20 librbd::ExclusiveLock: 0x560db12ff680 accept_request=0 (request_type=0)
2023-04-10T20:36:07.742+0000 7f9375614640 10 librbd::Watcher::C_NotifyAck 0x560db21915e0 finish: r=0
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::watcher::Notifier: 0x560db130a540 handle_notify: r=0
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::watcher::Notifier: 0x560db130a540 handle_notify: pending=0
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::image_watcher::NotifyLockOwner: 0x560db220fa80 handle_notify: r=0
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::image_watcher::NotifyLockOwner: 0x560db220fa80 handle_notify client responded with r=-95
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::ImageWatcher: remove_async_request: [4159,94616807323264,43]
2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::Operations: handle_remote_request: r=-95
2023-04-10T20:36:07.774+0000 7f9375e15640 5 librbd::Operations: snap remove not supported by current lock owner
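For context, the lock owner and watchers that show up in log lines like these can also be inspected from the CLI. A sketch, not part of the original report; the image spec is taken from the reproduction steps below:

```shell
# Show image state, including watchers and (if held) the exclusive lock owner.
rbd --cluster site-a status pool1/test-demote-sb

# List the exclusive lock and the client entity holding it.
rbd --cluster site-a lock ls pool1/test-demote-sb
```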
Next, rbd-mirror attempts to acquire the exclusive lock so it can perform the operation locally.
The lock request (which fails):
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ManagedLock: 0x560db12ff698 acquire_lock:
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ManagedLock: 0x560db12ff698 send_acquire_lock:
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ExclusiveLock: 0x560db12ff680 pre_acquire_lock_handler
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 send_prepare_lock:
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ImageState: 0x560db13120b0 prepare_lock
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ImageState: 0x560db13120b0 0x560db13120b0 send_prepare_lock_unlock
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 handle_prepare_lock: r=0
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 send_flush_notifies:
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 handle_flush_notifies:
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ManagedLock: 0x560db12ff698 handle_pre_acquire_lock: r=0
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_get_locker:
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 send_get_lockers:
2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 handle_get_lockers: r=0
2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 handle_get_lockers: retrieved exclusive locker: client.4162@192.168.195.10:0/2159339200
2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 finish: r=0
2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_get_locker: r=0
2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_lock: entity=client.4159, cookie=auto 94616807323264
2023-04-10T20:36:07.840+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_lock: r=-16
2023-04-10T20:36:07.840+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_break_lock:
2023-04-10T20:36:07.840+0000 7f9375e15640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 send_get_watchers:
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: r=0
2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: watcher=[addr=192.168.195.11:0/3825279451, entity=client.4159]
2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: watcher=[addr=192.168.195.10:0/2159339200, entity=client.4162]
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: lock owner is still alive
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 finish: r=-11
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_break_lock: r=-11
2023-04-10T20:36:07.872+0000 7f9375614640 5 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_break_lock: lock owner is still alive
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::ManagedLock: 0x560db12ff698 handle_acquire_lock: r=-11
2023-04-10T20:36:07.872+0000 7f9375614640 5 librbd::ManagedLock: 0x560db12ff698 handle_acquire_lock: unable to acquire exclusive lock
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::ExclusiveLock: 0x560db12ff680 post_acquire_lock_handler: r=-11
Then a notify is sent to the lock owner asking it to release the lock, and the lock is released:
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 notify request lock
2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::image_watcher::NotifyLockOwner: 0x560db2211600 send_notify
2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::watcher::Notifier: 0x560db130a540 notify: pending=1
2023-04-10T20:36:07.904+0000 7f9375e15640 5 librbd::Watcher: 0x560db130a480 notifications_blocked: blocked=0
2023-04-10T20:36:07.904+0000 7f9375e15640 10 librbd::Watcher::C_NotifyAck 0x560db156d860 C_NotifyAck: id=292057776636, handle=94616807323264
2023-04-10T20:36:07.904+0000 7f9375e15640 10 librbd::ImageWatcher: 0x560db130a480 exclusive lock requested
2023-04-10T20:36:07.904+0000 7f9375e15640 10 librbd::Watcher::C_NotifyAck 0x560db156d860 finish: r=0
2023-04-10T20:36:07.933+0000 7f9375614640 5 librbd::Watcher: 0x560db130a480 notifications_blocked: blocked=0
2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::Watcher::C_NotifyAck 0x560db15e5590 C_NotifyAck: id=292057776637, handle=94616807323264
2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 exclusive lock released
2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [0,0]
Again, a lock request is issued by the rbd-mirror client:
2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ManagedLock: 0x560db12ff698 send_acquire_lock:
2023-04-10T20:36:07.935+0000 7f9375614640 15 librbd::ImageWatcher: 0x560db130a480 requesting exclusive lock
Then it sends a get_locker request, and by the time it gets a response the writer has already re-acquired the lock on the image, so the process starts over.
2023-04-10T20:36:07.935+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 send_get_lockers:
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 handle_get_lockers: r=0
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 handle_get_lockers: retrieved exclusive locker: client.4162@192.168.195.10:0/2159339200
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 finish: r=0
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_get_locker: r=0
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_lock: entity=client.4159, cookie=auto 94616807323264
2023-04-10T20:36:07.976+0000 7f9375614640 5 librbd::Watcher: 0x560db130a480 notifications_blocked: blocked=0
2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::Watcher::C_NotifyAck 0x560db15e5590 C_NotifyAck: id=292057776638, handle=94616807323264
2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 image exclusively locked announcement
2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [4162,18446462598732840961]
Notice that between
2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [0,0]
and the subsequent send_lock/handle_lock exchange, there is a delta of 30+ ms, which in this case is plenty of time to lose the race for the lock:
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_lock: entity=client.4159, cookie=auto 94616807323264
2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 image exclusively locked announcement
2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [4162,18446462598732840961]
2023-04-10T20:36:08.004+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_lock: r=-16
Files
Updated by Christopher Hoffman about 1 year ago
- File old version added
- File m-namespace.sh added
Steps to reproduce:
1. Configured East/West sites with Open vSwitch (Example 2 from https://www.redhat.com/sysadmin/net-namespaces). The key part is placing the sites in different network namespaces.
Step 1 can be done by running the provided netns.sh.
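The netns.sh script is attached rather than reproduced here; a minimal veth-based approximation of the namespace layout is sketched below. This is an assumption for illustration: the actual script follows the Open vSwitch example in the linked article, and the device names and addresses (taken from the log excerpts above) may differ.

```shell
# Two namespaces, "east" and "west", joined by a veth pair whose ends
# are named after the sites (hypothetical; netns.sh may do this via OVS).
ip netns add east
ip netns add west
ip link add east type veth peer name west
ip link set east netns east
ip link set west netns west

# Addresses matching the clients seen in the logs (192.168.195.10/.11).
ip netns exec east ip addr add 192.168.195.10/24 dev east
ip netns exec west ip addr add 192.168.195.11/24 dev west
ip netns exec east ip link set east up
ip netns exec west ip link set west up
```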
2. Started mstart on site-a on east and a separate mstart on site-b on west.
3. On east, created a single image with snapshot-based mirroring.
4. Set up a 1m mirror snapshot interval for the image.
Steps 2-4 can be done using "m-namespace.sh start"
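Steps 3-4 correspond roughly to the following rbd commands. This is a sketch: the image size is an assumption, and the exact invocations in m-namespace.sh may differ.

```shell
# Enable image-mode mirroring on the pool and create the test image (site-a).
rbd --cluster site-a mirror pool enable pool1 image
rbd --cluster site-a create --size 10G pool1/test-demote-sb

# Enable snapshot-based mirroring for the image.
rbd --cluster site-a mirror image enable pool1/test-demote-sb snapshot

# Take a mirror snapshot every minute.
rbd --cluster site-a mirror snapshot schedule add --pool pool1 --image test-demote-sb 1m
```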
5. On east, mapped the rbd image to a block device, formatted it with XFS and mounted it.
./bin/rbd --cluster site-a device map pool1/test-demote-sb
mkfs.xfs /dev/rbd0   # block device that was mapped above
mount /dev/rbd0 /mnt/latency1
6. On east, ran a continuous fio workload of random reads/writes (40 read IOPS, 10 write IOPS, 4K block size):
cat > /mnt/latency1/smallIO_test1 <<'EOF'
[global]
refill_buffers
time_based=1
size=5g
direct=1
group_reporting
ioengine=libaio
[workload]
rw=randrw
rate_iops=40,10
blocksize=4KB
#norandommap
iodepth=4
numjobs=1
runtime=2d
EOF
cd /mnt/latency1
fio smallIO_test1
7. "Injected" latency on the east (primary) network namespace using:
tc qdisc add dev east root netem delay 30ms
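To confirm the netem rule is active, and to remove it once the test is done, standard tc usage applies (not part of the original report):

```shell
# Verify the 30ms delay qdisc is installed on the east device.
tc qdisc show dev east

# Remove the latency injection when finished.
tc qdisc del dev east root netem
```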
Updated by Christopher Hoffman about 1 year ago
- File log_snippet_59393.log added
Updated by Christopher Hoffman about 1 year ago
- Status changed from New to In Progress
Updated by Christopher Hoffman about 1 year ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 51166
Updated by Ilya Dryomov about 1 year ago
- Assignee set to Christopher Hoffman
- Backport set to pacific,quincy
Updated by Ilya Dryomov 12 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 12 months ago
- Copied to Backport #59709: quincy: rbd-mirror cannot grab lock in time with latency added
Updated by Backport Bot 12 months ago
- Copied to Backport #59710: pacific: rbd-mirror cannot grab lock in time with latency added
Updated by Ilya Dryomov 12 months ago
- Related to Bug #59728: [test] single node "ip netns"-based latency injection for rbd-mirror added
Updated by Ilya Dryomov 12 months ago
- Related to Bug #61225: TestClsRbd.mirror_snapshot failure added
Updated by Backport Bot 10 months ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".