Bug #59393
closedrbd-mirror cannot grab lock in time with latency
0%
Description
In the case below, 30 ms latency is added on site-A. Image is being written on site-A and copied to site-B via rbd-mirror.
When rbd-mirror finishes transferring snapshot it cleans up old snapshot on source image via snap-remove.
If there is an active writer via krbd client, a proxied op is sent to that client to perform action.
In this case, krbd doesn't have snap-remove implemented to handle the operation locally.
You can see this interaction below:
2023-04-10T20:36:07.742+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 remote snap_remove request: .mirror.primary.46e80e27-6a14-4f31-a122-10a419b7f0e5.2f8744fa-070c-4981-8b90-0e494956f465^M 2023-04-10T20:36:07.742+0000 7f9375614640 20 librbd::ExclusiveLock: 0x560db12ff680 accept_request=0 (request_type=0)^M 2023-04-10T20:36:07.742+0000 7f9375614640 10 librbd::Watcher::C_NotifyAck 0x560db21915e0 finish: r=0^M 2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::watcher::Notifier: 0x560db130a540 handle_notify: r=0^M 2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::watcher::Notifier: 0x560db130a540 handle_notify: pending=0^M 2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::image_watcher::NotifyLockOwner: 0x560db220fa80 handle_notify: r=0^M 2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::image_watcher::NotifyLockOwner: 0x560db220fa80 handle_notify client responded with r=-95^M 2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::ImageWatcher: remove_async_request: [4159,94616807323264,43]^M 2023-04-10T20:36:07.774+0000 7f9375e15640 20 librbd::Operations: handle_remote_request: r=-95^M 2023-04-10T20:36:07.774+0000 7f9375e15640 5 librbd::Operations: snap remove not supported by current lock owner^M
Next rbd-mirror will attempt to acquire lock to perform operation locally
The lock request (fails):
2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ManagedLock: 0x560db12ff698 acquire_lock: ^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ManagedLock: 0x560db12ff698 send_acquire_lock: ^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ExclusiveLock: 0x560db12ff680 pre_acquire_lock_handler^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 send_prepare_lock: ^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ImageState: 0x560db13120b0 prepare_lock^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ImageState: 0x560db13120b0 0x560db13120b0 send_prepare_lock_unlock^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 handle_prepare_lock: r=0^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 send_flush_notifies: ^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::exclusive_lock::PreAcquireRequest: 0x560db2151f80 handle_flush_notifies: ^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::ManagedLock: 0x560db12ff698 handle_pre_acquire_lock: r=0^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_get_locker: ^M 2023-04-10T20:36:07.774+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 send_get_lockers: ^M 2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 handle_get_lockers: r=0^M 2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 handle_get_lockers: retrieved exclusive locker: client.4162@192.168.195.10:0/2159339200^M 2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d1f0 finish: r=0^M 2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_get_locker: r=0^M 2023-04-10T20:36:07.805+0000 7f9375614640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_lock: entity=client.4159, cookie=auto 94616807323264^M 2023-04-10T20:36:07.840+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_lock: r=-16^M 2023-04-10T20:36:07.840+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_break_lock: ^M 2023-04-10T20:36:07.840+0000 7f9375e15640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 send_get_watchers: ^M 2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: r=0^M 2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: watcher=[addr=192.168.195.11:0/3825279451, entity=client.4159]^M 2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: watcher=[addr=192.168.195.10:0/2159339200, entity=client.4162]^M 2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 handle_get_watchers: lock owner is still alive^M 2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::BreakRequest: 0x560db21dd400 finish: r=-11^M 2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_break_lock: r=-11^M 2023-04-10T20:36:07.872+0000 7f9375614640 5 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_break_lock: lock owner is still alive^M 2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::ManagedLock: 0x560db12ff698 handle_acquire_lock: r=-11^M 2023-04-10T20:36:07.872+0000 7f9375614640 5 librbd::ManagedLock: 0x560db12ff698 handle_acquire_lock: unable to acquire exclusive lock^M 2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::ExclusiveLock: 0x560db12ff680 post_acquire_lock_handler: r=-11^M
Then a notify is sent to lock owner to release lock and lock gets released:
2023-04-10T20:36:07.872+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 notify request lock^M 2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::image_watcher::NotifyLockOwner: 0x560db2211600 send_notify^M 2023-04-10T20:36:07.872+0000 7f9375614640 20 librbd::watcher::Notifier: 0x560db130a540 notify: pending=1^M 2023-04-10T20:36:07.904+0000 7f9375e15640 5 librbd::Watcher: 0x560db130a480 notifications_blocked: blocked=0^M 2023-04-10T20:36:07.904+0000 7f9375e15640 10 librbd::Watcher::C_NotifyAck 0x560db156d860 C_NotifyAck: id=292057776636, handle=94616807323264^M 2023-04-10T20:36:07.904+0000 7f9375e15640 10 librbd::ImageWatcher: 0x560db130a480 exclusive lock requested^M 2023-04-10T20:36:07.904+0000 7f9375e15640 10 librbd::Watcher::C_NotifyAck 0x560db156d860 finish: r=0^M 2023-04-10T20:36:07.933+0000 7f9375614640 5 librbd::Watcher: 0x560db130a480 notifications_blocked: blocked=0^M 2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::Watcher::C_NotifyAck 0x560db15e5590 C_NotifyAck: id=292057776637, handle=94616807323264^M 2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 exclusive lock released^M 2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [0,0]^M
Again, lock request is issued by rbd-mirror client
2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ManagedLock: 0x560db12ff698 send_acquire_lock: ^M 2023-04-10T20:36:07.935+0000 7f9375614640 15 librbd::ImageWatcher: 0x560db130a480 requesting exclusive lock^M
Then it sends_get_locker request and when it gets a response the writer has already retaken the lock the image and the process will start over.
2023-04-10T20:36:07.935+0000 7f9375614640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 send_get_lockers: ^M 2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 handle_get_lockers: r=0^M 2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 handle_get_lockers: retrieved exclusive locker: client.4162@192.168.195.10:0/2159339200^M 2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::GetLockerRequest: 0x560db238d260 finish: r=0^M 2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_get_locker: r=0^M 2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_lock: entity=client.4159, cookie=auto 94616807323264^M 2023-04-10T20:36:07.976+0000 7f9375614640 5 librbd::Watcher: 0x560db130a480 notifications_blocked: blocked=0^M 2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::Watcher::C_NotifyAck 0x560db15e5590 C_NotifyAck: id=292057776638, handle=94616807323264^M 2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 image exclusively locked announcement^M 2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [4162,18446462598732840961]^M
Notice, between
2023-04-10T20:36:07.933+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [0,0]^M
and during send_lock/handle_lock, there is a 30ms+ delta which in this case is plenty of time to lose out on locking the image.
2023-04-10T20:36:07.970+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 send_lock: entity=client.4159, cookie=auto 94616807323264^M 2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 image exclusively locked announcement^M 2023-04-10T20:36:07.976+0000 7f9375614640 10 librbd::ImageWatcher: 0x560db130a480 current lock owner: [4162,18446462598732840961]^M 2023-04-10T20:36:08.004+0000 7f9375e15640 10 librbd::managed_lock::AcquireRequest: 0x560db163ae10 handle_lock: r=-16^M
Files