Bug #56417
openasync request completion event may get lost in a race with timeout
0%
Description
This race was encountered during mirror snapshot creation in rbd_support module, but it appears to be a generic async request handling issue:
snap_create() API is called:
2022-06-27T20:35:15.156+0000 7fac6c6bd700 5 librbd::Operations: 0x56173f7fcb40 snap_create: snap_name=.mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56
The exclusive lock is held by remote rbd-mirror in this case, so the request is proxied there:
2022-06-27T20:35:15.171+0000 7fac6c6bd700 10 librbd::ImageWatcher: 0x56173e7be600 async request: [14184,94657932580480,61463] 2022-06-27T20:35:15.171+0000 7fac6c6bd700 20 librbd::ImageWatcher: scheduling async request time out: [14184,94657932580480,61463] 2022-06-27T20:35:15.171+0000 7fac6cebe700 10 librbd::ImageWatcher: 0x56173e7be600 remote snap_create request: [14184,94657932580480,61463] [mirror state=primary, complete=0, mirror_peer_uuids=ee892e62-9ea9-4075-a91c-5253869b4eb7, clean_since_snap_id=head] .mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56 0 2022-06-27T20:35:15.648+0000 7fac6cebe700 20 librbd::ImageWatcher: 0x56173e7be600 request progress: [14184,94657932580480,61463] @ 1/10 2022-06-27T20:35:15.648+0000 7fac6cebe700 20 librbd::ImageWatcher: scheduling async request time out: [14184,94657932580480,61463] 2022-06-27T20:35:44.156+0000 7fac6cebe700 5 librbd::Watcher: 0x56173e7be600 notifications_blocked: blocked=0 2022-06-27T20:35:44.156+0000 7fac6cebe700 10 librbd::Watcher::C_NotifyAck 0x56174a273b80 C_NotifyAck: id=158239480121843, handle=94657932580480 <---------- likely .mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56 snapshot creation 2022-06-27T20:35:44.156+0000 7fac6cebe700 10 librbd::ImageWatcher: 0x56173e7be600 image header updated 2022-06-27T20:35:44.156+0000 7fac6cebe700 20 librbd::ImageState: 0x56173be51f80 handle_update_notification: refresh_seq = 1, last_refresh = 0 2022-06-27T20:35:44.156+0000 7fac6cebe700 20 librbd::ImageState: 0x56173ef97260 ImageUpdateWatchers::notify 2022-06-27T20:35:44.156+0000 7fac6cebe700 20 librbd::ImageState: 0x56173be51f80 flush_update_watchers 2022-06-27T20:35:44.156+0000 7fac6cebe700 20 librbd::ImageState: 0x56173ef97260 ImageUpdateWatchers::flush 2022-06-27T20:35:44.156+0000 7fac6cebe700 20 librbd::ImageState: 0x56173ef97260 ImageUpdateWatchers::flush: completing flush 2022-06-27T20:35:44.156+0000 7fac6cebe700 10 librbd::ImageWatcher: 0x561749825250 C_ResponseMessage: r=0 2022-06-27T20:35:44.156+0000 7fac6cebe700 10 librbd::Watcher::C_NotifyAck 0x56174a273b80 finish: r=0 2022-06-27T20:35:45.648+0000 7fac45eab700 5 librbd::ImageWatcher: async request timed out: [14184,94657932580480,61463] <---------- ImageWatcher::async_request_timed_out() calls remove_async_request(), request is erased from m_async_requests 2022-06-27T20:35:45.657+0000 7fac6c6bd700 5 librbd::Watcher: 0x56173e7be600 notifications_blocked: blocked=0 2022-06-27T20:35:45.657+0000 7fac6c6bd700 10 librbd::Watcher::C_NotifyAck 0x56174a19d7c0 C_NotifyAck: id=158243775089140, handle=94657932580480 <---------- likely AsyncCompletePayload but request is not on m_async_requests so it is not completed, completion event is lost 2022-06-27T20:35:45.657+0000 7fac6c6bd700 10 librbd::Watcher::C_NotifyAck 0x56174a19d7c0 finish: r=0
From here the async request is repeatedly resent, only to be timed out:
2022-06-27T20:35:45.663+0000 7fac6c6bd700 10 librbd::ImageWatcher: 0x56173e7be600 async request: [14184,94657932580480,61463] 2022-06-27T20:35:45.663+0000 7fac6c6bd700 20 librbd::ImageWatcher: scheduling async request time out: [14184,94657932580480,61463] 2022-06-27T20:35:45.663+0000 7fac6cebe700 10 librbd::ImageWatcher: 0x56173e7be600 remote snap_create request: [14184,94657932580480,61463] [mirror state=primary, complete=0, mirror_peer_uuids=ee892e62-9ea9-4075-a91c-5253869b4eb7, clean_since_snap_id=head] .mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56 0 2022-06-27T20:36:15.664+0000 7fac45eab700 5 librbd::ImageWatcher: async request timed out: [14184,94657932580480,61463] 2022-06-27T20:36:15.680+0000 7fac6cebe700 10 librbd::ImageWatcher: 0x56173e7be600 async request: [14184,94657932580480,61463] 2022-06-27T20:36:15.680+0000 7fac6cebe700 20 librbd::ImageWatcher: scheduling async request time out: [14184,94657932580480,61463] 2022-06-27T20:36:15.681+0000 7fac6c6bd700 10 librbd::ImageWatcher: 0x56173e7be600 remote snap_create request: [14184,94657932580480,61463] [mirror state=primary, complete=0, mirror_peer_uuids=ee892e62-9ea9-4075-a91c-5253869b4eb7, clean_since_snap_id=head] .mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56 0 [...]
This is because completed async requests tracking code introduced in https://github.com/ceph/ceph/pull/37012 and https://github.com/ceph/ceph/pull/37316 just drops requests that have already completed successfully on the floor: SnapCreatePayload is acked with r=0, AsyncCompletePayload is not resent.
As a result, snap_create() API ends up being blocked indefinitely even though the snapshot is actually created:
$ rbd snap ls --all mirror_test/set_1_image_44 SNAPID NAME SIZE PROTECTED TIMESTAMP NAMESPACE 178585 .mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56 120 GiB Mon Jun 27 20:35:20 2022 mirror (primary peer_uuids:[ee892e62-9ea9-4075-a91c-5253869b4eb7])