Project

General

Profile

Actions

Bug #56417

open

async request completion event may get lost in a race with timeout

Added by Ilya Dryomov almost 2 years ago. Updated about 1 year ago.

Status:
In Progress
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This race was encountered during mirror snapshot creation in rbd_support module, but it appears to be a generic async request handling issue:

snap_create() API is called:

2022-06-27T20:35:15.156+0000 7fac6c6bd700  5 librbd::Operations: 0x56173f7fcb40 snap_create: snap_name=.mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56

The exclusive lock is held by remote rbd-mirror in this case, so the request is proxied there:

2022-06-27T20:35:15.171+0000 7fac6c6bd700 10 librbd::ImageWatcher: 0x56173e7be600 async request: [14184,94657932580480,61463]
2022-06-27T20:35:15.171+0000 7fac6c6bd700 20 librbd::ImageWatcher: scheduling async request time out: [14184,94657932580480,61463]
2022-06-27T20:35:15.171+0000 7fac6cebe700 10 librbd::ImageWatcher: 0x56173e7be600 remote snap_create request: [14184,94657932580480,61463] [mirror state=primary, complete=0, mirror_peer_uuids=ee892e62-9ea9-4075-a91c-5253869b4eb7, clean_since_snap_id=head] .mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56 0
2022-06-27T20:35:15.648+0000 7fac6cebe700 20 librbd::ImageWatcher: 0x56173e7be600 request progress: [14184,94657932580480,61463] @ 1/10
2022-06-27T20:35:15.648+0000 7fac6cebe700 20 librbd::ImageWatcher: scheduling async request time out: [14184,94657932580480,61463]

2022-06-27T20:35:44.156+0000 7fac6cebe700  5 librbd::Watcher: 0x56173e7be600 notifications_blocked: blocked=0
2022-06-27T20:35:44.156+0000 7fac6cebe700 10 librbd::Watcher::C_NotifyAck 0x56174a273b80 C_NotifyAck: id=158239480121843, handle=94657932580480                 <---------- likely .mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56 snapshot creation
2022-06-27T20:35:44.156+0000 7fac6cebe700 10 librbd::ImageWatcher: 0x56173e7be600 image header updated
2022-06-27T20:35:44.156+0000 7fac6cebe700 20 librbd::ImageState: 0x56173be51f80 handle_update_notification: refresh_seq = 1, last_refresh = 0
2022-06-27T20:35:44.156+0000 7fac6cebe700 20 librbd::ImageState: 0x56173ef97260 ImageUpdateWatchers::notify
2022-06-27T20:35:44.156+0000 7fac6cebe700 20 librbd::ImageState: 0x56173be51f80 flush_update_watchers
2022-06-27T20:35:44.156+0000 7fac6cebe700 20 librbd::ImageState: 0x56173ef97260 ImageUpdateWatchers::flush
2022-06-27T20:35:44.156+0000 7fac6cebe700 20 librbd::ImageState: 0x56173ef97260 ImageUpdateWatchers::flush: completing flush
2022-06-27T20:35:44.156+0000 7fac6cebe700 10 librbd::ImageWatcher: 0x561749825250 C_ResponseMessage: r=0
2022-06-27T20:35:44.156+0000 7fac6cebe700 10 librbd::Watcher::C_NotifyAck 0x56174a273b80 finish: r=0

2022-06-27T20:35:45.648+0000 7fac45eab700  5 librbd::ImageWatcher: async request timed out: [14184,94657932580480,61463]                                        <---------- ImageWatcher::async_request_timed_out() calls remove_async_request(), request is erased from m_async_requests

2022-06-27T20:35:45.657+0000 7fac6c6bd700  5 librbd::Watcher: 0x56173e7be600 notifications_blocked: blocked=0
2022-06-27T20:35:45.657+0000 7fac6c6bd700 10 librbd::Watcher::C_NotifyAck 0x56174a19d7c0 C_NotifyAck: id=158243775089140, handle=94657932580480                 <---------- likely AsyncCompletePayload but request is not on m_async_requests so it is not completed, completion event is lost
2022-06-27T20:35:45.657+0000 7fac6c6bd700 10 librbd::Watcher::C_NotifyAck 0x56174a19d7c0 finish: r=0

From here the async request is repeatedly resent, only to be timed out:

2022-06-27T20:35:45.663+0000 7fac6c6bd700 10 librbd::ImageWatcher: 0x56173e7be600 async request: [14184,94657932580480,61463]
2022-06-27T20:35:45.663+0000 7fac6c6bd700 20 librbd::ImageWatcher: scheduling async request time out: [14184,94657932580480,61463]
2022-06-27T20:35:45.663+0000 7fac6cebe700 10 librbd::ImageWatcher: 0x56173e7be600 remote snap_create request: [14184,94657932580480,61463] [mirror state=primary, complete=0, mirror_peer_uuids=ee892e62-9ea9-4075-a91c-5253869b4eb7, clean_since_snap_id=head] .mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56 0

2022-06-27T20:36:15.664+0000 7fac45eab700  5 librbd::ImageWatcher: async request timed out: [14184,94657932580480,61463]

2022-06-27T20:36:15.680+0000 7fac6cebe700 10 librbd::ImageWatcher: 0x56173e7be600 async request: [14184,94657932580480,61463]
2022-06-27T20:36:15.680+0000 7fac6cebe700 20 librbd::ImageWatcher: scheduling async request time out: [14184,94657932580480,61463]
2022-06-27T20:36:15.681+0000 7fac6c6bd700 10 librbd::ImageWatcher: 0x56173e7be600 remote snap_create request: [14184,94657932580480,61463] [mirror state=primary, complete=0, mirror_peer_uuids=ee892e62-9ea9-4075-a91c-5253869b4eb7, clean_since_snap_id=head] .mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56 0

[...]

This is because completed async requests tracking code introduced in https://github.com/ceph/ceph/pull/37012 and https://github.com/ceph/ceph/pull/37316 just drops requests that have already completed successfully on the floor: SnapCreatePayload is acked with r=0, AsyncCompletePayload is not resent.

As a result, snap_create() API ends up being blocked indefinitely even though the snapshot is actually created:

$ rbd snap ls --all mirror_test/set_1_image_44
SNAPID  NAME                                                                                       SIZE     PROTECTED  TIMESTAMP                 NAMESPACE                                                         
178585  .mirror.primary.d49e2338-c31c-4f22-861d-2ee9810d4d6f.2e909883-c7c2-48dc-bf23-f36c087ebb56  120 GiB             Mon Jun 27 20:35:20 2022  mirror (primary peer_uuids:[ee892e62-9ea9-4075-a91c-5253869b4eb7])
Actions #1

Updated by Ilya Dryomov over 1 year ago

  • Assignee set to Prasanna Kumar Kalever
Actions #2

Updated by Prasanna Kumar Kalever over 1 year ago

This is because completed async requests tracking code introduced in https://github.com/ceph/ceph/pull/37012 and https://github.com/ceph/ceph/pull/37316 just drops requests that have already completed successfully on the floor: SnapCreatePayload is acked with r=0, AsyncCompletePayload is not resent.

As a result, snap_create() API ends up being blocked indefinitely even though the snapshot is actually created:

@Mykola Golub, Sorry! I doubt I get the problem completely. Could you please elaborate on the problem and how the async event is expected to work and is working currently?

I tried reproducing this, but couldn't hit it locally. How should I be able to reproduce this, could you please provide the steps if there are any?

Also, Have we thought about any possible solution yet?

Actions #3

Updated by Ilya Dryomov over 1 year ago

Prasanna Kumar Kalever wrote:

I tried reproducing this, but couldn't hit it locally. How should I be able to reproduce this, could you please provide the steps if there are any?

Hi Prasanna,

It's a very tight race so you are unlikely to be able to reproduce it without a code change such as inserting some delays in some places. The async request needs to be time out, so one of the delays would have to be ~30 seconds.

Also, Have we thought about any possible solution yet?

One possible solution might be to massage ImageWatcher::async_request_timed_out() and the surrounding code so that the short window where the request is not on m_async_requests (and therefore can't be found, leading to AsyncCompletePayload message getting lost) is avoided. Another might be actually resend AsyncCompletePayload message, if feasible.

Actions #4

Updated by Ilya Dryomov about 1 year ago

  • Status changed from New to In Progress
Actions #5

Updated by Prasanna Kumar Kalever about 1 year ago

  • Pull request ID set to 49835
Actions

Also available in: Atom PDF