Bug #38463

open

[rbd-mirror] thrasher can lead to "File exists" error when creating image

Added by Jason Dillaman about 5 years ago. Updated about 5 years ago.

Status: In Progress
Priority: Normal
Assignee: Jason Dillaman
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-02-23T16:21:44.059 INFO:tasks.rbd_mirror.cluster2.client.mirror.0.smithi033.stderr:2019-02-23 16:21:44.057 7f6746ffd700 -1 rbd::mirror::image_replayer::CreateImageRequest: 0x7f669a797a90 handle_create_image: failed to create local image: (17) File exists
2019-02-23T16:21:44.059 INFO:tasks.rbd_mirror.cluster2.client.mirror.0.smithi033.stderr:2019-02-23 16:21:44.057 7f6746ffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f6764036570 handle_create_local_image: failed to create local image: (17) File exists
2019-02-23T16:21:44.063 INFO:tasks.rbd_mirror.cluster2.client.mirror.0.smithi033.stderr:2019-02-23 16:21:44.057 7f6746ffd700 -1 rbd::mirror::image_replayer::CreateImageRequest: 0x7f669a7a83b0 handle_create_image: failed to create local image: (17) File exists
2019-02-23T16:21:44.064 INFO:tasks.rbd_mirror.cluster2.client.mirror.0.smithi033.stderr:2019-02-23 16:21:44.057 7f6746ffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f67640dab10 handle_create_local_image: failed to create local image: (17) File exists
2019-02-23T16:21:44.064 INFO:tasks.rbd_mirror.cluster2.client.mirror.0.smithi033.stderr:2019-02-23 16:21:44.057 7f6746ffd700 -1 rbd::mirror::image_replayer::CreateImageRequest: 0x7f669a7e3af0 handle_create_image: failed to create local image: (17) File exists
2019-02-23T16:21:44.064 INFO:tasks.rbd_mirror.cluster2.client.mirror.0.smithi033.stderr:2019-02-23 16:21:44.057 7f6746ffd700 -1 rbd::mirror::image_replayer::BootstrapRequest: 0x7f6764065130 handle_create_local_image: failed to create local image: (17) File exists

http://qa-proxy.ceph.com/teuthology/jdillaman-2019-02-23_10:41:10-rbd-wip-jd-testing-distro-basic-smithi/3631676/teuthology.log

Actions #1

Updated by Jason Dillaman about 5 years ago

Logs indicate that the leader heartbeat was delayed. It should have timed out in 5 seconds, but the notifier didn't receive the timeout for over 30 seconds. It appears that the librados AIO callback thread was busy, including with lots of synchronous calls to validate the RBD data pool in the create image state machine (see issue #38500).

Going to make the validation logic asynchronous (and avoid ping-ponging the validation as much). Might also need to increase the dead leader timeout (currently it only takes 30 seconds to blacklist a leader).
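
To illustrate why blocking work on a single callback thread can delay a timeout this much, here is a minimal simulation in Python. It is not rbd-mirror or librados code: the function names, the six queued validations, and the 6 second cost per validation are hypothetical values chosen for illustration. Only the 5 second heartbeat timeout comes from the analysis above.

```python
def callback_start_times(callbacks):
    """Simulate one callback thread that runs callbacks in the order they become ready.

    callbacks: iterable of (name, ready_at, busy_for) tuples, times in seconds.
    Returns a dict mapping each name to the time its callback starts running.
    """
    clock = 0.0
    started = {}
    # sorted() is stable, so callbacks that become ready together keep their order.
    for name, ready_at, busy_for in sorted(callbacks, key=lambda c: c[1]):
        clock = max(clock, ready_at)   # the thread idles until work is ready
        started[name] = clock
        clock += busy_for              # a blocking callback holds the thread
    return started


def scenario(validation_cost):
    # Hypothetical load: six pool validations queued at t=0, plus the
    # heartbeat timeout that becomes due at t=5.
    work = [(f"validate_{i}", 0.0, validation_cost) for i in range(6)]
    work.append(("heartbeat_timeout", 5.0, 0.0))
    return callback_start_times(work)


blocking = scenario(validation_cost=6.0)   # synchronous validation on the callback thread
offloaded = scenario(validation_cost=0.0)  # validation only submits async work and returns

print(blocking["heartbeat_timeout"])   # 36.0
print(offloaded["heartbeat_timeout"])  # 5.0
```

In this model, the timeout callback is due at 5 seconds but cannot start until the queued blocking validations finish. Once the validations no longer hold the thread, it runs on time.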
