Bug #24744

open

rgw: index wrongly deleted when put raced with list

Added by Tianshan Qu almost 6 years ago. Updated almost 3 years ago.

Status: Fix Under Review
Priority: Normal
Target version:
% Done: 0%
Source:
Tags:
Backport: octopus, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Similar to http://tracker.ceph.com/issues/22555, a particular sequence of operations can trigger a new variant of the problem.

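IO sequence:
1. PUT: index prepare
2. LIST: reads the stale index entry
3. LIST: check_disk_state finds that the head object does not exist
4. PUT: writes the head object
5. PUT: index complete
6. LIST: aio_operate sends dir_suggest_changes with CEPH_RGW_REMOVE

Step 6 deletes the index entry, even though the PUT has completed successfully.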


Files

ceph_repro.py (5.61 KB), added by Joseph Victor, 06/21/2021 08:35 PM
Actions #2

Updated by Abhishek Lekshmanan almost 6 years ago

  • Status changed from New to Fix Under Review
Actions #3

Updated by Abhishek Lekshmanan almost 6 years ago

  • Status changed from Fix Under Review to 7
  • Assignee set to J. Eric Ivancich
Actions #4

Updated by Samu Kallio almost 5 years ago

This race still exists in the latest mimic (v13.2.6); we are hitting it several times a day in a production setup. The sequence of events seems to be:

1. PUT does bucket_prepare_op, creates index entry with exists=false and pending_map with CLS_RGW_OP_ADD
2. LIST sees the in-flight index entry while the head object does not exist yet and the index entry has exists=false, so it suggests CEPH_RGW_REMOVE
3. PUT does bucket_complete_op, index entry updated, pending_map cleared
4. rgw_dir_suggest_changes deletes index entry

It seems that rgw_dir_suggest_changes tries to avoid interfering with in-flight ops by first checking that the pending_map is empty: "if (cur_disk.pending_map.empty()) { ... }", but that does not account for ops that have just completed successfully.
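The four steps above can be replayed in a toy model. This is only a sketch to illustrate the ordering problem, not Ceph code: the classes ToyBucketIndex and IndexEntry, the method names, and the run_race helper are all hypothetical, and the only behavior borrowed from the description is that a suggested removal is applied whenever the pending map is empty.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    # Hypothetical stand-in for a bucket index entry.
    exists: bool = False
    pending_map: dict = field(default_factory=dict)  # tag -> pending op


class ToyBucketIndex:
    """Toy model of a bucket index shard (assumption, not the real cls_rgw)."""

    def __init__(self):
        self.entries = {}

    def prepare_op(self, key, tag):
        # Step 1: create the entry with exists=False and record a pending ADD.
        entry = self.entries.setdefault(key, IndexEntry())
        entry.pending_map[tag] = "ADD"

    def complete_op(self, key, tag):
        # Step 3: mark the entry as existing and clear the pending op.
        entry = self.entries[key]
        entry.pending_map.pop(tag, None)
        entry.exists = True

    def list_entry(self, key):
        # Return a snapshot, as a concurrent LIST would see it.
        entry = self.entries[key]
        return IndexEntry(entry.exists, dict(entry.pending_map))

    def suggest_remove(self, key):
        # Step 4: the only guard is "pending_map is empty", which is also
        # true right after a PUT has completed successfully.
        current = self.entries.get(key)
        if current is not None and not current.pending_map:
            del self.entries[key]
            return True
        return False


def run_race():
    index = ToyBucketIndex()
    head_objects = set()

    index.prepare_op("obj", "tag1")            # PUT: prepare
    stale = index.list_entry("obj")            # LIST: stale snapshot
    wants_remove = not stale.exists and "obj" not in head_objects

    head_objects.add("obj")                    # PUT: write head object
    index.complete_op("obj", "tag1")           # PUT: complete

    if wants_remove:
        index.suggest_remove("obj")            # LIST: late REMOVE suggestion

    # (head object present, index entry present)
    return "obj" in head_objects, "obj" in index.entries
```

In this model run_race() ends with the head object present and the index entry gone, which is the orphan described in this report. A guard based only on an empty pending map cannot tell "no op was ever in flight" apart from "an op just finished".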

Actions #5

Updated by Tianshan Qu almost 5 years ago

I think you are right, and the issue still exists in master. I will re-push the original fix.

Actions #7

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 7 to Fix Under Review
Actions #8

Updated by J. Eric Ivancich over 4 years ago

  • Backport set to nautilus,mimic
  • Pull request ID set to 28654
Actions #9

Updated by Joseph Victor almost 3 years ago

It's fairly easy to reproduce the bug, for what it's worth: I attached a repro script. Copying the file to itself makes it orders of magnitude harder to hit, but you can still hit it. The orphan can be found using rgw-orphan-list, which can be used as a workaround.

Actions #10

Updated by J. Eric Ivancich almost 3 years ago

  • Pull request ID changed from 28654 to 41978

The tests needed to be updated, so although the original PR was https://github.com/ceph/ceph/pull/28654 I copied it and added a commit to update the tests, so it's now https://github.com/ceph/ceph/pull/41978.

Actions #11

Updated by J. Eric Ivancich almost 3 years ago

  • Target version set to v17.0.0
  • Backport changed from nautilus,mimic to octopus, pacific