Project

General

Profile

Actions

Bug #52721

open

Bucket index entry not always created

Added by Pietro Di Marco over 2 years ago. Updated about 2 years ago.

Status:
Need More Info
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

BACKGROUND:
Ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
Ceph health is OK
Deployed using Rook
Using EC/REPLICATED pool with 3 RGW instances

DESCRIBE:
When use S3 LIST_OBJECT() to list, some uploaded objects cannot be listed.
I am sure that the object has been uploaded because I can get it with the exact key.
Moreover I experience this only with RGW instances > 1, with a single RGW instance the issue is not reported.

SOME ELSE:

These files are uploaded by S3a file committer. I doubt this has something to do with the issue because the HTTP requests to the RGW are identical both in case of object uploaded correctly and not

Logs even at debug-level=20 do not report any error/warning


REQUEST LOGGED BY RGW

"PUT /ceph-bkt-31c9b571-6495-4edd-a0f1-8f2cb9ff97cb/tlap_635_rewrites_5_iterations/2021-09-20T082916.979Z/rewrite_1.parquet/part-00005-0bb55c0f-755f-4299-80ee-142d65e38eb4-c000.snappy.parquet HTTP/1.1" 
200 238 - "Hadoop 2.8.5, aws-sdk-java/1.11.751 Linux/5.6.11-200.fc31.x86_64 OpenJDK_64-Bit_Server_VM/25.232-b09 
java/1.8.0_232 scala/2.11.12 vendor/Oracle_Corporation com.amazonaws.services.s3.transfer.TransferManager/1.11.751" - latency=0.167005762s
LIST OF COMMANDS ISSUED AND RESULTS:

radosgw-admin bi list --bucket=ceph-bkt-31c9b571-6495-4edd-a0f1-8f2cb9ff97cb | grep part-00005-0bb55c0f-755f-4299-80ee-142d65e38eb4-c000.snappy.parquet

2021-09-20T15:50:10.647+0000 7f9bc5581340  1 librados: starting msgr at
2021-09-20T15:50:10.647+0000 7f9bc5581340  1 librados: starting objecter
2021-09-20T15:50:10.647+0000 7f9bc5581340  1 librados: setting wanted keys
2021-09-20T15:50:10.647+0000 7f9bc5581340  1 librados: calling monclient init
2021-09-20T15:50:10.648+0000 7f9bc5581340  1 librados: init done
2021-09-20T15:50:11.337+0000 7f9bc5581340  2 all 8 watchers are set, enabling cache
2021-09-20T15:50:11.399+0000 7f9b86fe5700  2 rgw data changes log: RGWDataChangesLog::ChangesRenewThread: start
2021-09-20T15:50:11.405+0000 7f9bc5581340  1 librados: starting msgr at
2021-09-20T15:50:11.405+0000 7f9bc5581340  1 librados: starting objecter
2021-09-20T15:50:11.405+0000 7f9bc5581340  1 librados: setting wanted keys
2021-09-20T15:50:11.405+0000 7f9bc5581340  1 librados: calling monclient init
2021-09-20T15:50:11.408+0000 7f9bc5581340  1 librados: init done
2021-09-20T15:50:13.777+0000 7f9bc5581340  2 removed watcher, disabling cache
2021-09-20T15:50:14.202+0000 7f9b86fe5700  2 rgw data changes log: RGWDataChangesLog::ChangesRenewThread: start
2021-09-20T15:50:14.285+0000 7f9bc5581340  1 librados: shutdown
2021-09-20T15:50:14.286+0000 7f9bc5581340  1 librados: shutdown

FOR A FILE UPLOADED CORRECTLY:

radosgw-admin bi list --bucket=ceph-bkt-31c9b571-6495-4edd-a0f1-8f2cb9ff97cb | grep part-00004-0bb55c0f-755f-4299-80ee-142d65e38eb4-c000.snappy.parquet
2021-09-20T15:51:14.814+0000 7f0f1820d340  1 librados: starting msgr at
2021-09-20T15:51:14.814+0000 7f0f1820d340  1 librados: starting objecter
2021-09-20T15:51:14.814+0000 7f0f1820d340  1 librados: setting wanted keys
2021-09-20T15:51:14.814+0000 7f0f1820d340  1 librados: calling monclient init
2021-09-20T15:51:14.815+0000 7f0f1820d340  1 librados: init done
2021-09-20T15:51:15.502+0000 7f0f1820d340  2 all 8 watchers are set, enabling cache
2021-09-20T15:51:15.593+0000 7f0eda7e4700  2 rgw data changes log: RGWDataChangesLog::ChangesRenewThread: start
2021-09-20T15:51:15.598+0000 7f0f1820d340  1 librados: starting msgr at
2021-09-20T15:51:15.598+0000 7f0f1820d340  1 librados: starting objecter
2021-09-20T15:51:15.598+0000 7f0f1820d340  1 librados: setting wanted keys
2021-09-20T15:51:15.598+0000 7f0f1820d340  1 librados: calling monclient init
2021-09-20T15:51:15.600+0000 7f0f1820d340  1 librados: init done
        "idx": "tlap_635_rewrites_5_iterations/2021-09-20T082916.979Z/rewrite_1.parquet/part-00004-0bb55c0f-755f-4299-80ee-142d65e38eb4-c000.snappy.parquet",
            "name": "tlap_635_rewrites_5_iterations/2021-09-20T082916.979Z/rewrite_1.parquet/part-00004-0bb55c0f-755f-4299-80ee-142d65e38eb4-c000.snappy.parquet",
2021-09-20T15:51:17.850+0000 7f0f1820d340  2 removed watcher, disabling cache
2021-09-20T15:51:18.544+0000 7f0eda7e4700  2 rgw data changes log: RGWDataChangesLog::ChangesRenewThread: start
2021-09-20T15:51:18.568+0000 7f0f1820d340  1 librados: shutdown
2021-09-20T15:51:18.569+0000 7f0f1820d340  1 librados: shutdown

OBJ EXISTS!

 rados ls -p my-store.rgw.buckets.data | grep part-00005-0bb55c0f-755f-4299-80ee-142d65e38eb4-c000.snappy.parquet
2021-09-20T15:53:32.609+0000 7fa44c1edd00  1 librados: starting msgr at
2021-09-20T15:53:32.609+0000 7fa44c1edd00  1 librados: starting objecter
2021-09-20T15:53:32.609+0000 7fa44c1edd00  1 librados: setting wanted keys
2021-09-20T15:53:32.609+0000 7fa44c1edd00  1 librados: calling monclient init
2021-09-20T15:53:32.611+0000 7fa44c1edd00  1 librados: init done
1cabf71f-2fe9-4187-a928-537277185b62.51427781.1_tlap_635_rewrites_5_iterations/2021-09-20T082916.979Z/rewrite_1.parquet/part-00005-0bb55c0f-755f-4299-80ee-142d65e38eb4-c000.snappy.parquet
2021-09-20T15:53:34.024+0000 7fa44c1edd00  1 librados: shutdown

OMAP KEY NOT PRESENT FOR THE OBJECT:

rados listomapkeys -p my-store.rgw.buckets.index .dir.1cabf71f-2fe9-4187-a928-537277185b62.51427781.2.1 | grep part-00005-0bb55c0f-755f-4299-80ee-142d65e38eb4-c000.snappy.parquet
2021-09-20T15:55:14.331+0000 7f2549dc4d00  1 librados: starting msgr at
2021-09-20T15:55:14.331+0000 7f2549dc4d00  1 librados: starting objecter
2021-09-20T15:55:14.331+0000 7f2549dc4d00  1 librados: setting wanted keys
2021-09-20T15:55:14.331+0000 7f2549dc4d00  1 librados: calling monclient init
2021-09-20T15:55:14.333+0000 7f2549dc4d00  1 librados: init done
2021-09-20T15:55:14.339+0000 7f2549dc4d00  1 librados: shutdown

 rados listomapkeys -p my-store.rgw.buckets.index .dir.1cabf71f-2fe9-4187-a928-537277185b62.51427781.2.1 | grep part-00004-0bb55c0f-755f-4299-80ee-142d65e38eb4-c000.snappy.parquet
2021-09-20T15:55:21.649+0000 7f326ae01d00  1 librados: starting msgr at
2021-09-20T15:55:21.649+0000 7f326ae01d00  1 librados: starting objecter
2021-09-20T15:55:21.649+0000 7f326ae01d00  1 librados: setting wanted keys
2021-09-20T15:55:21.649+0000 7f326ae01d00  1 librados: calling monclient init
2021-09-20T15:55:21.651+0000 7f326ae01d00  1 librados: init done
2021-09-20T15:55:21.656+0000 7f326ae01d00  1 librados: shutdowntlap_635_rewrites_5_iterations/2021-09-20T082916.979Z/rewrite_1.parquet/part-00004-0bb55c0f-755f-4299-80ee-142d65e38eb4-c000.snappy.parquet

radosgw-admin bucket limit check

 {
        "user_id": "ceph-user-ibQM7j55",
        "buckets": [
            {
                "bucket": "ceph-bkt-31c9b571-6495-4edd-a0f1-8f2cb9ff97cb",
                "tenant": "",
                "num_objects": 608,
                "num_shards": 2,
                "objects_per_shard": 304,
                "fill_status": "OK" 
            }
        ]
    }


Related issues 1 (0 open1 closed)

Related to rgw - Backport #55044: pacific: cls_rgw_suggest_changes() may accidentally delete newly-completed entriesResolvedJ. Eric IvancichActions
Actions #1

Updated by Casey Bodley over 2 years ago

  • Assignee set to J. Eric Ivancich
Actions #2

Updated by Or Friedmann over 2 years ago

Hi,
Could you please try to listomapkeys from the second shard of the bucket too and share the results?

Actions #3

Updated by Casey Bodley over 2 years ago

  • Subject changed from Bucket index not always created to Bucket index entry not always created
  • Status changed from New to Need More Info
Actions #4

Updated by Benoît Knecht about 2 years ago

We ran into the exact same issue, with a very similar setup: Ceph Pacific (16.2.7), 3 radosgw instances behind HAproxy, a large bucket with 256 shards, and an object that isn't listed but can be read with an explicit GET.

I ran `listomapkeys` on every single shard and can confirm that the object isn't listed in any one of them.

Actions #5

Updated by J. Eric Ivancich about 2 years ago

This sounds like it could result from the bug fixed here:

https://tracker.ceph.com/issues/54528

We should see if the error continues after that's backported to pacific.

Actions #6

Updated by J. Eric Ivancich about 2 years ago

J. Eric Ivancich wrote:

This sounds like it could result from the bug fixed here:

https://tracker.ceph.com/issues/54528

We should see if the error continues after that's backported to pacific.

I've started the backport for pacific. See: https://tracker.ceph.com/issues/55044

Actions #7

Updated by J. Eric Ivancich about 2 years ago

  • Related to Backport #55044: pacific: cls_rgw_suggest_changes() may accidentally delete newly-completed entries added
Actions #8

Updated by Benoît Knecht about 2 years ago

J. Eric Ivancich wrote:

This sounds like it could result from the bug fixed here:

https://tracker.ceph.com/issues/54528

We should see if the error continues after that's backported to pacific.

Sounds plausible indeed! Thanks for pointing it out, and for working on the backport.

I haven't found a reliable way to reproduce the issue though (which makes sense if it's indeed a race condition), so it won't be immediately obvious whether the patch fixed the issue, or if it simply hasn't occurred yet.

Actions

Also available in: Atom PDF