
Bug #57562

multisite replication issue on Quincy

Added by Jane Zhu 5 months ago. Updated 13 days ago.

Status:
Pending Backport
Priority:
High
Assignee:
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
multisite sync backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have encountered replication issues in our multisite setup running Quincy v17.2.3.

Our Ceph clusters are brand new: we tore down our old clusters and re-deployed fresh Quincy ones before running this test.
In our environment we have 3 RGW nodes per site; each node runs 2 radosgw instances for client traffic and 1 instance dedicated to replication.

Our test was done using cosbench with the following settings:
- 10 rgw users
- 3000 buckets per user
- write only
- 6 different object sizes with the following distribution:
1k: 17%
2k: 48%
3k: 14%
4k: 5%
1M: 13%
8M: 3%
- up to 10 million objects per object-size bucket per user, to avoid writing to the same objects
- no multipart uploads involved
The test ran for about 2 hours, roughly from 22:50 on 9/14 to 01:00 on 9/15. After that, the replication tail continued for roughly another 4 hours, until 04:50 on 9/15, with gradually decreasing replication traffic. Then the replication stopped, and nothing has been going on in the clusters since.

While verifying the replication status, we found several issues.
1. The sync status shows the clusters are not fully synced, yet all replication traffic has stopped and nothing is going on in the clusters.
Secondary zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        behind shards: [40,74]

Why did the replication stop even though the clusters are still not in sync?

2. We can see that some buckets are not fully synced, and we were able to identify some missing objects in our secondary zone.
Here is an example bucket. This is its sync status in the secondary zone.

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78])

    source zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78])
                full sync: 0/101 shards
                incremental sync: 100/101 shards
                bucket is behind on 1 shards
                behind shards: [78]

3. As the sync status above shows, the behind shard for the example bucket (78) is not in the list of behind shards in the system-wide sync status. Why is that?

4. The data sync status for these behind shards doesn't list any "pending_buckets" or "recovering_buckets".
An example:

{
    "shard_id": 74,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000003:00000000000003381964",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-09-15T00:00:08.718840Z" 
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Shouldn't the not-yet-in-sync buckets be listed here?
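One way to cross-check this is to compare the per-shard sync status against the raw datalog entries for the same shard. A minimal sketch, assuming a plain radosgw-admin invocation (the `admin` wrapper is a placeholder, and the shard id and zone name are the ones from the status output above):

```shell
# Sketch: cross-check a behind shard by dumping its per-shard sync status and
# the raw datalog entries for that shard. "admin" is a placeholder wrapper.
admin() { sudo radosgw-admin "$@"; }

inspect_behind_shard() {
  local shard="$1" source_zone="$2"
  admin data sync status --shard-id="$shard" --source-zone="$source_zone"
  admin datalog list --shard-id="$shard"
}

# e.g. for shard 74 from the status above:
#   inspect_behind_shard 74 prod-zone-pw
```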

5. The sync status reported by the primary zone differs from that reported by the secondary zone, with different sets of behind shards. The same is true for the per-bucket sync status. Is this expected? Please see item 1 for the sync status of the secondary zone and item 6 for the primary zone.

6. Why does the primary zone have behind shards at all, given that replication runs from the primary to the secondary?
Primary Zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  metadata sync no sync (zone is master)
      data sync source: 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 30 shards
                        behind shards: [6,7,26,28,29,37,47,52,55,56,61,67,68,69,74,79,82,91,95,99,101,104,106,111,112,121,122,123,126,127]

7. We have in-sync buckets that show the correct sync status in the secondary zone but still show behind shards in the primary. Why is that?
Secondary Zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])

    source zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])
                full sync: 0/101 shards
                incremental sync: 99/101 shards
                bucket is caught up with source

Primary zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])

    source zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89754.279])
                full sync: 0/101 shards
                incremental sync: 97/101 shards
                bucket is behind on 11 shards
                behind shards: [9,11,14,16,22,31,44,45,67,85,90]

Our primary goals here are:
1. to find out why the replication stopped while the clusters are not in sync;
2. to understand what we need to do to resume the replication, and to make sure it runs to completion without excessive lag;
3. to understand whether all the sync status info is correct. It seems to us there are many conflicts, and some of it doesn't reflect the real status of the clusters at all.

We have attached the following information about our system:
- ceph.conf of rgws
- ceph config dump
- ceph versions output
- sync status of cluster, an in-sync bucket, a not-in-sync bucket, and some behind shards
- bucket list and bucket stats of a not-in-sync bucket and stat of a not-in-sync object

replication_issue_pri.zip - info for primary site (33 KB) Jane Zhu, 09/15/2022 02:10 PM

replication_issue_sec.zip - info for secondary site (40.3 KB) Jane Zhu, 09/15/2022 02:10 PM

cosbench_workload.tar.gz (1.51 KB) Jane Zhu, 09/28/2022 04:46 PM

workload_dev_pwdc_write_many_buckets_4h.xml.tar.gz (1.43 KB) Jane Zhu, 10/04/2022 10:41 PM

workload_dev_pwdc_write_many_buckets_10min.xml.tar.gz (217 Bytes) Jane Zhu, 10/04/2022 10:41 PM

Issue57562.tar.gz - log files for comment #10 (377 KB) Oguzhan Ozmen, 10/07/2022 08:26 PM

Comment38_TID72267_Logs.csv View (34.3 KB) Oguzhan Ozmen, 11/03/2022 07:11 PM


Related issues

Related to rgw - Bug #57783: multisite: data sync reports shards behind after source zone fully trims datalog In Progress
Copied to rgw - Backport #58402: multisite replication issue on Quincy In Progress
Copied to rgw - Backport #58403: multisite replication issue on Quincy In Progress

History

#1 Updated by Joseph Mundackal 4 months ago

We have verified that this is still a problem on the main branch as well.

#2 Updated by Joseph Mundackal 4 months ago

https://github.com/ceph/ceph/pull/45958 (sync fairness bidding)
https://github.com/ceph/ceph/pull/47566 (Fix races in initialization of data sync)

Potential PRs that might be related to this issue.

#3 Updated by Dan Mick 4 months ago

  • Project changed from Ceph to rgw
  • Tags set to multisite sync

#4 Updated by Vidushi Mishra 4 months ago

We would like to know the workload in detail.
It would be great if the cosbench.xml file for the workload that ran for 2 hours could be shared.

#5 Updated by Jane Zhu 4 months ago

The cosbench workload file has been uploaded.

#6 Updated by Casey Bodley 4 months ago

  • Priority changed from Normal to High

#7 Updated by Jane Zhu 4 months ago

We are able to consistently reproduce the replication issue now. The following are the environment and the steps to reproduce it.
Ceph version:

17.2.3 GA or 17.2.4 GA

Custom configs in config db:

$ sudo ceph config dump
WHO         MASK              LEVEL     OPTION                                          VALUE           RO
mon                           advanced  mon_compact_on_start                            true              
mon                           advanced  mon_max_pg_per_osd                              500               
mon                           advanced  mon_mgr_beacon_grace                            90                
mon                           advanced  mon_stat_smooth_intervals                       9                 
mon                           advanced  osd_pool_default_size                           4                 
mon                           advanced  osd_scrub_auto_repair                           true              
osd                           dev       bluestore_cache_trim_max_skip_pinned            10000             
osd                           advanced  osd_crush_initial_weight                        0.000000          
osd                           advanced  osd_deep_scrub_interval                         2592000.000000    
osd                           advanced  osd_deep_scrub_large_omap_object_key_threshold  500000            
osd                           advanced  osd_max_backfills                               1                 
osd                           advanced  osd_max_scrubs                                  10                
osd         host:osd_node_1   basic     osd_memory_target                               4294967296        
osd         host:osd_node_2   basic     osd_memory_target                               4294967296        
osd         host:osd_node_3   basic     osd_memory_target                               4294967296        
osd         host:osd_node_4   basic     osd_memory_target                               4294967296        
osd                           advanced  osd_op_thread_timeout                           60                
osd                           advanced  osd_scrub_auto_repair                           true              
osd                           advanced  osd_scrub_auto_repair_num_errors                5                 
osd                           advanced  osd_scrub_begin_hour                            0                 
osd                           advanced  osd_scrub_during_recovery                       false             
osd                           advanced  osd_scrub_end_hour                              23                
osd                           advanced  osd_scrub_max_interval                          604800.000000     
osd                           advanced  osd_scrub_min_interval                          259200.000000     
osd                           advanced  osd_scrub_sleep                                 0.050000          
osd.0                         basic     osd_mclock_max_capacity_iops_hdd                18632.987170      
osd.1                         basic     osd_mclock_max_capacity_iops_hdd                19001.305326      
osd.10                        basic     osd_mclock_max_capacity_iops_hdd                19538.878049      
osd.11                        basic     osd_mclock_max_capacity_iops_hdd                17584.470315      
osd.2                         basic     osd_mclock_max_capacity_iops_hdd                18656.206041      
osd.3                         basic     osd_mclock_max_capacity_iops_hdd                18430.691608      
osd.4                         basic     osd_mclock_max_capacity_iops_hdd                20036.659741      
osd.5                         basic     osd_mclock_max_capacity_iops_hdd                19520.095460      
osd.6                         basic     osd_mclock_max_capacity_iops_hdd                18263.526765      
osd.7                         basic     osd_mclock_max_capacity_iops_hdd                18016.738667      
osd.8                         basic     osd_mclock_max_capacity_iops_hdd                19053.610592      
osd.9                         basic     osd_mclock_max_capacity_iops_hdd                20066.962652      
client.rgw                    advanced  objecter_inflight_op_bytes                      5368709120        
client.rgw                    advanced  objecter_inflight_ops                           102400            
client.rgw                    advanced  rgw_lc_max_worker                               3                 
client.rgw                    advanced  rgw_lc_max_wp_worker                            3                 
client.rgw                    advanced  rgw_lifecycle_work_time                         00:00-23:59     * 
client.rgw                    basic     rgw_max_concurrent_requests                     2048              

Multisite clusters settings:

2 clusters, each with 3 mons, 4 OSDs, and 2 RGW nodes
Each RGW node runs 2 client-traffic radosgw instances and 2 replication radosgw instances

Testing tool:

cosbench

Reproduce steps:

1. Create 2 vm clusters for primary site and secondary site.
2. Deploy 17.2.3 GA version or 17.2.4 GA version to both sites.
3. Setup custom configs on mons of both clusters.
4. On the primary site, create 10 rgw users for cosbench tests, and set the max-buckets of each user to 10,000. 
5. Run a cosbench workload to create 30,000 buckets for the 10 rgw users and generate 10 mins write-only traffic. Please see the attached workload file.
6. Run a cosbench workload to create another 30,000 buckets and generate 4 hours of write-only traffic. Please see the attached workload file.
7. We observed “behind shards” in the sync status after the 4-hour cosbench test, and the replication didn't catch up over time.
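The post-test check in the last step can be sketched as a small helper that reports the overall sync status and then drills into each behind shard (the `admin` wrapper is a placeholder for however radosgw-admin is invoked in your environment):

```shell
# Sketch of the post-workload sync check: overall status first, then per-shard
# detail for any shards reported as "behind". "admin" is a placeholder wrapper.
admin() { sudo radosgw-admin "$@"; }

check_site() {
  local source_zone="$1"; shift   # remaining args: behind shard ids
  admin sync status
  for shard in "$@"; do
    admin data sync status --shard-id="$shard" --source-zone="$source_zone"
  done
}

# e.g. on the primary site, for the shards reported below:
#   check_site dev-zone-bcc-secondary 18 48 64
```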

Cluster status:
1) Primary site:

$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
  metadata sync no sync (zone is master)
      data sync source: 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 3 shards
                        behind shards: [18,48,64]

$ sudo radosgw-admin data sync status --shard-id=48 --source-zone=dev-zone-bcc-secondary
{
    "shard_id": 48,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000001:00000000000000001013",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:20.319563Z" 
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

2) Secondary site:

$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [31]
$ sudo radosgw-admin data sync status --shard-id=31 --source-zone=dev-zone-bcc-master
{
    "shard_id": 31,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000001:00000000000000000512",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:03.944817Z" 
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Some error/fail log lines we observed:
1) Primary site

2022-10-02T23:15:12.482-0400 7fbf6a819700  1 req 8882223748441190067 0.001000015s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
…
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 0.068001039s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=10345
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 0.068001039s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1102 push_part failed: r=-34 tid=10345
…
2022-10-03T03:08:00.503-0400 7fc00496e700 -1 rgw rados thread: void rgw::cls::fifo::Trimmer::handle(const DoutPrefixProvider*, rgw::cls::fifo::Completion<rgw::cls::fifo::Trimmer>::Ptr&&, int):1858 trim failed: r=-5 tid=14844
...

2) Secondary site

...
2022-10-02T23:15:50.279-0400 7f679a2ce700  1 req 16201632253829371026 0.001000002s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
...

We did a bucket sync run on a broken bucket, but nothing happened and the bucket still didn't sync.

$ sudo radosgw-admin bucket sync run --bucket=jjm-4hr-test-1k-thisisbcstestload0011007 --source-zone=dev-zone-bcc-secondary

#8 Updated by Casey Bodley 4 months ago

  • Related to Bug #57783: multisite: data sync reports shards behind after source zone fully trims datalog added

#9 Updated by Jane Zhu 4 months ago

The difference between this issue and Bug #57783 is that in our case the buckets/objects are NOT synced.
I tried a small workload afterwards, and the new buckets/objects were actually synced.

#10 Updated by Oguzhan Ozmen 4 months ago

Hi,

Here's some extra data from another test, which used increased rgw debug levels by passing the options debug_rgw 20 and debug_ms 1 to all radosgw instances. We hope the extra debugging might help shed more light on the issue.

Test setup

Configuration-wise, the test setup is an exact replica of what Jane described in comment https://tracker.ceph.com/issues/57562#note-7, and this run uses ceph version 17.2.4 GA.

Each cluster has 2 rgw nodes and each rgw node has only 1 radosgw instance which serves both client and replication traffic.

Master nodes:
  • dev-10-74-3-204
  • dev-10-74-0-235

Secondary nodes:
  • dev-10-74-3-61
  • dev-10-74-3-60

Workload details

We used 5-minute iterations: after running the workload for 5 minutes, the script checks bucket stats from both clusters and moves on to the next iteration if there's no mismatch (in the number of objects).

In the first several iterations we used only warp; later cosbench was added as well. Iteration 38 reproduced the issue. It ran between 23:49 and 23:54 on Oct 5th EST (03:49-03:54 Oct 6 UTC). Each iteration introduces 1200 (cosbench) + 1 (warp) new buckets. When the issue was reproduced, the system had ~57K buckets in total.

Logs

NOTE: we executed bucket sync run on bucket cosinc38-dev-1k-thisisbcstestload00124. It did NOT resolve the issue (i.e., it didn't recover the missing objects), but it reset the bucket sync status for this bucket: bucket sync status no longer reported a behind-shards notice for this bucket, yet the object was still missing on the secondary cluster.

Attached log file Issue57562.tar.gz captures the output of the following commands on both clusters:
  • bucket stats --bucket=<per_bucket_for_19_buckets>
  • bucket sync status --bucket=<per_bucket_for_19_buckets>
  • bilog status --bucket=<per_bucket_for_19_buckets>
  • bucket list --max-entries=1000000 --bucket=<per_bucket_for_19_buckets>
  • sync status
  • mdlog & datalog status
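Expanded, the per-bucket commands above are along these lines (a sketch; the `admin` wrapper is a placeholder, and the actual archive covers all 19 buckets):

```shell
# Sketch of the per-bucket diagnostics captured in Issue57562.tar.gz.
# "admin" is a placeholder wrapper around radosgw-admin.
admin() { sudo radosgw-admin "$@"; }

collect_bucket_diag() {
  local b="$1"
  admin bucket stats --bucket="$b"
  admin bucket sync status --bucket="$b"
  admin bilog status --bucket="$b"
  admin bucket list --max-entries=1000000 --bucket="$b"
}

# run once per affected bucket, then cluster-wide:
#   admin sync status
#   admin mdlog status
#   admin datalog status
```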

The extended rgw log is around 1.4 GB zipped, per node. This tracker's file-upload limits are pretty restrictive, so I cannot upload the logs here.

Is there any cloud file service that you use for sharing large files?

Observations

bucket & cluster level sync status

Looking at the sync status of all 19 buckets (those with missing objects), they all complain about 1 shard, except bucket_sync_status_cosinc38-dev-1m-thisisbcstestload009107:

                incremental sync: 11/11 shards
                bucket is behind on 1 shards
                behind shards: [4]

and the cluster-level sync status reports only datalog shard 45 as behind:

      data sync source: 306054a8-ce80-47a1-a31f-50283c8f3db5 (dev-zone-bcc-master)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [45]

Picking a particular bucket and its missing object:

  • bucket: cosinc38-dev-1k-thisisbcstestload00124
  • instance: 306054a8-ce80-47a1-a31f-50283c8f3db5.5442.28281
  • missing object: myobjects10001494

The PUT request that adds the object to the bucket through the master node dev-10-74-0-235:

2022-10-05T23:54:20.336-0400 dev-10-74-0-235 2022-10-05T23:54:20.336-0400 7f83a671c700 20 req 13899284916862016031 0.000000000s s3:put_obj get_obj_state: rctx=0x7f81efa2aad0 obj=cosinc38-dev-1k-thisisbcstestload00124:myobjects10001494 state=0x55a939918de8 s->prefetch_data=0
2022-10-05T23:54:20.336-0400 dev-10-74-0-235 2022-10-05T23:54:20.336-0400 7f83a671c700 20 req 13899284916862016031 0.000000000s s3:put_obj get_obj_state: rctx=0x7f81efa2aad0 obj=cosinc38-dev-1k-thisisbcstestload00124:myobjects10001494 state=0x55a939918de8 s->prefetch_data=0
2022-10-05T23:54:20.336-0400 dev-10-74-0-235 2022-10-05T23:54:20.336-0400 7f83a671c700 15 req 13899284916862016031 0.000000000s s3:put_obj string_to_sign=PUT

application/octet-stream
Thu, 06 Oct 2022 03:54:20 GMT
/cosinc38-dev-1k-thisisbcstestload00124/myobjects10001494
2022-10-05T23:54:20.336-0400 dev-10-74-0-235 2022-10-05T23:54:20.336-0400 7f83a671c700 10 req 13899284916862016031 0.000000000s string_to_sign:
PUT

application/octet-stream
Thu, 06 Oct 2022 03:54:20 GMT
/cosinc38-dev-1k-thisisbcstestload00124/myobjects10001494
2022-10-05T23:54:20.336-0400 dev-10-74-0-235 2022-10-05T23:54:20.336-0400 7f83a671c700 10 req 13899284916862016031 0.000000000s get_canon_resource(): dest=/cosinc38-dev-1k-thisisbcstestload00124/myobjects10001494
2022-10-05T23:54:20.336-0400 dev-10-74-0-235 2022-10-05T23:54:20.336-0400 7f83a671c700 20 req 13899284916862016031 0.000000000s final domain/bucket subdomain= domain= in_hosted_domain=0 in_hosted_domain_s3website=0 s->info.domain= s->info.request_uri=/cosinc38-dev-1k-thisisbcstestload00124/myobjects10001494
2022-10-05T23:54:20.336-0400 dev-10-74-0-235 2022-10-05T23:54:20.336-0400 7f83a671c700 20 SCRIPT_URI=/cosinc38-dev-1k-thisisbcstestload00124/myobjects10001494
2022-10-05T23:54:20.336-0400 dev-10-74-0-235 2022-10-05T23:54:20.336-0400 7f83a671c700 20 REQUEST_URI=/cosinc38-dev-1k-thisisbcstestload00124/myobjects10001494
2022-10-05T23:54:20.350-0400 dev-10-74-0-235 2022-10-05T23:54:20.350-0400 7f82ebda7700  1 -- 10.74.0.235:0/1505776952 --> [v2:10.74.4.46:6816/17658,v1:10.74.4.46:6817/17658] -- osd_op(unknown.0.0:46744205 8.6 8:6026661c:::306054a8-ce80-47a1-a31f-50283c8f3db5.5442.28281_myobjects10001494:head [create,setxattr user.rgw.idtag (63) in=77b,setxattr user.rgw.tail_tag (63) in=80b,writefull 0~1000 in=1000b,setxattr user.rgw.manifest (390) in=407b,setxattr user.rgw.acl (177) in=189b,setxattr user.rgw.content_type (25) in=46b,setxattr user.rgw.etag (32) in=45b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e963) v8 -- 0x55a93cc4a000 con 0x55a937636800
2022-10-05T23:54:20.360-0400 dev-10-74-0-235 2022-10-05T23:54:20.360-0400 7f8623436700  1 -- 10.74.0.235:0/1505776952 <== osd.11 v2:10.74.4.46:6816/17658 2205000 ==== osd_op_reply(46744205 306054a8-ce80-47a1-a31f-50283c8f3db5.5442.28281_myobjects10001494 [create,setxattr (63),setxattr (63),writefull 0~1000,setxattr (390),setxattr (177),setxattr (25),setxattr (32),call,setxattr (4)] v963'58683 uv58683 ondisk = 0) v8 ==== 587+0+0 (crc 0 0 0) 0x55a9458f2fc0 con 0x55a937636800
2022-10-05T23:54:20.385-0400 dev-10-74-0-235 2022-10-05T23:54:20.385-0400 7f82ebda7700  1 beast: 0x7f81efa2b730: 10.74.4.19 - cosbench-001 [05/Oct/2022:23:54:20.336 -0400] "PUT /cosinc38-dev-1k-thisisbcstestload00124/myobjects10001494 HTTP/1.1" 200 1000 - "aws-sdk-java/1.10.76 Linux/3.10.0-1160.59.1.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.172-b11/1.8.0_172" - latency=0.048998725s

After that, the logs show no attempt at all from the secondary site to sync this object. Looking at the data log notifications right after:

# I assume this is for the above PUT
2022-10-05T23:54:20.362-0400 dev-10-74-0-235 2022-10-05T23:54:20.362-0400 7f82ebda7700 20 req 13899284916862016031 0.025999324s s3:put_obj RGWDataChangesLog::add_entry() bucket.name=cosinc38-dev-1k-thisisbcstest
load00124 shard_id=4 now=2022-10-05T23:54:20.363021-0400 cur_expiration=1969-12-31T19:00:00.000000-0500

# here secondary node processes change for another bucket
2022-10-05T23:54:20.390-0400 dev-10-74-3-61 2022-10-05T23:54:20.390-0400 7fb3cdbb9700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-2k-thisisbcstestload00747:306054a8-ce80-4
7a1-a31f-50283c8f3db5.5656.28378:4

# then both master nodes dev-10-74-0-235 & dev-10-74-3-204
# emit a series of notifications
# all for datalog shard 45 for several different buckets

# all the buckets listed below for dev-10-74-0-235 ended up having missing objects
#  * cosinc38-dev-1k-thisisbcstestload00124 (target of the above PUT request)
#  * cosinc38-dev-2k-thisisbcstestload00648
#  * cosinc38-dev-1m-thisisbcstestload00793
#  * cosinc38-dev-3k-thisisbcstestload00166
#  * cosinc38-dev-3k-thisisbcstestload00760
2022-10-05T23:54:20.416-0400 dev-10-74-0-235 2022-10-05T23:54:20.416-0400 7f85ff3ee700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-1k-thisisbcstestload00124:306054a8-ce80-
47a1-a31f-50283c8f3db5.5442.28281:4
2022-10-05T23:54:20.616-0400 dev-10-74-0-235 2022-10-05T23:54:20.616-0400 7f85ff3ee700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-2k-thisisbcstestload00648:306054a8-ce80-
47a1-a31f-50283c8f3db5.5442.28456:4
2022-10-05T23:54:20.817-0400 dev-10-74-0-235 2022-10-05T23:54:20.817-0400 7f85ff3ee700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-1m-thisisbcstestload00793:306054a8-ce80-
47a1-a31f-50283c8f3db5.5656.28814:4
2022-10-05T23:54:20.877-0400 dev-10-74-3-204 2022-10-05T23:54:20.877-0400 7f958229b700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-2k-thisisbcstestload00930:306054a8-ce80-
47a1-a31f-50283c8f3db5.5442.28402:4
2022-10-05T23:54:21.017-0400 dev-10-74-0-235 2022-10-05T23:54:21.017-0400 7f85ff3ee700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-3k-thisisbcstestload00760:306054a8-ce80-
47a1-a31f-50283c8f3db5.5442.28574:4
2022-10-05T23:54:21.217-0400 dev-10-74-0-235 2022-10-05T23:54:21.217-0400 7f85ff3ee700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-1m-thisisbcstestload010101:306054a8-ce80
-47a1-a31f-50283c8f3db5.5656.28342:9
2022-10-05T23:54:21.477-0400 dev-10-74-3-204 2022-10-05T23:54:21.477-0400 7f958229b700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-3k-thisisbcstestload00357:306054a8-ce80-
47a1-a31f-50283c8f3db5.5442.28713:4
2022-10-05T23:54:21.618-0400 dev-10-74-0-235 2022-10-05T23:54:21.618-0400 7f85ff3ee700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-3k-thisisbcstestload00166:306054a8-ce80-
47a1-a31f-50283c8f3db5.5442.28571:4
2022-10-05T23:54:21.877-0400 dev-10-74-3-204 2022-10-05T23:54:21.877-0400 7f958229b700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-1m-thisisbcstestload001107:306054a8-ce80
-47a1-a31f-50283c8f3db5.5442.28274:9

# later secondary node starts processing the changes:
# they were all originated from master node dev-10-74-3-204
2022-10-05T23:54:21.991-0400 dev-10-74-3-61 2022-10-05T23:54:21.991-0400 7fb3cdbb9700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-3k-thisisbcstestload00357:306054a8-ce80-4
7a1-a31f-50283c8f3db5.5442.28713:4
2022-10-05T23:54:22.077-0400 dev-10-74-3-204 2022-10-05T23:54:22.077-0400 7f958229b700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-2k-thisisbcstestload00945:306054a8-ce80-
47a1-a31f-50283c8f3db5.5656.28655:4
2022-10-05T23:54:22.391-0400 dev-10-74-3-61 2022-10-05T23:54:22.391-0400 7fb3cdbb9700 20 rgw rados thread: process(): notifying datalog change, shard_id=45: cosinc38-dev-1m-thisisbcstestload001107:306054a8-ce80-
47a1-a31f-50283c8f3db5.5442.28274:9,cosinc38-dev-2k-thisisbcstestload00945:306054a8-ce80-47a1-a31f-50283c8f3db5.5656.28655:4

It's as if, during the ~1.5 second period from 23:54:20.362 to 23:54:21.991, some of the datalog changes reported by master node dev-10-74-0-235 were lost or overwritten by the ones from the other master node dev-10-74-3-204.

#11 Updated by Jane Zhu 4 months ago

Are there any suggestions/tips on how we can debug this type of multisite/replication issue?

#12 Updated by Jane Zhu 3 months ago

We have an example scenario here where one of the objects in a bucket failed to be synced to the secondary.
  • The mdlog and datalog on both the primary and secondary sites are empty.
  • The bilog on the primary site has a few entries for the objects.
  • The bilog on the secondary is empty too.

Any insight into how the clusters ended up in this situation?
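For reference, the log checks in this comment can be reproduced with commands along these lines (a sketch; the per-node `-n` client names are omitted, and `admin` is a placeholder wrapper):

```shell
# Sketch of the mdlog/datalog/bilog checks described above.
# "admin" is a placeholder wrapper; add an appropriate -n client name as needed.
admin() { sudo radosgw-admin "$@"; }

check_logs() {
  local b="$1"
  admin mdlog list                 # empty ([]) on both sites in our case
  admin datalog list               # empty ([]) on both sites in our case
  admin bilog list --bucket="$b"   # entries remained only on the primary
}

# e.g. check_logs cosinc4-dev-8m-thisisbcstestload010122
```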

More information here
-------------------------------------------------
Bucket:

cosinc4-dev-8m-thisisbcstestload010122

Object failed to sync:

myobjects110000112

Sync status:

Primary
$ sudo radosgw-admin -n client.rgw.dev-10-34-29-82.dev.1 sync status                                                                      
          realm 3b185cd8-17af-4756-b9f2-e43d2a1a2b12 (dev-realm)
      zonegroup b9eb36ca-fbfb-4dd1-a56e-64cc4a8fe579 (dev-zonegroup)
           zone 2f897fdc-ae8a-44dd-897d-f008ee560ac9 (dev-zone-bcc-master)
   current time 2022-10-17T16:06:01Z
zonegroup features enabled: resharding
  metadata sync no sync (zone is master)
      data sync source: 29985b41-df64-4d09-b572-3c56d6c435a0 (dev-zone-bcc-secondary)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

Secondary
$ sudo radosgw-admin -n client.rgw.dev-10-34-27-236.dev.1 sync status                                                                              
          realm 3b185cd8-17af-4756-b9f2-e43d2a1a2b12 (dev-realm)
      zonegroup b9eb36ca-fbfb-4dd1-a56e-64cc4a8fe579 (dev-zonegroup)
           zone 29985b41-df64-4d09-b572-3c56d6c435a0 (dev-zone-bcc-secondary)
   current time 2022-10-17T16:06:10Z
zonegroup features enabled: resharding
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 2f897fdc-ae8a-44dd-897d-f008ee560ac9 (dev-zone-bcc-master)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 8 shards
                        behind shards: [26,32,34,61,92,98,108,121]

mdlog:

Primary
$ sudo radosgw-admin -n client.rgw.dev-10-34-27-236.dev.1 mdlog list
No --period given, using current period=75854c8d-c4fd-4e53-9d50-7c4f2fabeabc
[]

Secondary
$ sudo radosgw-admin -n client.rgw.dev-10-34-27-236.dev.1 mdlog list
No --period given, using current period=75854c8d-c4fd-4e53-9d50-7c4f2fabeabc
[]

datalog:

Primary
$ sudo radosgw-admin -n client.rgw.dev-10-34-27-236.dev.1 datalog list
[] 

Secondary
$ sudo radosgw-admin -n client.rgw.dev-10-34-27-236.dev.1 datalog list
[] 

Objects in bucket cosinc4-dev-8m-thisisbcstestload010122:

Primary
$ sudo radosgw-admin -n client.rgw.dev-10-34-29-82.dev.1 bucket list --bucket=cosinc4-dev-8m-thisisbcstestload010122 | grep "\"name\"" 
        "name": "myobjects110000012",
        "name": "myobjects110000032",
        "name": "myobjects110000052",
        "name": "myobjects110000072",
        "name": "myobjects110000092",
        "name": "*myobjects110000112*",
        "name": "myobjects110000132",

Secondary
$ sudo radosgw-admin -n client.rgw.dev-10-34-27-236.dev.1 bucket list --bucket=cosinc4-dev-8m-thisisbcstestload010122 | grep "\"name\"" 
        "name": "myobjects110000012",
        "name": "myobjects110000032",
        "name": "myobjects110000052",
        "name": "myobjects110000072",
        "name": "myobjects110000092",
        "name": "myobjects110000132",

"myobjects110000112" is missing from the secondary site.

bilog:

Primary
[jzhu4@dev-10-34-29-82 test_bucket]$ sudo radosgw-admin -n client.rgw.dev-10-34-29-82.dev.1 bilog list --bucket=cosinc4-dev-8m-thisisbcstestload010122 | egrep -i 'op_id|object'
        "op_id": "9#00000000021.197227.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000022.200024.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000023.201286.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000024.202134.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000025.202242.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000026.202462.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000027.203071.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000028.203589.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000029.203893.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000030.204574.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000031.204619.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000032.204694.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000033.206173.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000034.206570.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000035.207327.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000036.207697.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000037.208075.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000038.208731.5",
        "object": "myobjects110000112",
        "op_id": "9#00000000039.208959.5",
        "object": "myobjects110000112",

Secondary
$ sudo radosgw-admin -n client.rgw.dev-10-34-27-236.dev.1 bilog list --bucket=cosinc4-dev-8m-thisisbcstestload010122
[]

#13 Updated by Jane Zhu 3 months ago

We have pretty much narrowed down the problem: a race condition has been identified in FIFO::_prepare_new_head(..) in cls_fifo_legacy.cc.

This race condition caused two datalog objects to be created at about the same time. As a result, the datalog object the RGW wrote to (data_log.34.nysVjQ692KY.2 in this example) was not the object it tried to read from (data_log.34.nysVjQ692KY.1 in this example), so nothing was returned for the GET request fetching datalog entries, and sync was never able to finish for this datalog shard.
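The write side of this race can be illustrated with a toy model (Python, not Ceph code; names like `push_part` and `prepare_new_head_unconditional` only mimic the real cls_fifo operations): two writers that both observed a full head part each create a new part, so the shard ends up with two new parts instead of one.

```python
# Toy model of the _prepare_new_head race (illustrative only, not Ceph code).
# Two writers both see the head part full and each create a new part,
# instead of one of them noticing the head was already advanced.

class FifoMeta:
    def __init__(self):
        self.parts = {0: []}   # part_num -> entries
        self.head = 0          # current head part
        self.max_entries = 2   # tiny parts to force rollover

def push_part(meta, part, entry):
    """Mimics cls fifo.push_part: fails with -ERANGE if the part is full."""
    if len(meta.parts[part]) >= meta.max_entries:
        return -34             # -ERANGE, "part full"
    meta.parts[part].append(entry)
    return 0

def prepare_new_head_unconditional(meta, observed_head):
    """Buggy variant: appends a new part based on what this writer
    observed, without re-checking whether the head already advanced."""
    new_part = max(meta.parts) + 1
    meta.parts[new_part] = []
    meta.head = new_part
    return new_part

def add_entry(meta, entry, racing_observation):
    r = push_part(meta, racing_observation, entry)
    if r == -34:
        new_part = prepare_new_head_unconditional(meta, racing_observation)
        push_part(meta, new_part, entry)

# Fill part 0 so it is full.
meta = FifoMeta()
push_part(meta, 0, "e1")
push_part(meta, 0, "e2")

# Both writers observed head=0 (full) before either created a new part.
add_entry(meta, "bucket-00761-entry", racing_observation=0)  # creates part 1
add_entry(meta, "bucket-00740-entry", racing_observation=0)  # creates part 2!

print(sorted(meta.parts))  # [0, 1, 2] -> two new parts instead of one
```

In the real code the equivalent guard would be a conditional metadata update; the model just shows how two concurrent "part full" observations lead to both .1 and .2 being created.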

Please see the following details.

Bucket:

cosinc4-dev-8m-thisisbcstestload010122

Bucket instance id:

2f897fdc-ae8a-44dd-897d-f008ee560ac9.5191.4200

Object:

myobjects110000112

Datalog shard:

34

Rados objects for datalog shard 34:

$ sudo rados -p dev-zone-bcc-master.rgw.log ls --all | grep data_log.34
        data_log.34.nysVjQ692KY.1
        data_log.34.nysVjQ692KY.2
        data_log.34

Log lines for adding an entry to the datalog. Note that the entry is added to data_log.34.nysVjQ692KY.2.

2022-10-15T21:41:43.174-0400 7f974a096700 20 req 2818992824181721259 0.272002399s s3:put_obj RGWDataChangesLog::add_entry() bucket.name=cosinc4-dev-8m-thisisbcstestload010122 shard_id=9 now=2022-10-15T21:41:43.175106-0400 cur_expiration=1969-12-31T19:00:00.000000-0500
...
2022-10-15T21:41:43.174-0400 7f974a096700  1 -- 10.34.29.82:0/3310533078 --> [v2:10.34.24.193:6808/16505,v1:10.34.24.193:6809/16505] -- osd_op(unknown.0.0:22512711 3.1a 3:5e739a84:::data_log.34.nysVjQ692KY.2:head [call fifo.push_part in=169b] snapc 0=[] ondisk+write+known_if_redirected+returnvec+supports_pool_eio e161) v8 -- 0x5644ee5ad000 con 0x5644e530c800

Log lines for reading datalog entries. Note that the read goes to data_log.34.nysVjQ692KY.1.

2022-10-15T21:42:01.654-0400 7f940ca1c700 20 HTTP_ACCEPT=*/*
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 HTTP_AUTHORIZATION=AWS4-HMAC-SHA256 Credential=P9Eb6S8XNyo4dtZZUUMx/20221016/dev-zonegroup/s3/aws4_request,SignedHeaders=date;host;x-amz-content-sha256;x-amz-date,Signatur
e=55b18947ef8f4855777d7364560dbe54c4e6429fc6cbee0d970fb9b067445789
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 HTTP_DATE=Sun, 16 Oct 2022 01:42:01 +0000
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 HTTP_HOST=10.34.29.82:8080
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 HTTP_VERSION=1.1
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 HTTP_X_AMZ_CONTENT_SHA256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 HTTP_X_AMZ_DATE=20221016T014201Z
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 QUERY_STRING=type=data&id=34&marker=00000000000000000001%3A00000000000000000684&extra-info=true&rgwx-zonegroup=b9eb36ca-fbfb-4dd1-a56e-64cc4a8fe579
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 REMOTE_ADDR=10.34.27.236
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 REQUEST_METHOD=GET
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 REQUEST_URI=/admin/log/?type=data&id=34&marker=00000000000000000001%3A00000000000000000684&extra-info=true&rgwx-zonegroup=b9eb36ca-fbfb-4dd1-a56e-64cc4a8fe579
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 SCRIPT_URI=/admin/log/
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 SERVER_PORT=8080
2022-10-15T21:42:01.654-0400 7f940ca1c700  1 ====== starting new request req=0x7f93d8083700 =====
...
2022-10-15T21:42:01.654-0400 7f940ca1c700 20 req 7552433910555006885 0.000000000s :list_data_changes_log int rgw::cls::fifo::FIFO::list(const DoutPrefixProvider*, int, std::optional<std::basic_string_view<char> >, std::vector<rgw::cls::fifo::list_entry>*, bool*, optional_yield):1618 max_entries=1000 tid=20481
...
2022-10-15T21:42:01.654-0400 7f940ca1c700  1 -- 10.34.29.82:0/3310533078 --> [v2:10.34.26.239:6800/15442,v1:10.34.26.239:6801/15442] -- osd_op(unknown.0.0:22576464 3.c 3:31aba04f:::data_log.34.nysVjQ692KY.1:head [call fifo.part_list in=32b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e161) v8 -- 0x5644eefbe000 con 0x5644e9541800

The race condition and what happened afterwards:
  1. PUT requests for buckets cosinc4-dev-3k-thisisbcstestload00761 and cosinc4-dev-2k-thisisbcstestload00740 were handled at almost the same time by two separate RGW threads, one per bucket.
  2. Both threads found that data_log.34.nysVjQ692KY.0 was full, and each tried to create a new object.
  3. data_log.34.nysVjQ692KY.1 was created to hold the entry for bucket cosinc4-dev-3k-thisisbcstestload00761, and a few moments later data_log.34.nysVjQ692KY.2 was created to hold the entry for bucket cosinc4-dev-2k-thisisbcstestload00740.
  4. The log entry for each bucket was successfully added to its corresponding datalog object.
  5. All subsequent add_entry calls for this datalog shard went to data_log.34.nysVjQ692KY.2.
  6. All subsequent queries for this datalog shard went to data_log.34.nysVjQ692KY.1 and stopped there, since data_log.34.nysVjQ692KY.1 is not full ("head part is not full, so we can assume we're done"). The entries in data_log.34.nysVjQ692KY.2 were never reached.

Bucket cosinc4-dev-3k-thisisbcstestload00761 (which has an entry in data_log.34.nysVjQ692KY.1) was successfully synced.
Bucket cosinc4-dev-2k-thisisbcstestload00740 (which has an entry in data_log.34.nysVjQ692KY.2) was NOT synced.
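The read-side behavior in step 6 can be modeled the same way (toy Python, illustrative names, not the actual rgw::cls::fifo::FIFO::list implementation): a reader that walks parts in order and stops at the first part that is not full never reaches entries in a later part.

```python
# Toy model of the read side: the lister walks parts in order and stops at
# the first part that is not full ("head part is not full, so we can assume
# we're done"), so entries in any later part are never returned.

MAX_ENTRIES = 2  # tiny parts, matching the writer-side toy model

def list_entries(parts, start_part=0):
    out = []
    for num in sorted(p for p in parts if p >= start_part):
        out.extend(parts[num])
        if len(parts[num]) < MAX_ENTRIES:
            break  # part not full -> assumed to be the head, stop here
    return out

# Part layout after the race: .1 and .2 were both created; the reader
# stops at .1 while writers keep appending to .2.
parts = {
    0: ["a", "b"],                # full
    1: ["bucket-00761-entry"],    # not full -> reader stops here
    2: ["bucket-00740-entry"],    # never read, hence never synced
}

print(list_entries(parts))  # ['a', 'b', 'bucket-00761-entry']
```

This matches what we observed: the bucket whose entry landed in part .1 synced, while the bucket whose entry landed in part .2 did not.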
On primary:

$ sudo radosgw-admin -n client.rgw.dev-10-34-29-82.dev.1 bucket stats --bucket=cosinc4-dev-3k-thisisbcstestload00761 | grep num_objects
            "num_objects": 30
$ sudo radosgw-admin -n client.rgw.dev-10-34-29-82.dev.1 bucket stats --bucket=cosinc4-dev-2k-thisisbcstestload00740 | grep num_objects
            "num_objects": 97

On Secondary:

$ sudo radosgw-admin -n client.rgw.dev-10-34-27-236.dev.1 bucket stats --bucket=cosinc4-dev-3k-thisisbcstestload00761 | grep num_objects
            "num_objects": 30
$ sudo radosgw-admin -n client.rgw.dev-10-34-27-236.dev.1 bucket stats --bucket=cosinc4-dev-2k-thisisbcstestload00740  | grep num_objects
            "num_objects": 92

Please see the log lines for the race condition in which data_log.34.nysVjQ692KY.1 and data_log.34.nysVjQ692KY.2 were created.
data_log.34.nysVjQ692KY.1:

...
2022-10-15T21:40:16.420-0400 7f96b9775700 20 req 14275466402377875209 0.077000678s s3:put_obj RGWDataChangesLog::add_entry() bucket.name=cosinc4-dev-3k-thisisbcstestload00761 shard_id=10 now=2022-10-15T21:40:16.420943-0400 cur_expiration=1969-12-31T19:00:00.000000-0500
...
2022-10-15T21:40:16.420-0400 7f96b9775700 20 req 14275466402377875209 0.077000678s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1092 entering: tid=20069
...
2022-10-15T21:40:16.420-0400 7f96b9775700  1 -- 10.34.29.82:0/3310533078 --> [v2:10.34.26.239:6816/18386,v1:10.34.26.239:6817/18386] -- osd_op(unknown.0.0:22227402 3.10 3:08921aef:::data_log.34.nysVjQ692KY.0:head [call fifo.push_part in=169b] snapc 0=[] ondisk+write+known_if_redirected+returnvec+supports_pool_eio e161) v8 -- 0x5644ff7e7800 con 0x5644e5a78800
...
2022-10-15T21:40:16.446-0400 7f96b9775700 -1 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=20069
2022-10-15T21:40:16.446-0400 7f96b9775700 -1 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1102 push_part failed: r=-34 tid=20069
...
2022-10-15T21:40:16.446-0400 7f96b9775700 20 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1387 need new head tid=20069
...
2022-10-15T21:40:16.446-0400 7f96b9775700 20 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):835 need new part: tid=20069
...
2022-10-15T21:40:16.446-0400 7f96b9775700  1 -- 10.34.29.82:0/3310533078 --> [v2:10.34.24.193:6816/17977,v1:10.34.24.193:6817/17977] -- osd_op(unknown.0.0:22227499 3.1d 3:bc2c97bb:::data_log.34:head [call fifo.update_meta in=143b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e161) v8 -- 0x5644ee029400 con 0x5644e9184c00
...
2022-10-15T21:40:16.483-0400 7f96b9775700 20 req 14275466402377875209 0.140001252s s3:put_obj int rgw::cls::fifo::FIFO::create_part(const DoutPrefixProvider*, int64_t, std::string_view, uint64_t, optional_yield):585 entering: tid=20069
...
2022-10-15T21:40:16.483-0400 7f96b9775700  1 -- 10.34.29.82:0/3310533078 --> [v2:10.34.26.239:6800/15442,v1:10.34.26.239:6801/15442] -- osd_op(unknown.0.0:22227635 3.c 3:31aba04f:::data_log.34.nysVjQ692KY.1:head [create,call fifo.init_part in=69b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e161) v8 -- 0x5644f0ba3800 con 0x5644e9541800
...
2022-10-15T21:40:16.522-0400 7f96b9775700 20 req 14275466402377875209 0.179001585s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1092 entering: tid=20069
...
2022-10-15T21:40:16.523-0400 7f96b9775700  1 -- 10.34.29.82:0/3310533078 --> [v2:10.34.26.239:6800/15442,v1:10.34.26.239:6801/15442] -- osd_op(unknown.0.0:22227768 3.c 3:31aba04f:::data_log.34.nysVjQ692KY.1:head [call fifo.push_part in=169b] snapc 0=[] ondisk+write+known_if_redirected+returnvec+supports_pool_eio e161) v8 -- 0x5644e770f800 con 0x5644e9541800
...
2022-10-15T21:40:16.562-0400 7f96b9775700  2 req 14275466402377875209 0.219001934s s3:put_obj completing

data_log.34.nysVjQ692KY.2:

...
2022-10-15T21:40:16.519-0400 7f952ec60700 20 req 3660090314219285824 0.086000763s s3:put_obj RGWDataChangesLog::add_entry() bucket.name=cosinc4-dev-2k-thisisbcstestload00740 shard_id=10 now=2022-10-15T21:40:16.519779-0400 cur_expiration=1969-12-31T19:00:00.000000-0500
...
2022-10-15T21:40:16.519-0400 7f952ec60700 20 req 3660090314219285824 0.086000763s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1092 entering: tid=20070
...
2022-10-15T21:40:16.519-0400 7f952ec60700  1 -- 10.34.29.82:0/3310533078 --> [v2:10.34.26.239:6816/18386,v1:10.34.26.239:6817/18386] -- osd_op(unknown.0.0:22227757 3.10 3:08921aef:::data_log.34.nysVjQ692KY.0:head [call fifo.push_part in=169b] snapc 0=[] ondisk+write+known_if_redirected+returnvec+supports_pool_eio e161) v8 -- 0x5644ee7aac00 con 0x5644e5a78800
...
2022-10-15T21:40:16.537-0400 7f952ec60700 -1 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=20070
2022-10-15T21:40:16.537-0400 7f952ec60700 -1 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1102 push_part failed: r=-34 tid=20070
...
2022-10-15T21:40:16.537-0400 7f952ec60700 20 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1387 need new head tid=20070
...
2022-10-15T21:40:16.537-0400 7f952ec60700 20 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):835 need new part: tid=20070
...
2022-10-15T21:40:16.537-0400 7f952ec60700  1 -- 10.34.29.82:0/3310533078 --> [v2:10.34.24.193:6816/17977,v1:10.34.24.193:6817/17977] -- osd_op(unknown.0.0:22227836 3.1d 3:bc2c97bb:::data_log.34:head [call fifo.update_meta in=143b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e161) v8 -- 0x5644e824d000 con 0x5644e9184c00
...
2022-10-15T21:40:16.563-0400 7f952ec60700 20 req 3660090314219285824 0.130001158s s3:put_obj int rgw::cls::fifo::FIFO::create_part(const DoutPrefixProvider*, int64_t, std::string_view, uint64_t, optional_yield):585 entering: tid=20070
...
2022-10-15T21:40:16.563-0400 7f952ec60700  1 -- 10.34.29.82:0/3310533078 --> [v2:10.34.24.193:6808/16505,v1:10.34.24.193:6809/16505] -- osd_op(unknown.0.0:22227926 3.1a 3:5e739a84:::data_log.34.nysVjQ692KY.2:head [create,call fifo.init_part in=69b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e161) v8 -- 0x5644ecd4ec00 con 0x5644e530c800
...
2022-10-15T21:40:16.639-0400 7f952ec60700 20 req 3660090314219285824 0.206001818s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1092 entering: tid=20070
...
2022-10-15T21:40:16.639-0400 7f952ec60700  1 -- 10.34.29.82:0/3310533078 --> [v2:10.34.24.193:6808/16505,v1:10.34.24.193:6809/16505] -- osd_op(unknown.0.0:22228219 3.1a 3:5e739a84:::data_log.34.nysVjQ692KY.2:head [call fifo.push_part in=169b] snapc 0=[] ondisk+write+known_if_redirected+returnvec+supports_pool_eio e161) v8 -- 0x5644fc0ae800 con 0x5644e530c800
...
2022-10-15T21:40:16.662-0400 7f952ec60700  2 req 3660090314219285824 0.229002029s s3:put_obj completing

#14 Updated by Matt Benjamin 3 months ago

  • Assignee set to Adam Emerson

thanks, that's very interesting

#15 Updated by Adam Emerson 3 months ago

Can we get a more complete log snippet? All the FIFO logging with the relevant TIDs would make tracing what's going on easier.

#16 Updated by Jane Zhu 3 months ago

Please see the following FIFO log snippets, and let me know if you need more.

The creation of data_log.34.nysVjQ692KY.1:

2022-10-15T21:40:16.420-0400 7f96b9775700 20 req 14275466402377875209 0.077000678s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1320 entering: tid=20069
2022-10-15T21:40:16.420-0400 7f96b9775700 20 req 14275466402377875209 0.077000678s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1358 preparing push: remaining=1 batch=0 retries=0 tid=20069
2022-10-15T21:40:16.420-0400 7f96b9775700 20 req 14275466402377875209 0.077000678s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1377 prepared push: remaining=0 batch=1 retries=0 batch_len=172 tid=20069
2022-10-15T21:40:16.420-0400 7f96b9775700 20 req 14275466402377875209 0.077000678s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1092 entering: tid=20069
2022-10-15T21:40:16.446-0400 7f96b9775700 -1 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=20069
2022-10-15T21:40:16.446-0400 7f96b9775700 -1 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1102 push_part failed: r=-34 tid=20069
2022-10-15T21:40:16.446-0400 7f96b9775700 20 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1387 need new head tid=20069
2022-10-15T21:40:16.446-0400 7f96b9775700 20 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):825 entering: tid=20069
2022-10-15T21:40:16.446-0400 7f96b9775700 20 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):835 need new part: tid=20069
2022-10-15T21:40:16.446-0400 7f96b9775700 20 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, bool, uint64_t, optional_yield):747 entering: tid=20069
2022-10-15T21:40:16.446-0400 7f96b9775700 20 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, bool, uint64_t, optional_yield):767 needs new head: tid=20069
2022-10-15T21:40:16.446-0400 7f96b9775700 20 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, bool, uint64_t, optional_yield):780 updating metadata: i=0 tid=20069
2022-10-15T21:40:16.446-0400 7f96b9775700 20 req 14275466402377875209 0.103000909s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):464 entering: tid=20069
2022-10-15T21:40:16.483-0400 7f96b9775700 20 req 14275466402377875209 0.140001252s s3:put_obj int rgw::cls::fifo::FIFO::apply_update(const DoutPrefixProvider*, rados::cls::fifo::info*, const rados::cls::fifo::objv&, const rados::cls::fifo::update&, uint64_t):440 entering: tid=20069
2022-10-15T21:40:16.483-0400 7f96b9775700 20 req 14275466402377875209 0.140001252s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):624 entering: tid=20069
2022-10-15T21:40:16.483-0400 7f96b9775700 20 req 14275466402377875209 0.140001252s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::create, part_num: 1, part_tag: PrBLHvmXjWkTtWxe tid=20069
2022-10-15T21:40:16.483-0400 7f96b9775700 20 req 14275466402377875209 0.140001252s s3:put_obj int rgw::cls::fifo::FIFO::create_part(const DoutPrefixProvider*, int64_t, std::string_view, uint64_t, optional_yield):585 entering: tid=20069
2022-10-15T21:40:16.505-0400 7f96b9775700 20 req 14275466402377875209 0.162001446s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::set_head, part_num: 1, part_tag: PrBLHvmXjWkTtWxe tid=20069
2022-10-15T21:40:16.505-0400 7f96b9775700 20 req 14275466402377875209 0.162001446s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):681 postprocessing: i=0 tid=20069
2022-10-15T21:40:16.505-0400 7f96b9775700 20 req 14275466402377875209 0.162001446s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):464 entering: tid=20069
2022-10-15T21:40:16.522-0400 7f96b9775700 20 req 14275466402377875209 0.179001585s s3:put_obj int rgw::cls::fifo::FIFO::apply_update(const DoutPrefixProvider*, rados::cls::fifo::info*, const rados::cls::fifo::objv&, const rados::cls::fifo::update&, uint64_t):440 entering: tid=20069
2022-10-15T21:40:16.522-0400 7f96b9775700 20 req 14275466402377875209 0.179001585s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1358 preparing push: remaining=0 batch=1 retries=1 tid=20069
2022-10-15T21:40:16.522-0400 7f96b9775700 20 req 14275466402377875209 0.179001585s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1377 prepared push: remaining=0 batch=1 retries=1 batch_len=172 tid=20069
2022-10-15T21:40:16.522-0400 7f96b9775700 20 req 14275466402377875209 0.179001585s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1092 entering: tid=20069

The creation of data_log.34.nysVjQ692KY.2:

2022-10-15T21:40:16.519-0400 7f952ec60700 20 req 3660090314219285824 0.086000763s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1320 entering: tid=20070
2022-10-15T21:40:16.519-0400 7f952ec60700 20 req 3660090314219285824 0.086000763s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1358 preparing push: remaining=1 batch=0 retries=0 tid=20070
2022-10-15T21:40:16.519-0400 7f952ec60700 20 req 3660090314219285824 0.086000763s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1377 prepared push: remaining=0 batch=1 retries=0 batch_len=172 tid=20070
2022-10-15T21:40:16.519-0400 7f952ec60700 20 req 3660090314219285824 0.086000763s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1092 entering: tid=20070
2022-10-15T21:40:16.537-0400 7f952ec60700 -1 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=20070
2022-10-15T21:40:16.537-0400 7f952ec60700 -1 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1102 push_part failed: r=-34 tid=20070
2022-10-15T21:40:16.537-0400 7f952ec60700 20 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1387 need new head tid=20070
2022-10-15T21:40:16.537-0400 7f952ec60700 20 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):825 entering: tid=20070
2022-10-15T21:40:16.537-0400 7f952ec60700 20 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):835 need new part: tid=20070
2022-10-15T21:40:16.537-0400 7f952ec60700 20 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, bool, uint64_t, optional_yield):747 entering: tid=20070
2022-10-15T21:40:16.537-0400 7f952ec60700 20 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, bool, uint64_t, optional_yield):767 needs new head: tid=20070
2022-10-15T21:40:16.537-0400 7f952ec60700 20 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, bool, uint64_t, optional_yield):780 updating metadata: i=0 tid=20070
2022-10-15T21:40:16.537-0400 7f952ec60700 20 req 3660090314219285824 0.104000919s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):464 entering: tid=20070
2022-10-15T21:40:16.563-0400 7f952ec60700 20 req 3660090314219285824 0.130001158s s3:put_obj int rgw::cls::fifo::FIFO::apply_update(const DoutPrefixProvider*, rados::cls::fifo::info*, const rados::cls::fifo::objv&, const rados::cls::fifo::update&, uint64_t):440 entering: tid=20070
2022-10-15T21:40:16.563-0400 7f952ec60700 20 req 3660090314219285824 0.130001158s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):624 entering: tid=20070
2022-10-15T21:40:16.563-0400 7f952ec60700 20 req 3660090314219285824 0.130001158s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::create, part_num: 2, part_tag: v4nCk2KzygJ2s8ql tid=20070
2022-10-15T21:40:16.563-0400 7f952ec60700 20 req 3660090314219285824 0.130001158s s3:put_obj int rgw::cls::fifo::FIFO::create_part(const DoutPrefixProvider*, int64_t, std::string_view, uint64_t, optional_yield):585 entering: tid=20070
2022-10-15T21:40:16.620-0400 7f952ec60700 20 req 3660090314219285824 0.187001660s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::set_head, part_num: 2, part_tag: v4nCk2KzygJ2s8ql tid=20070
2022-10-15T21:40:16.620-0400 7f952ec60700 20 req 3660090314219285824 0.187001660s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):681 postprocessing: i=0 tid=20070
2022-10-15T21:40:16.620-0400 7f952ec60700 20 req 3660090314219285824 0.187001660s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):464 entering: tid=20070
2022-10-15T21:40:16.639-0400 7f952ec60700 20 req 3660090314219285824 0.206001818s s3:put_obj int rgw::cls::fifo::FIFO::apply_update(const DoutPrefixProvider*, rados::cls::fifo::info*, const rados::cls::fifo::objv&, const rados::cls::fifo::update&, uint64_t):440 entering: tid=20070
2022-10-15T21:40:16.639-0400 7f952ec60700 20 req 3660090314219285824 0.206001818s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1358 preparing push: remaining=0 batch=1 retries=1 tid=20070
2022-10-15T21:40:16.639-0400 7f952ec60700 20 req 3660090314219285824 0.206001818s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1377 prepared push: remaining=0 batch=1 retries=1 batch_len=172 tid=20070
2022-10-15T21:40:16.639-0400 7f952ec60700 20 req 3660090314219285824 0.206001818s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1092 entering: tid=20070

#17 Updated by Adam Emerson 3 months ago

Thank you.

#18 Updated by Adam Emerson 3 months ago

  • Status changed from New to In Progress

My small reproducer turned out not to reproduce the issue, but I'm fixing that.

#19 Updated by Oguzhan Ozmen 3 months ago

FWIW, below are some log snippets with enhanced events. Specifically, additional data points were added to some existing log events, and new log events were added to the code path where pushing entries to a datalog shard is retried because the underlying rados object is full (i.e., when push_entries receives an -ERANGE error code).

Two threads (0x7f6e16712700 and 0x7f6f2f944700) get -ERANGE almost simultaneously while pushing entries to part data_log.31.xUl9Rofixnu.0, enter the retry code path, and end up creating two new parts: .1 and .2

thread-1:

2022-10-25T19:15:18.723-0400 7f6e16712700 -1 req 867767205822499074 0.105000757s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1127 push_part failed: r=-34 tid=20663
2022-10-25T19:15:18.723-0400 7f6e16712700 20 req 867767205822499074 0.105000757s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, bool, uint64_t, optional_yield):776 needs new head: tid=20663
2022-10-25T19:15:18.723-0400 7f6e16712700 20 req 867767205822499074 0.105000757s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):845 tid=20663 new_head_num: 1 max_push_part_num: 0 version: XLnKDnGtE/IoQhKN{3}
2022-10-25T19:15:18.723-0400 7f6e16712700  5 req 867767205822499074 0.105000757s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, bool, uint64_t, optional_yield):770 new_head_part_num 0 version XLnKDnGtE/IoQhKN{3} tid=20663
2022-10-25T19:15:18.736-0400 7f6e16712700 20 req 867767205822499074 0.118000850s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::create, part_num: 1, part_tag: QAdGhsV8pTYurzDT tid=20663
2022-10-25T19:15:18.769-0400 7f6e16712700 -1 req 867767205822499074 0.151001081s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1120 calling push_part part_oid=data_log.31.xUl9Rofixnu.1 head_part_num 1 tag QAdGhsV8pTYurzDT tid=20663

thread-2:

2022-10-25T19:15:18.769-0400 7f6f2f944700 -1 req 15873780137161207524 0.149001077s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1127 push_part failed: r=-34 tid=20665
2022-10-25T19:15:18.769-0400 7f6f2f944700 20 req 15873780137161207524 0.149001077s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, bool, uint64_t, optional_yield):776 needs new head: tid=20665
2022-10-25T19:15:18.769-0400 7f6f2f944700 20 req 15873780137161207524 0.149001077s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):845 tid=20665 new_head_num: 2 max_push_part_num: 1 version: XLnKDnGtE/IoQhKN{5}
2022-10-25T19:15:18.769-0400 7f6f2f944700  5 req 15873780137161207524 0.149001077s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, bool, uint64_t, optional_yield):770 new_head_part_num 1 version XLnKDnGtE/IoQhKN{5} tid=20665
2022-10-25T19:15:18.780-0400 7f6f2f944700 20 req 15873780137161207524 0.160001159s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::create, part_num: 2, part_tag: 2caD4XCrcHj7nxWc tid=20665
2022-10-25T19:15:18.839-0400 7f6f2f944700 -1 req 15873780137161207524 0.219001591s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1120  calling push_part part_oid=data_log.31.xUl9Rofixnu.2 head_part_num 2 tag 2caD4XCrcHj7nxWc tid=20665

The interesting observation is that thread-2 hits -ERANGE after thread-1 has already processed the "Op::create part_num:1" entry and created the new head & part. Note that in the previous comment and in this one, both the master and secondary clusters have only 1 RGW node, and each node runs only 1 radosgw instance serving both client and replication traffic.
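The interleaving above can be modeled with a minimal, single-process sketch (hypothetical names and deliberately simplified bookkeeping; this is not the actual rgw::cls::fifo code): each writer snapshots the FIFO metadata, gets "part full" for the old head, and prepares a new head from its own view, so two overlapping retries produce two new parts.

```cpp
#include <utility>

// Hypothetical, simplified model of the FIFO head bookkeeping; the field
// names mirror the log lines above but this is not the real rgw type.
struct FifoMeta {
  int head_part_num = 0;      // current head part
  int max_push_part_num = 0;  // highest part that exists
};

// A writer that got -ERANGE prepares a new head based on its (possibly
// stale) snapshot, then journals Op::create and Op::set_head.
int prepare_new_head(FifoMeta& shared, const FifoMeta& snapshot) {
  int new_head = snapshot.max_push_part_num + 1;  // like _prepare_new_head
  shared.max_push_part_num = new_head;            // journal: Op::create
  shared.head_part_num = new_head;                // journal: Op::set_head
  return new_head;
}

// Reproduce the interleaving from the logs: thread-2 already failed its
// push to part 0 with -ERANGE, so even though it snapshots *after*
// thread-1's set_head (and sees head 1), its retry logic still prepares
// a new head and creates part 2 as well.
std::pair<int, int> race() {
  FifoMeta meta;                            // head = 0, part 0 is full
  FifoMeta snap1 = meta;                    // thread-1 snapshot (head 0)
  int p1 = prepare_new_head(meta, snap1);   // thread-1 creates part 1
  FifoMeta snap2 = meta;                    // thread-2 wakes up late
  int p2 = prepare_new_head(meta, snap2);   // thread-2 creates part 2
  return {p1, p2};
}
```

Run deterministically, the model yields the two distinct new parts (.1 and .2) seen in the thread-1 and thread-2 logs.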

#20 Updated by Jane Zhu 3 months ago

FYI: We pulled in the 2 PRs Casey posted in the tracker https://tracker.ceph.com/issues/57783, and tested again with the newly built version. The issue was reproduced.

https://github.com/ceph/ceph/pull/47682
https://github.com/ceph/ceph/pull/48397

#21 Updated by Adam Emerson 3 months ago

  • Status changed from In Progress to Fix Under Review
  • Backport set to octopus,pacific,quincy
  • Pull request ID set to 48632

I have a candidate fix at https://github.com/ceph/ceph/pull/48632

#22 Updated by Adam Emerson 3 months ago

  • Backport changed from octopus,pacific,quincy to pacific,quincy

#23 Updated by Jane Zhu 3 months ago

Awesome! Thanks for the quick turnaround! We'll pull and test.

#24 Updated by Oguzhan Ozmen 3 months ago

Pulled the changes in on top of commit 9056dbcdeaa7f4350b54a69f669982358ec5448e (on the main branch). Unfortunately, the issue is still reproducible. Interestingly, more shards ended up being behind this time given the same test workload. On the secondary site, normally only 1-2 shards would fall behind over the course of a test run, but this time 24 of them did:

$ sudo radosgw-admin sync status
          realm d4c56a03-5c7d-458a-afda-2918aac80e6b (dev-realm)
      zonegroup 1897c676-48c1-4d0e-bbb4-6f73b3b193cf (dev-zonegroup)
           zone 0c4f1674-7ed1-4ece-88f3-9852f2cc526b (dev-zone-bcc-secondary)
   current time 2022-10-27T00:10:50Z
zonegroup features enabled: resharding
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 96a00bf5-7867-49c8-aadb-9b04ee73c303 (dev-zone-bcc-master)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 24 shards
                        behind shards: [0,1,2,8,14,16,17,24,27,28,29,42,46,59,63,74,75,82,91,93,97,105,122,124]

We didn't enable extra debug levels in this test. We'll enable them in the next test and then provide more details.

#25 Updated by Adam Emerson 3 months ago

I expect there are multiple problems with sync in Quincy, so I don't expect this to actually make sync work.

But if you can check if you can reproduce the race condition you found, I'd appreciate it.

(I don't actually think the race condition is likely to be the cause of that many sync issues, since it doesn't seem that the data should get lost, just misordered.)

#26 Updated by Oguzhan Ozmen 3 months ago

Hi Adam,

We obtained the extra logging with the fix in place.

I think the contention is not within _prepare_new_head; rather, the contending thread (the 2nd thread getting -ERANGE for the same part) reads `info.head_part_num` right before the first thread sets the new head. I'll provide the full logs, but below is a slice focusing on the issue:

  • Bucket: cosinc6-dev-1k-thisisbcstestload00111 / 5744cb84-9635-4292-a9c4-3f85c4d6c86f.5634.6456
  • Ended up having 2 missing objects:
    • myobjects10000221
    • myobjects10000321
  • Relevant rados objects/parts
    • data_log.47.xTzOjGErjZm.0 -> this becomes full and the 2 contending threads create the objects below ~simultaneously
    • data_log.47.xTzOjGErjZm.1
    • data_log.47.xTzOjGErjZm.2
  • Contending thread: 0x7f20b0a8d700 (thread-1) and 0x7f1fe38f3700 (thread-2)

thread-1 calls push_part with part_oid=data_log.47.xTzOjGErjZm.0, gets -ERANGE, and retries after creating part .1:

2022-10-27T01:02:17.150-0400 7f20b0a8d700 -1 req 11013122096762125929 0.043000448s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1152 blp  calling push_part part_oid=data_log.47.xTzOjGErjZm.0 head_part_num 0 tag 0MgMdoJ2Pyc5BL2c tid=20504
2022-10-27T01:02:17.167-0400 7f20b0a8d700 -1 req 11013122096762125929 0.060000625s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1159 blp push_part failed: r=-34 tid=20504
2022-10-27T01:02:17.167-0400 7f20b0a8d700 -1 req 11013122096762125929 0.060000625s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=20504
2022-10-27T01:02:17.167-0400 7f20b0a8d700  1 req 11013122096762125929 0.060000625s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):843 blp after locking before waiting cur_head_num 0 preparing 0 tid=20504
2022-10-27T01:02:17.167-0400 7f20b0a8d700  1 req 11013122096762125929 0.060000625s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):848 blp after waking up cur_head_num 0 info.head_part_num 0 preparing 0 tid=20504
2022-10-27T01:02:17.167-0400 7f20b0a8d700  1 req 11013122096762125929 0.060000625s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):862blp tid=20504 this should be the one to create new part  info.head_part_num: 0 new_head_num: 1 max_push_part_num: 0 version: BgQAeWDss+6wXEYl{3}
2022-10-27T01:02:17.167-0400 7f20b0a8d700  1 req 11013122096762125929 0.060000625s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):877blp tid=20504 after notifying cond info.head_part_num: 0 new_head_num: 1 max_push_part_num: 0 version: BgQAeWDss+6wXEYl{3}
2022-10-27T01:02:17.186-0400 7f20b0a8d700 20 req 11013122096762125929 0.079000816s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::create, part_num: 1, part_tag: AbAknPDoRzoBiCbo tid=20504
2022-10-27T01:02:17.233-0400 7f20b0a8d700 20 req 11013122096762125929 0.126001313s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::set_head, part_num: 1, part_tag: AbAknPDoRzoBiCbo tid=20504

# thread-1 finally pushing entries to .1 at the next try
2022-10-27T01:02:17.242-0400 7f20b0a8d700 -1 req 11013122096762125929 0.135001406s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1152 blp  calling push_part part_oid=data_log.47.xTzOjGErjZm.1 head_part_num 1 tag AbAknPDoRzoBiCbo tid=20504

Note that set_head took place at 01:02:17.233.

thread-2 is not contending with thread-1 in creating the new head; it just misses the set_head and tries to push entries to the full part. That is, it also calls push_part with part_oid=data_log.47.xTzOjGErjZm.0. Note the time when thread-2 pushes entries: it is the same timestamp as the set_head above, so thread-2 misses the update and therefore creates a new part (.2).

2022-10-27T01:02:17.233-0400 7f1fe38f3700 -1 req 16595717534375518067 0.090000935s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1152 blp  calling push_part part_oid=data_log.47.xTzOjGErjZm.0 head_part_num 0 tag 0MgMdoJ2Pyc5BL2c tid=20505
2022-10-27T01:02:17.253-0400 7f1fe38f3700 -1 req 16595717534375518067 0.110001147s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1159 blp push_part failed: r=-34 tid=20505
2022-10-27T01:02:17.253-0400 7f1fe38f3700 -1 req 16595717534375518067 0.110001147s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=20505
2022-10-27T01:02:17.253-0400 7f1fe38f3700  1 req 16595717534375518067 0.110001147s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):843 blp after locking before waiting cur_head_num 1 preparing 0 tid=20505
2022-10-27T01:02:17.253-0400 7f1fe38f3700  1 req 16595717534375518067 0.110001147s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):848 blp after waking up cur_head_num 1 info.head_part_num 1 preparing 0 tid=20505
2022-10-27T01:02:17.253-0400 7f1fe38f3700  1 req 16595717534375518067 0.110001147s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):862blp tid=20505 this should be the one to create new part  info.head_part_num: 1 new_head_num: 2 max_push_part_num: 1 version: BgQAeWDss+6wXEYl{5}
2022-10-27T01:02:17.253-0400 7f1fe38f3700  1 req 16595717534375518067 0.110001147s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):877blp tid=20505 after notifying cond info.head_part_num: 1 new_head_num: 2 max_push_part_num: 1 version: BgQAeWDss+6wXEYl{5}
2022-10-27T01:02:17.278-0400 7f1fe38f3700 20 req 16595717534375518067 0.135001406s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::create, part_num: 2, part_tag: rFoU3XYqoe05T5pJ tid=20505
2022-10-27T01:02:17.309-0400 7f1fe38f3700 20 req 16595717534375518067 0.166001722s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::set_head, part_num: 2, part_tag: rFoU3XYqoe05T5pJ tid=20505

# thread-2 finally pushing entries to .2 at the next try
2022-10-27T01:02:17.328-0400 7f1fe38f3700 -1 req 16595717534375518067 0.185001925s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1152 blp  calling push_part part_oid=data_log.47.xTzOjGErjZm.2 head_part_num 2 tag rFoU3XYqoe05T5pJ tid=20505

That is, the contention is not within creating the new head/part per se; the threads are missing each other's updates.

I believe the fix can be extended a bit (time-window-wise): after getting -ERANGE we can check whether the `preparing` flag is on and wait on it, and this flag should be cleared only at the very end of creating the head & part.
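The widened window suggested above could be sketched like this (illustrative only, not the actual Ceph patch; `HeadPreparer` and its members are hypothetical names): every -ERANGE retry waits out an in-flight head preparation, and `preparing` is cleared only after both the new part and the set_head have been published.

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative sketch of gating -ERANGE retries on a `preparing` flag.
class HeadPreparer {
  std::mutex m;
  std::condition_variable cv;
  bool preparing = false;
  int head_part_num = 0;

public:
  // Called by a writer after push_part returned -ERANGE for `seen_head`.
  // Returns the head part number the caller should retry against.
  int on_erange(int seen_head) {
    std::unique_lock<std::mutex> lk(m);
    // Wait out any in-flight head preparation before deciding anything.
    cv.wait(lk, [&] { return !preparing; });
    if (head_part_num > seen_head)
      return head_part_num;   // someone else already advanced the head
    preparing = true;         // we are the one to create the new head
    lk.unlock();
    // ... create the new part and set_head against rados here ...
    lk.lock();
    ++head_part_num;          // publish only at the very end
    preparing = false;
    cv.notify_all();          // wake writers that piled up behind us
    return head_part_num;
  }
};
```

With this shape, a second writer that got -ERANGE for the same old head blocks until the first writer finishes, then simply retries against the already-created new head instead of creating another part.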

#27 Updated by Adam Emerson 3 months ago

Pushed a newer fix that does the check in need_new_head()

#28 Updated by Adam Emerson 3 months ago

Pushed a newer, newer fix that guards all calls to _prepare_new_head behind check/set of preparing.

#29 Updated by Adam Emerson 3 months ago

Pushed a new version that should make listing return all the objects reliably.

#30 Updated by Oguzhan Ozmen 3 months ago

Thank you Adam. We'll test with the latest change.

#31 Updated by Krunal Chheda 3 months ago

Hey Adam,
Just a heads-up: we tested with the latest commit and we still see the issue.
The issue appears when running multiple RGW instances. The fix works fine with a single RGW instance, but with multiple RGW hosts (each host running multiple instances), the issue is reproduced.
Looking at the non-debug logs, we see the same problem: multiple RGW instances retry creating a new part at the same time, so one of the parts never becomes full, and that orphan part returns empty data when queried by the secondary.
Some details below for one of the parts that was orphaned.
In the sync status we see that datalog shard 38 is behind, and when we look at all the parts of data_log.38, we see the following:

sudo rados -p dev-zone-bcc-master.rgw.log ls --all | grep  data_log.38.vX0E05td12R
    data_log.38.vX0E05td12R.4
    data_log.38.vX0E05td12R.8
    data_log.38.vX0E05td12R.12
    data_log.38.vX0E05td12R.11
    data_log.38.vX0E05td12R.5
    data_log.38.vX0E05td12R.10
    data_log.38.vX0E05td12R.6
    data_log.38.vX0E05td12R.7
    data_log.38.vX0E05td12R.9

Of all the parts above, part 4 is the orphaned one; we confirmed this by comparing the sizes of all the parts: every other part is 4 MB, while part 4 is 407 KB.
[cloud-user@dev-10-74-2-209 ~]$ ls -lah data_log.7
-rw-r--r--. 1 root root 4.0M Nov  1 12:50 data_log.7
[cloud-user@dev-10-74-2-209 ~]$ ls -lah data_log.4
-rw-r--r--. 1 root root 407K Nov  1 12:50 data_log.4

Even in the rgw logs, the secondary only ever tries to get the datalog from part 4; it never moves on to part 5, since part 4 is never full. So all the updates in parts 5, 6, 7, ... are never synced. The size of the response payload is just 44 bytes, indicating that the primary returned essentially empty data when reading part 4.
We see the log line below at regular intervals; the datalog read for shard 38 never moves to part 5!
2022-11-01T13:14:05.250-0400 7f04f7e96700  1 beast: 0x7f02a60f0700: 10.74.2.110 - dev-user [01/Nov/2022:13:14:05.248 -0400] "GET /admin/log/?type=data&id=38&marker=00000000000000000004%3A00000000000000416060&extra-info=true&rgwx-zonegroup=aa15f170-a940-44f8-a4e8-e37dd1f7ee1e HTTP/1.1" 200 44 - - - latency=0.001999982s

Since we do not have the debug logs, we cannot provide exact proof that the race condition was due to multi-process RGW instances, but based on the other logs we are confident the issue is the same (an orphan part created by a race condition).

Looking at the current fix, we serialize the different RGW threads that create a new part, but what about different RGW instances racing to create new parts?

#32 Updated by Adam Emerson 3 months ago

That's the point of the commit `rgw/fifo: `part_full` is not a reliable indicator`. There is no 'orphan part' in that case, since if there's a short part that never gets filled up, we just read it and keep listing from the next one. This should make listing robust against that issue occurring.

Though not having it occur at all is also useful, and I'll see about ruling it out in terms of how we handle the journal.
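The listing behaviour described above can be sketched in a toy model (illustrative names, not the real FIFO lister): a short part must not stop the listing; we keep reading subsequent parts as long as they exist, and we never trust a "full" flag to decide when to stop.

```cpp
#include <string>
#include <vector>

// Toy model of a part: some entries plus a full flag that, per the commit
// message, is not a reliable indicator and is therefore ignored below.
struct Part {
  std::vector<std::string> entries;
  bool full;
};

// List every entry across all parts. We do NOT break when a part is short
// or not marked full; a later part may still hold entries (the orphan-part
// case from this tracker).
std::vector<std::string> list_all(const std::vector<Part>& parts) {
  std::vector<std::string> out;
  for (const auto& p : parts) {
    out.insert(out.end(), p.entries.begin(), p.entries.end());
  }
  return out;
}
```

In this model a secondary that reaches a 407 KB "orphan" part simply consumes whatever it holds and continues with the next part, instead of polling the short part forever.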

#33 Updated by Adam Emerson 3 months ago

Ah, I see, I need to update the async lister.

#34 Updated by Krunal Chheda 3 months ago

Agreed. As you mentioned, the other solution could be that the secondary is not limited to just listing the orphan part but continues and reads the next part if available. However, I feel this solution is more of a band-aid for our current issue, where, due to a race condition, one of the parts is never completely written. If new part creation were coupled tightly enough to happen without a race condition, this issue would never occur.

Regarding continuing to list from other parts: currently the master tells the secondary the next part and next marker it needs to query for the subsequent listing, and since the orphan part has no more data, nor any info about other parts, it never directs the secondary to go and list the next part. Correct?

#35 Updated by Adam Emerson 3 months ago

Pushed a new version with what should be a fix for multi-thread and multi-client races.

#36 Updated by Oguzhan Ozmen 3 months ago

Adam Emerson wrote:

Pushed a new version with what should be a fix for multi-thread and multi-client races.

We have NOT tested these most recent changes yet. Perhaps it's no longer relevant given them, but below is the result of testing the HEAD at https://github.com/ceph/ceph/commit/287b6279a19fe2599c519f0f61f25d7d4563bbe1 (which introduced the "at_head" logic for the first time).

The log snippet below shows that, although the contention within a single rgw instance is resolved, we still see contention between rgw instances. In this test, we used 2 physical rgw nodes, each running a single rgw instance. The snippet shows that these 2 nodes can each create a new part ~simultaneously for a given data_log shard.

  1 RGW_HOST-1 (tid=20761):
  2
  3 2022-11-01T21:20:12.623-0400 7f8ceffd4700 20 rgw data changes log: int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1143 entering: tid=20761
  4 2022-11-01T21:20:12.652-0400 7f8ceffd4700 -1 rgw data changes log: int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1151 blp  calling push_part part_oid=data_log.66.pyr8pUHPTsL.1 head_part_num 1 tag jwYy5ikhwXNv8a1H tid=20761
  5 2022-11-01T21:20:12.681-0400 7f8ceffd4700 -1 rgw data changes log: int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1158 blp push_part failed: r=-34 tid=20761
  6 2022-11-01T21:20:12.681-0400 7f8ceffd4700  1 rgw data changes log: int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):846blp tid=20761 info.head_part_num: 1 new_head_num: 2 max_push_part_num: 1 version: F/4VrjoilGtNtX+r{8}
  7 2022-11-01T21:20:12.699-0400 7f8ceffd4700 20 rgw data changes log: int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::create, part_num: 2, part_tag: aHFwEuREmpF4O7H1 tid=20761
  8 2022-11-01T21:20:12.724-0400 7f8ceffd4700 20 rgw data changes log: int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::set_head, part_num: 2, part_tag: aHFwEuREmpF4O7H1 tid=20761
  9 2022-11-01T21:20:12.746-0400 7f8ceffd4700 -1 rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield)::<lambda()>:858blp within make_scope_guard tid=20761 info.head_part_num: 2 info.max_push_part_num: 2 preparing: 1
 10 2022-11-01T21:20:12.746-0400 7f8ceffd4700 -1 rgw data changes log: int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1151 blp  calling push_part part_oid=data_log.66.pyr8pUHPTsL.2 head_part_num 2 tag aHFwEuREmpF4O7H1 tid=20761
 11
 12 RGW_HOST-2 (tid=20152):
 13
 14 2022-11-01T21:20:12.308-0400 7fda80f8d700 20 req 3984248768763079002 0.123002090s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1143 entering: tid=20152
 15 2022-11-01T21:20:12.791-0400 7fdb398fe700 -1 req 12237853394960534883 0.156002656s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1151 blp  calling push_part part_oid=data_log.66.pyr8pUHPTsL.1 head_part_num 1 tag jwYy5ikhwXNv8a1H tid=20152
 16 2022-11-01T21:20:12.840-0400 7fdb398fe700 -1 req 12237853394960534883 0.205003485s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=20152
 17 2022-11-01T21:20:12.840-0400 7fdb398fe700  1 req 12237853394960534883 0.205003485s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield):846blp tid=20152 info.head_part_num: 2 new_head_num: 3 max_push_part_num: 2 version: F/4VrjoilGtNtX+r{10}
 18 2022-11-01T21:20:12.859-0400 7fdb398fe700 20 req 12237853394960534883 0.224003807s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::create, part_num: 3, part_tag: RJH9m8hIQrSsMHcG tid=20152
 19 2022-11-01T21:20:12.942-0400 7fdb398fe700 20 req 12237853394960534883 0.307005227s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):637 processing entry: entry=op: Op::set_head, part_num: 3, part_tag: RJH9m8hIQrSsMHcG tid=20152
 20 2022-11-01T21:20:12.953-0400 7fdb398fe700 -1 rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, uint64_t, optional_yield)::<lambda()>:858blp within make_scope_guard tid=20152 info.head_part_num: 3 info.max_push_part_num: 3 preparing: 1
 21 2022-11-01T21:20:12.953-0400 7fdb398fe700 -1 req 12237853394960534883 0.318005413s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1151 blp  calling push_part part_oid=data_log.66.pyr8pUHPTsL.3 head_part_num 3 tag RJH9m8hIQrSsMHcG tid=20152
  • Lines 4 and 15 show that both nodes try to push entries to data_log.66.pyr8pUHPTsL.1, and both fail since .1 is full: lines 5 & 16 get the -34 return code. This puts both nodes into the retry logic.
  • Lines 6 and 17 show how the two threads from the 2 different nodes interleave. Running ~160 msec later, host-2 at line 17 bumps up the part number one more time after host-1 has already incremented it to 2.
  • Finally, lines 10 and 21 show that the updates go to 2 different parts/rados objects at the next retry.

Added the above analysis just to be complete. We'll be testing the new changes, which should hopefully fix multi-client (across rgw instances) races as well.
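In-process locks cannot serialize separate radosgw processes, which is why the inter-instance race above survives the thread-level fix. The usual remedy with rados is a conditional update: each writer asserts the metadata version it read, so only one instance's head advance wins and the loser re-reads and retries. A toy model of that idea (hypothetical names; the real code uses rados object-version assertions):

```cpp
#include <optional>

// Toy model of versioned FIFO metadata shared by all RGW instances.
struct Meta {
  int version = 0;
  int head_part_num = 0;
};

// Apply "advance head to new_head" only if the caller's version matches
// the current one; otherwise return the current meta so the caller can
// re-read and retry (analogous to getting -ECANCELED from the update).
std::optional<Meta> conditional_set_head(Meta& shared, int expect_version,
                                         int new_head) {
  if (shared.version != expect_version)
    return shared;            // raced: hand back the fresh metadata
  shared.head_part_num = new_head;
  ++shared.version;
  return std::nullopt;        // success: our update was applied
}
```

Under this scheme host-2's stale attempt fails the version check instead of silently creating a second new part, and its retry then targets the head host-1 already published.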

#37 Updated by Adam Emerson 3 months ago

Yeah, both those commits are gone, make sure you only have the newest one.

#38 Updated by Oguzhan Ozmen 3 months ago

We are still testing the latest revision (HEAD at https://github.com/ceph/ceph/commit/cfc3bde36dbc9c6e0b7182bbb32539019ba2dd61), but initial testing with an increased rgw debug level (20) and some custom log events could reproduce the issue in a slightly different way than before (as expected).

This time there is no orphaned part: no two parts are created by separate threads or instances at the same time, which is good. Now we see that a new part cannot be created after a cancellation. The existing logs capture only some of the details; let's see whether they are sufficient to root-cause the problem. In the meantime, we are planning to add some more log events for finer-grained data points. Below is the initial analysis.

  • Bucket: cosinc10-dev-1k-thisisbcstestload00213 (43e66047-e6b5-472d-8813-6224a31ce963.5374.5897)
  • 2 Objects Missing on the secondary site: myobjects10001743 and myobjects10001783
  • Bucket Index Shard that missing objects fall onto: 6
  • Datalog shard it corresponds to: 127.

There's no orphaned datalog shard object in this case:

$ sudo rados -p dev-zone-bcc-master.rgw.log ls | grep "data_log.127" 
data_log.127
data_log.127.WkMkUiakL0/.6

The current status of datalog shard 127:

shard 127
    {
        "marker": "00000000000000000006:00000000000004161441",
        "last_update": "2022-11-03T02:04:27.862783Z" 
    }

The issue started at around 22:04:28.453 EST (~500 msec after the `last_update`): creating part 7 failed for the first time at this point, and all subsequent attempts failed as well with -17 (EEXIST, already exists). E.g.,

2022-11-02T22:04:28.532-0400 7f07b255b700 -1 req 15462255817190376745 0.161001116s s3:put_obj int rgw::cls::fifo::FIFO::create_part(const DoutPrefixProvider*, int64_t, std::string_view, uint64_t, optional_yield):597 part_init failed: r=-17 tid=72267
2022-11-02T22:04:28.532-0400 7f07b255b700 -1 req 15462255817190376745 0.161001116s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):870 inconsistency, push part less than head part:  tid=72267

Below is the content of underlying rados objects for datalog shard 127:

# content of data_log.127 object
$ cat data_log.127.txt
data_log.127 a2lqXIRnGQfmStV8 data_log.127.WkMkUiakL0/@? 4wzV7UX5fcP6Mnb1 3aI7McOwlcnVGC2K h3s2iIcPp6nE1EIB 4wzV7UX5fcP6Mnb1 4wzV7UX5fcP6Mnb1 3aI7McOwlcnVGC2K 3aI7McOwlcnVGC2K

# looking for tags with part 6 object:
$ egrep -o "4wzV7UX5fcP6Mnb1" data_log.127.6.txt
$ egrep -o "3aI7McOwlcnVGC2K" data_log.127.6.txt
$ egrep -o "a2lqXIRnGQfmStV8" data_log.127.6.txt
$ egrep -o "h3s2iIcPp6nE1EIB" data_log.127.6.txt
  Binary file data_log.127.6 matches

If I'm not mistaken, "a2lqXIRnGQfmStV8" is the version used when creating a new head, since it is the one used in all 7 _prepare_new_head requests (for shard 127) during the course of the test.

Looking into the transaction that first failed to create part 7:

  • tid=72267
  • req=15462255817190376745

I'll attach the full log lines for this transaction. Below is a slice of the logs showing that the rados operation was successful but we failed to update the metadata, and we mark this situation as a cancellation, which could be the source of the issue:

  1 # part full - retry after creating a new head
  2 2022-11-02T22:04:28.453-0400 7f07b255b700 -1 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1172 blp push_part failed: r=-34 tid=72267
  3 2022-11-02T22:04:28.453-0400 7f07b255b700 -1 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=72267
  4 # current part num is 6 - we need to create part 7
  5 2022-11-02T22:04:28.453-0400 7f07b255b700 20 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1458 blp need to retry and need new head  head_part_num=6 tid=72267
  6 2022-11-02T22:04:28.453-0400 7f07b255b700  1 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):842 blp entering: tid=72267
  7 # we should be creating 'data_log.127.WkMkUiakL0/.7'
  8 2022-11-02T22:04:28.453-0400 7f07b255b700  1 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):847blp tid=72267 info.head_part_num: 6 max_push_part_num: 6 new_head_part_num: 7 info.part_oid(new_head_part_num): data_log.127.WkMkUiakL0/.7 version: a2lqXIRnGQfmStV8{29}
  9 2022-11-02T22:04:28.453-0400 7f07b255b700 20 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):859 need new part: tid=72267
 10 2022-11-02T22:04:28.453-0400 7f07b255b700 20 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):755blp entering: tid=72267
 11 2022-11-02T22:04:28.453-0400 7f07b255b700  5 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):772 blp new_part_num 7 is_head 1 version a2lqXIRnGQfmStV8{29} tid=72267
 12 # new head needed
 13 2022-11-02T22:04:28.453-0400 7f07b255b700 20 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):779 blp needs new head: tid=72267
 14 # journal content seems accurate at this point
 15 2022-11-02T22:04:28.453-0400 7f07b255b700 20 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):784blp tid=72267 jentries.front().part_num 7 jentries.back().part_num 7
 16 # updating metadata
 17 2022-11-02T22:04:28.453-0400 7f07b255b700 20 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):796blp updating metadata: i=0 tid=72267

 18 #
 19 # looks like it's cancelled
 20 # rgw_rados_operate must have returned -ECANCELED
 21 #
 22 2022-11-02T22:04:28.453-0400 7f07b255b700 20 req 15462255817190376745 0.082000576s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):465 entering: tid=72267
 23 2022-11-02T22:04:28.453-0400 7f07b255b700 20 req 15462255817190376745 0.082000576s s3:put_obj WARNING: blocking librados call
 24 2022-11-02T22:04:28.470-0400 7f07b255b700 20 req 15462255817190376745 0.099000692s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1289 entering: tid=72267
 25 # this should be returning 0 since we don't see "returning error: " event but only cancellation
 26 2022-11-02T22:04:28.479-0400 7f07b255b700 20 req 15462255817190376745 0.108000755s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):484 canceled: tid=72267
At this point we should be at https://github.com/ceph/ceph/blob/cfc3bde36dbc9c6e0b7182bbb32539019ba2dd61/src/rgw/cls_fifo_legacy.cc#L785. I don't see any "blp raced, journaled but not processed: i=0" error, so for some reason this entry is not in the journal anymore?

#
# cancelled again - but this time it's a bit different
# see https://github.com/ceph/ceph/blob/cfc3bde36dbc9c6e0b7182bbb32539019ba2dd61/src/rgw/cls_fifo_legacy.cc#L473
# rgw_rados_operate did not return -ECANCELED this time,
# because we see apply_update! So the return code from rgw_rados_operate must be >= 0 this time,
# and apply_update must have returned an error, since we do a read_meta right after.
#
2022-11-02T22:04:28.501-0400 7f07b255b700 20 req 15462255817190376745 0.130000919s s3:put_obj int rgw::cls::fifo::FIFO::apply_update(const DoutPrefixProvider*, rados::cls::fifo::info*, const rados::cls::fifo::objv&, const rados::cls::fifo::update&, uint64_t):440 entering: tid=72267
2022-11-02T22:04:28.501-0400 7f07b255b700 -1 req 15462255817190376745 0.130000919s s3:put_obj int rgw::cls::fifo::FIFO::apply_update(const DoutPrefixProvider*, rados::cls::fifo::info*, const rados::cls::fifo::objv&, const rados::cls::fifo::update&, uint64_t):444 version mismatch, canceling: tid=72267
2022-11-02T22:04:28.501-0400 7f07b255b700 20 req 15462255817190376745 0.130000919s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1289 entering: tid=72267
# returning 0 since we don't see a "returning error: " event
2022-11-02T22:04:28.505-0400 7f07b255b700 20 req 15462255817190376745 0.134000942s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):484 canceled: tid=72267
  • Could it cause the problem that rgw_rados_operate succeeds while we cannot apply the update locally because of a version mismatch?
  • What does the version mismatch mean here?

At this point we should again be at https://github.com/ceph/ceph/blob/cfc3bde36dbc9c6e0b7182bbb32539019ba2dd61/src/rgw/cls_fifo_legacy.cc#L785. Now I see "blp raced, journaled but not processed: i=1" this time (i==1). The entry is found in the journal???

  • Why couldn't it find the entry in the journal the first time (i==0)???
2022-11-02T22:04:28.505-0400 7f07b255b700 20 req 15462255817190376745 0.134000942s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):811 blp raced, journaled but not processed: i=1 tid=72267

The most interesting part: now, when processing the journal, we see part_num 6 but not 7????? Hence the stream of r=-17 (EEXIST) errors.

2022-11-02T22:04:28.505-0400 7f07b255b700 20 req 15462255817190376745 0.134000942s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):643 blp create entry.part_num 6 entry.part_tag 4wzV7UX5fcP6Mnb1 tid=72267
2022-11-02T22:04:28.505-0400 7f07b255b700 20 req 15462255817190376745 0.134000942s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):638 processing entry: entry=op: Op::create, part_num: 6, part_tag: 4wzV7UX5fcP6Mnb1 tid=72267
2022-11-02T22:04:28.532-0400 7f07b255b700 -1 req 15462255817190376745 0.161001116s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):674 processing entry failed: entry=op: Op::create, part_num: 6, part_tag: 4wzV7UX5fcP6Mnb1 r=-17 tid=72267
2022-11-02T22:04:28.532-0400 7f07b255b700 -1 req 15462255817190376745 0.161001116s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):870 inconsistency, push part less than head part:  tid=72267

#39 Updated by Oguzhan Ozmen 3 months ago

I think if create_part is made exclusive, one of them would fail at part creation and let the other complete the part creation and set the head. I believe the version check is there to avoid this exact situation - another solution would perhaps be to revisit the versioning framework.

The create_part already is exclusive. I added two unit tests to confirm that it's gated properly. I think what you're seeing is likely a combination of versions failing to update, and info updates potentially making changes to the tags THEN failing.

You can check the two unit tests if you like. Trying to put two 'create' entries in the journal for the same part fails, and recreating a part with a different tag also fails with EEXIST.
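To make the exclusive-create gating discussed above concrete, here is a toy model (Python, not Ceph code; names and values are illustrative only) of the behavior the unit tests confirm: creating a part object that already exists fails with EEXIST regardless of the tag supplied, which is exactly the r=-17 seen in the process_journal logs when a stale journal entry tries to recreate part 6.

```python
import errno

class PartStore:
    """Toy stand-in for the RADOS objects backing FIFO parts."""
    def __init__(self):
        self.parts = {}  # part_num -> tag

    def create_part(self, part_num, tag):
        # Exclusive create: fail with EEXIST if the part object already
        # exists, no matter which tag the caller supplies.
        if part_num in self.parts:
            return -errno.EEXIST  # mirrors the r=-17 in the logs above
        self.parts[part_num] = tag
        return 0

store = PartStore()
assert store.create_part(6, "4wzV7UX5fcP6Mnb1") == 0
# Replaying a stale journal entry for the same part fails,
# even with a different tag:
assert store.create_part(6, "someOtherTag") == -errno.EEXIST
```

So the exclusivity itself holds; the question raised in this thread is how a stale `create` entry for an already-created part ends up being replayed at all.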

#40 Updated by Jane Zhu 3 months ago

Here is a more detailed explanation of how the -EINVAL error (and hence the datalog write failure) happens, based on analysis of our debug logs and the source code.
Please note that we added a number of custom log lines, so the line numbers and log lines won't match the github code.

We are experimenting with some changes to the versioning and update_meta logic, and will update if we find anything.

----------------------------------------
It seems the info::version increment in each update_meta() causes many issues: threads/instances step on each other's toes.
Here is one of the issues it causes, which leads to the -EINVAL error in cls_fifo.cc.

This -EINVAL error is caused by a missing tag in the push_part operation.
Please see the following rgw logs:

2022-11-07T13:36:51.819-0500 7f6831994700  0 req 10582482903175471619 0.195999071s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1464 tid=984 info=(id: data_log.93, version: 3qp7OMPX9DEIa569{2469}, oid_prefix: data_log.93.NVOVMAR1C3d, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 121, head_part_num: 147, min_push_part_num: 0, max_push_part_num: 147, head_tag: , tags: {{}}, journal: {{{}})
2022-11-07T13:36:51.820-0500 7f6831994700  1 -- 10.74.4.158:0/4067375476 --> [v2:10.74.3.187:6816/717555,v1:10.74.3.187:6817/717555] -- osd_op(unknown.0.0:117220 3.b 3:d79fa440:::data_log.93.NVOVMAR1C3d.147:head [call fifo.push_part in=156b] snapc 0=[] ondisk+write+known_if_redirected+returnvec+supports_pool_eio e588) v8 -- 0x562d89e5d400 con 0x562d7daaac00

2022-11-07T13:36:51.924-0500 7f6939410700  1 -- 10.74.4.158:0/4067375476 <== osd.7 v2:10.74.3.187:6816/717555 9926 ==== osd_op_reply(117220 data_log.93.NVOVMAR1C3d.147 [call] v588'410337 uv410328 ondisk = -22 ((22) Invalid argument)) v8 ==== 171+0+0 (crc 0 0 0) 0x562d81872d80 con 0x562d7daaac00

2022-11-07T13:36:51.924-0500 7f6831994700 -1 req 10582482903175471619 0.300998598s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-22 tid=984

And osd logs:

2022-11-07T13:36:51.924-0500 7ff8a0e99700  0 <cls> /root/rpmbuild/BUILD/ceph-18.0.0_main_20221031_pr5f_jzhu4-0.bb.wip-57562-custom/src/cls/fifo/cls_fifo.cc:480: int rados::cls::fifo::{anonymous}::push_part(cls_method_context_t, ceph::buffer::v15_2_0::list*, ceph::buffer::v15_2_0::list*): tag required

The missing tag is due to the tags having been wrongly removed.

Here is how the tags are wrongly removed, according to what we observed in the logs and the source code.
1. Rgw1 handles journal entries for an info::version, say "3qp7OMPX9DEIa569{2460}" here.

2022-11-07T13:36:51.739-0500 7f6831994700  0 req 10582482903175471619 0.115999453s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):653 tid=984 info=(id: data_log.93, version: 3qp7OMPX9DEIa569{2460}, oid_prefix: data_log.93.NVOVMAR1C3d, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 121, head_part_num: 146, min_push_part_num: 0, max_push_part_num: 146, head_tag: nsKMOzD6u16yXzXc, tags: {{146=nsKMOzD6u16yXzXc,147=N8Ep6tuiBruDb4zD}}, journal: {{{147=op: Op::create, part_num: 147, part_tag: N8Ep6tuiBruDb4zD,147=op: Op::set_head, part_num: 147, part_tag: N8Ep6tuiBruDb4zD}})
2022-11-07T13:36:51.739-0500 7f6831994700  0 req 10582482903175471619 0.115999453s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):658 processing entry: entry=op: Op::create, part_num: 147, part_tag: N8Ep6tuiBruDb4zD tid=984
2022-11-07T13:36:51.739-0500 7f6831994700  1 -- 10.74.4.158:0/4067375476 --> [v2:10.74.3.187:6816/717555,v1:10.74.3.187:6817/717555] -- osd_op(unknown.0.0:116958 3.b 3:d79fa440:::data_log.93.NVOVMAR1C3d.147:head [create,call fifo.init_part in=69b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e588) v8 -- 0x562d89ac2400 con 0x562d7daaac00

2022-11-07T13:36:51.772-0500 7f6939410700  1 -- 10.74.4.158:0/4067375476 <== osd.7 v2:10.74.3.187:6816/717555 9902 ==== osd_op_reply(116958 data_log.93.NVOVMAR1C3d.147 [create,call] v588'410309 uv410309 ondisk = 0) v8 ==== 213+0+0 (crc 0 0 0) 0x562d81872d80 con 0x562d7daaac00

2022-11-07T13:36:51.772-0500 7f6831994700  0 req 10582482903175471619 0.148999289s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):658 processing entry: entry=op: Op::set_head, part_num: 147, part_tag: N8Ep6tuiBruDb4zD tid=984

2. Some other rgw instance calls update_meta and bumps the info::version to a higher number, say "3qp7OMPX9DEIa569{2463}".
3. Another thread in Rgw1 calls read_meta and updates the local info to version "3qp7OMPX9DEIa569{2463}".

2022-11-07T13:36:51.772-0500 7f65dece0700  0 req 16283652365705957296 0.153999284s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1314 tid=983 info=(id: data_log.93, version: 3qp7OMPX9DEIa569{2463}, oid_prefix: data_log.93.NVOVMAR1C3d, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 121, head_part_num: 147, min_push_part_num: 0, max_push_part_num: 147, head_tag: N8Ep6tuiBruDb4zD, tags: {{147=N8Ep6tuiBruDb4zD}}, journal: {{{}})

4. Rgw1 goes to postprocessing in process_journal, and cleans up the tags and journal entries for version "3qp7OMPX9DEIa569{2463}".

2022-11-07T13:36:51.772-0500 7f6831994700  0 req 10582482903175471619 0.148999289s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):702 postprocessing: i=0 tid=984
...
(The above log in item 3 happens here.)
...
2022-11-07T13:36:51.772-0500 7f6831994700  1 -- 10.74.4.158:0/4067375476 --> [v2:10.74.0.198:6816/716921,v1:10.74.0.198:6817/716921] -- osd_op(unknown.0.0:117059 3.d 3:b2ddf551:::data_log.93:head [call fifo.update_meta in=167b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e588) v8 -- 0x562d7dcd8c00 con 0x562d7dad6c00
2022-11-07T13:36:51.784-0500 7f6939410700  1 -- 10.74.4.158:0/4067375476 <== osd.6 v2:10.74.0.198:6816/716921 12793 ==== osd_op_reply(117059 data_log.93 [call] v588'285339 uv285339 ondisk = 0) v8 ==== 155+0+0 (crc 0 0 0) 0x562d81872d80 con 0x562d7dad6c00

https://github.com/ceph/ceph/blob/main/src/rgw/cls_fifo_legacy.cc#L704
https://github.com/ceph/ceph/blob/main/src/cls/fifo/cls_fifo_types.h#L405

5. A similar thing happens in some other rgw instances for version "3qp7OMPX9DEIa569{2464}".
6. Rgw1 calls read_meta and gets

2022-11-07T13:36:51.786-0500 7f6831994700  0 req 10582482903175471619 0.162999228s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1314 tid=984 info=(id: data_log.93, version: 3qp7OMPX9DEIa569{2464}, oid_prefix: data_log.93.NVOVMAR1C3d, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 121, head_part_num: 147, min_push_part_num: 0, max_push_part_num: 147, head_tag: , tags: {{}}, journal: {{{}})

We can see the head_tag and tags are all empty.

7. The subsequent update_meta calls all operate on empty tags.

2022-11-07T13:36:51.786-0500 7f6831994700  0 req 10582482903175471619 0.162999228s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):474 tid=984 version=3qp7OMPX9DEIa569{2464} info=(id: data_log.93, version: 3qp7OMPX9DEIa569{2464}, oid_prefix: data_log.93.NVOVMAR1C3d, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 121, head_part_num: 147, min_push_part_num: 0, max_push_part_num: 147, head_tag: , tags: {{}}, journal: {{{}})

8. After a few -ECANCELED and read_meta iterations, it reaches version "3qp7OMPX9DEIa569{2468}", which also has empty tags for the reasons described above.

2022-11-07T13:36:51.812-0500 7f6831994700  0 req 10582482903175471619 0.188999102s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1314 tid=984 info=(id: data_log.93, version: 3qp7OMPX9DEIa569{2468}, oid_prefix: data_log.93.NVOVMAR1C3d, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 121, head_part_num: 147, min_push_part_num: 0, max_push_part_num: 147, head_tag: , tags: {{}}, journal: {{{}})

9. Rgw1 performs a successful update_meta, bumping the version to "3qp7OMPX9DEIa569{2469}".

2022-11-07T13:36:51.812-0500 7f6831994700  0 req 10582482903175471619 0.188999102s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):474 tid=984 version=3qp7OMPX9DEIa569{2468} info=(id: data_log.93, version: 3qp7OMPX9DEIa569{2468}, oid_prefix: data_log.93.NVOVMAR1C3d, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 121, head_part_num: 147, min_push_part_num: 0, max_push_part_num: 147, head_tag: , tags: {{}}, journal: {{{}})
2022-11-07T13:36:51.812-0500 7f6831994700  1 -- 10.74.4.158:0/4067375476 --> [v2:10.74.0.198:6816/716921,v1:10.74.0.198:6817/716921] -- osd_op(unknown.0.0:117201 3.d 3:b2ddf551:::data_log.93:head [call fifo.update_meta in=75b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e588) v8 -- 0x562d8e3e0c00 con 0x562d7dad6c00

...
2022-11-07T13:36:51.819-0500 7f6831994700  0 req 10582482903175471619 0.160999238s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):481 rgw_rados_operate r=0.

10. Rgw1 does a push_part on version "3qp7OMPX9DEIa569{2469}" with an empty tag, which results in the -EINVAL error.
The datalog write fails.

2022-11-07T13:36:51.819-0500 7f6831994700  0 req 10582482903175471619 0.195999071s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1464 tid=984 info=(id: data_log.93, version: 3qp7OMPX9DEIa569{2469}, oid_prefix: data_log.93.NVOVMAR1C3d, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 121, head_part_num: 147, min_push_part_num: 0, max_push_part_num: 147, head_tag: , tags: {{}}, journal: {{{}})
2022-11-07T13:36:51.820-0500 7f6831994700  1 -- 10.74.4.158:0/4067375476 --> [v2:10.74.3.187:6816/717555,v1:10.74.3.187:6817/717555] -- osd_op(unknown.0.0:117220 3.b 3:d79fa440:::data_log.93.NVOVMAR1C3d.147:head [call fifo.push_part in=156b] snapc 0=[] ondisk+write+known_if_redirected+returnvec+supports_pool_eio e588) v8 -- 0x562d89e5d400 con 0x562d7daaac00

2022-11-07T13:36:51.924-0500 7f6939410700  1 -- 10.74.4.158:0/4067375476 <== osd.7 v2:10.74.3.187:6816/717555 9926 ==== osd_op_reply(117220 data_log.93.NVOVMAR1C3d.147 [call] v588'410337 uv410328 ondisk = -22 ((22) Invalid argument)) v8 ==== 171+0+0 (crc 0 0 0) 0x562d81872d80 con 0x562d7daaac00

2022-11-07T13:36:51.924-0500 7f6831994700 -1 req 10582482903175471619 0.300998598s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-22 tid=984
2022-11-07T13:36:51.924-0500 7f67606f0700 -1 req 734081344771729673 0.243998855s s3:put_obj ERROR: failed writing data log
2022-11-07T13:36:51.924-0500 7f6831994700 -1 req 10582482903175471619 0.300998598s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1178 push_part failed: r=-22 tid=984
2022-11-07T13:36:51.924-0500 7f6831994700 -1 req 10582482903175471619 0.300998598s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1484 push_entries failed: r=-22 tid=984
2022-11-07T13:36:51.924-0500 7f656b1f5700  1 -- 10.74.4.158:0/4067375476 --> [v2:10.74.0.198:6800/713875,v1:10.74.0.198:6801/713875] -- osd_op(unknown.0.0:117408 5.9 5:90ae50f3:root::testwkr15-4h-1m-thisisbcstestload0055355:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e588) v8 -- 0x562d88345000 con 0x562d7cb9ac00
2022-11-07T13:36:51.924-0500 7f6831994700 -1 req 10582482903175471619 0.300998598s s3:put_obj virtual int RGWDataChangesFIFO::push(const DoutPrefixProvider*, int, ceph::real_time, const string&, ceph::buffer::v15_2_0::list&&): unable to push to FIFO: data_log.93: (22) Invalid argument
2022-11-07T13:36:51.924-0500 7f6831994700 -1 req 10582482903175471619 0.300998598s s3:put_obj ERROR: failed writing data log
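The sequence in steps 1-10 can be sketched with a toy model (Python, not Ceph code; the version numbers and tags below are taken from the logs, everything else is illustrative). The point it demonstrates: a tag-removal update computed against an old version, when retried verbatim after a read_meta refresh, lands against a newer version and wipes a tag that newer version still needs, leaving head_tag/tags empty so a later push_part must fail with -EINVAL.

```python
class Meta:
    """Toy stand-in for the FIFO's on-RADOS metadata."""
    def __init__(self, ver, tags):
        self.ver = ver
        self.tags = dict(tags)

def update_meta(meta, expected_ver, tags_rm):
    """Conditional update: apply only if the caller's expected version matches."""
    if meta.ver != expected_ver:
        return -125               # -ECANCELED: the version has moved on
    for t in tags_rm:
        meta.tags.pop(t, None)
    meta.ver += 1
    return 0

# While Rgw1 was processing the journal for version 2460, other instances
# advanced the meta to 2463 and made part 147 (tag N8Ep...) the live head.
meta = Meta(2463, {147: "N8Ep6tuiBruDb4zD"})

# Rgw1's postprocessing removal list was computed back at version 2460.
stale_tags_rm = [146, 147]

r = update_meta(meta, 2460, stale_tags_rm)      # canceled: stale version
assert r == -125
# Retry: read_meta refreshes the expected version, then the SAME stale
# removal list is re-applied -- it now succeeds and wipes the live tag.
r = update_meta(meta, meta.ver, stale_tags_rm)
assert r == 0
assert meta.tags == {}   # empty tags -> the subsequent push_part gets -EINVAL
```

This is a deliberate simplification of the retry loop, but it matches the progression visible in the logs from version {2460} through the empty-tag info at {2469}.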

#41 Updated by Jane Zhu 3 months ago

We also found a place that might potentially cause issues.

Rgw locks the mutex and gets some data from "info" before calling _update_meta.
https://github.com/ceph/ceph/blob/main/src/rgw/cls_fifo_legacy.cc#L688

Please note the following line:

auto objv = info.version;

However, inside _update_meta, "info.version" is used instead of the "objv" previously retrieved under the lock.
https://github.com/ceph/ceph/blob/main/src/rgw/cls_fifo_legacy.cc#L468

The "info.version" could have changed between L688 and L468, hence the metadata could be written to the rados object with the wrong version.
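A minimal sketch of that window (Python, illustrative only; the version numbers are made up): the version is snapshotted under the lock, but if the later conditional update reads the live info.version instead of the snapshot, a concurrent refresh in between changes which expected version gets sent.

```python
import threading

class FIFO:
    """Toy stand-in for the rgw-side FIFO object."""
    def __init__(self):
        self.m = threading.Lock()
        self.version = 2468          # stands in for info.version

fifo = FIFO()

# Analogous to L688: take the snapshot under the lock.
with fifo.m:
    objv = fifo.version

# The window: another thread refreshes the local info (e.g. after read_meta).
fifo.version = 2470

# Analogous to L468: conditioning on the live info.version now uses a value
# the caller never observed under the lock; the snapshot does not drift.
assert objv == 2468
assert fifo.version == 2470          # live value no longer matches the snapshot
```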

#42 Updated by Jane Zhu 3 months ago

A potential bug?
https://github.com/ceph/ceph/blob/main/src/cls/fifo/cls_fifo_types.h#L66

Should it be the following?

(instance == rhs.instance && ver >= rhs.ver)
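A minimal sketch of the suggested fix (Python, illustrative only), assuming the predicate at that line is meant to answer "is lhs the same objv as rhs, or a later revision of it": two versions should compare as same-or-later only when the instance strings match AND the counter has not gone backwards.

```python
def same_or_later(lhs, rhs):
    """lhs/rhs are (instance, ver) pairs; the suggested comparison."""
    l_inst, l_ver = lhs
    r_inst, r_ver = rhs
    return l_inst == r_inst and l_ver >= r_ver

# Same instance, counter moved forward or equal: same-or-later.
assert same_or_later(("3qp7OMPX9DEIa569", 2469), ("3qp7OMPX9DEIa569", 2468))
assert same_or_later(("3qp7OMPX9DEIa569", 2468), ("3qp7OMPX9DEIa569", 2468))
# Same instance but counter went backwards: not same-or-later.
assert not same_or_later(("3qp7OMPX9DEIa569", 2467), ("3qp7OMPX9DEIa569", 2468))
# Different instance strings are never comparable, regardless of counters.
assert not same_or_later(("otherInstance", 9999), ("3qp7OMPX9DEIa569", 1))
```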

#43 Updated by Adam Emerson 3 months ago

It should, thank you. I don't think it's the underlying cause, but it's a good catch.

#44 Updated by Jane Zhu 3 months ago

Two more race conditions found. Both involve the journal entries.

Race condition 1:

~~~~~~~~~~~~~~~~~~
2022-11-13T16:15:02.547-0500 7ffa82fcf700  0 req 7298682643504930034 0.071999662s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):478 tid=50001 version=6GtY6TcMkvN3A3ne{328} update=(head_part_num: 41, max_push_part_num: 41, journal_entries_rm: {[op: Op::create, part_num: 41, part_tag: NtSYEATzFRYSRiGx,op: Op::set_head, part_num: 41, part_tag: NtSYEATzFRYSRiGx]}) info=(id: data_log.125, version: 6GtY6TcMkvN3A3ne{328}, oid_prefix: data_log.125.+0miUgwXRqr, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 32, head_part_num: 40, min_push_part_num: 0, max_push_part_num: 40, head_tag: ipn1iisEUG8C1b3T, tags: {{40=ipn1iisEUG8C1b3T,41=NtSYEATzFRYSRiGx}}, journal: {{{41=op: Op::create, part_num: 41, part_tag: NtSYEATzFRYSRiGx,41=op: Op::set_head, part_num: 41, part_tag: NtSYEATzFRYSRiGx}})

...
~~~~~~~~~~~~~~~~~~

2022-11-13T16:15:02.563-0500 7ffa28f1b700  0 req 5761976872023625470 0.029999860s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):874 tid=50003 new_head_part_num=41 info=(id: data_log.125, version: 6GtY6TcMkvN3A3ne{328}, oid_prefix: data_log.125.+0miUgwXRqr, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 32, head_part_num: 40, min_push_part_num: 0, max_push_part_num: 40, head_tag: ipn1iisEUG8C1b3T, tags: {{40=ipn1iisEUG8C1b3T,41=NtSYEATzFRYSRiGx}}, journal: {{{41=op: Op::create, part_num: 41, part_tag: NtSYEATzFRYSRiGx,41=op: Op::set_head, part_num: 41, part_tag: NtSYEATzFRYSRiGx}})
2022-11-13T16:15:02.563-0500 7ffa28f1b700  0 req 5761976872023625470 0.029999860s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):884 need new part: tid=50003
2022-11-13T16:15:02.563-0500 7ffa28f1b700  0 req 5761976872023625470 0.029999860s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):788 tid=50003 new_part_num=41 is_head=1

~~~~~~~~~~~
2022-11-13T16:15:02.563-0500 7ffa82fcf700  0 req 7298682643504930034 0.087999582s s3:put_obj int rgw::cls::fifo::FIFO::apply_update(const DoutPrefixProvider*, rados::cls::fifo::info*, const rados::cls::fifo::objv&, const rados::cls::fifo::update&, uint64_t):467 tid=50001 info=(id: data_log.125, version: 6GtY6TcMkvN3A3ne{329}, oid_prefix: data_log.125.+0miUgwXRqr, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 32, head_part_num: 41, min_push_part_num: 0, max_push_part_num: 41, head_tag: NtSYEATzFRYSRiGx, tags: {{41=NtSYEATzFRYSRiGx}}, journal: {{{}})
2022-11-13T16:15:02.563-0500 7ffa82fcf700  0 req 7298682643504930034 0.087999582s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):494 apply_update r=0
...
2022-11-13T16:15:02.563-0500 7ffa82fcf700  0 req 7298682643504930034 0.087999582s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1480 tid=50001 info=(id: data_log.125, version: 6GtY6TcMkvN3A3ne{329}, oid_prefix: data_log.125.+0miUgwXRqr, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 32, head_part_num: 41, min_push_part_num: 0, max_push_part_num: 41, head_tag: NtSYEATzFRYSRiGx, tags: {{41=NtSYEATzFRYSRiGx}}, journal: {{{}})
~~~~~~~~~~~

2022-11-13T16:15:02.564-0500 7ffa28f1b700  0 req 5761976872023625470 0.030999856s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):478 tid=50003 version=6GtY6TcMkvN3A3ne{329} update=(journal_entries_add: {[op: Op::create, part_num: 41, part_tag: HGbsmifbAVhcwuvj,op: Op::set_head, part_num: 41, part_tag: HGbsmifbAVhcwuvj]}) info=(id: data_log.125, version: 6GtY6TcMkvN3A3ne{329}, oid_prefix: data_log.125.+0miUgwXRqr, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 32, head_part_num: 41, min_push_part_num: 0, max_push_part_num: 41, head_tag: NtSYEATzFRYSRiGx, tags: {{41=NtSYEATzFRYSRiGx}}, journal: {{{}})

2022-11-13T16:15:02.569-0500 7ffa28f1b700  0 req 5761976872023625470 0.035999831s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):487 rgw_rados_operate r=0. -ECANCELED=-125

2022-11-13T16:15:02.569-0500 7ffa28f1b700  0 req 5761976872023625470 0.035999831s s3:put_obj int rgw::cls::fifo::FIFO::apply_update(const DoutPrefixProvider*, rados::cls::fifo::info*, const rados::cls::fifo::objv&, const rados::cls::fifo::update&, uint64_t):442 tid=50003 objv=6GtY6TcMkvN3A3ne{329} *info=(id: data_log.125, version: 6GtY6TcMkvN3A3ne{330}, oid_prefix: data_log.125.+0miUgwXRqr, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 32, head_part_num: 41, min_push_part_num: 0, max_push_part_num: 41, head_tag: NtSYEATzFRYSRiGx, tags: {{41=HGbsmifbAVhcwuvj}}, journal: {{{41=op: Op::create, part_num: 41, part_tag: HGbsmifbAVhcwuvj,41=op: Op::set_head, part_num: 41, part_tag: HGbsmifbAVhcwuvj}})

~~~~~~
2022-11-13T16:15:02.569-0500 7ff8993fc700  0 req 4801004184396876571 0.058999721s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1330 tid=50002 info=(id: data_log.125, version: 6GtY6TcMkvN3A3ne{330}, oid_prefix: data_log.125.+0miUgwXRqr, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 32, head_part_num: 41, min_push_part_num: 0, max_push_part_num: 41, head_tag: NtSYEATzFRYSRiGx, tags: {{41=HGbsmifbAVhcwuvj}}, journal: {{{41=op: Op::create, part_num: 41, part_tag: HGbsmifbAVhcwuvj,41=op: Op::set_head, part_num: 41, part_tag: HGbsmifbAVhcwuvj}})
~~~~~

There are 3 threads involved here: tid=50001, tid=50002, and tid=50003.
Here is what happens, in time order.

1. Tid 50001 finishes handling the journal, and calls _update_meta to set the values and remove the journal entries.
journal_entries_rm: {[op: Op::create, part_num: 41, part_tag: NtSYEATzFRYSRiGx,op: Op::set_head, part_num: 41, part_tag: NtSYEATzFRYSRiGx]})
version: 6GtY6TcMkvN3A3ne{328}

2. Tid 50003 checks whether a new head and a new part are needed. Since the changes from the above request have not been handled by the osd yet, both checks return true.
version: 6GtY6TcMkvN3A3ne{328}

3. The request sent by tid 50001 is processed successfully and returns to this rgw. The rgw calls apply_update and updates the local "info". The journal in "info" is now empty.
version: 6GtY6TcMkvN3A3ne{329}

4. Tid 50003 continues to call _update_meta with the "create" and "set_head" operations in the journal.
journal_entries_add: {[op: Op::create, part_num: 41, part_tag: HGbsmifbAVhcwuvj,op: Op::set_head, part_num: 41, part_tag: HGbsmifbAVhcwuvj]}
version: 6GtY6TcMkvN3A3ne{329}

5. The update succeeds, returns, and calls into apply_update.
version: 6GtY6TcMkvN3A3ne{329}

6. Tid 50002 calls read_meta and reads in the latest meta with version "6GtY6TcMkvN3A3ne{330}"
version: 6GtY6TcMkvN3A3ne{330}

7. In tid 50003, since the version in "update" and the version in "info" are different, apply_update fails with error code -ECANCELED.
version: 6GtY6TcMkvN3A3ne{329}, 6GtY6TcMkvN3A3ne{330}
https://github.com/adamemerson/ceph/blob/wip-57562/src/rgw/cls_fifo_legacy.cc#L443

8. Tracing the code from here, you can see the following journal entries stuck in the metadata forever, which causes a lot of issues down the road.
journal: {{{41=op: Op::create, part_num: 41, part_tag: HGbsmifbAVhcwuvj,41=op: Op::set_head, part_num: 41, part_tag: HGbsmifbAVhcwuvj}}
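The steps above can be sketched with a toy model (Python, not Ceph code; entry strings are copied from the logs, everything else is illustrative). The key asymmetry: tid 50003's journal-add lands durably on the OSD (rgw_rados_operate returns 0), but because tid 50002's read_meta has already advanced the local copy, the local apply_update sees a version mismatch and the caller treats the whole update as canceled, so nobody ever processes the journaled entries.

```python
class RemoteMeta:
    """Toy stand-in for the on-RADOS FIFO metadata."""
    def __init__(self):
        self.ver = 329
        self.journal = []

def osd_update_meta(meta, expected_ver, entries_add):
    """OSD-side conditional update: journal the entries if versions match."""
    if meta.ver != expected_ver:
        return -125                  # -ECANCELED
    meta.journal.extend(entries_add)
    meta.ver += 1
    return 0

def apply_update(local_ver, objv):
    """Client-side apply: refuse if the local info has already moved on."""
    return 0 if local_ver == objv else -125

remote = RemoteMeta()
entries = ["op: Op::create, part_num: 41, part_tag: HGbsmifbAVhcwuvj",
           "op: Op::set_head, part_num: 41, part_tag: HGbsmifbAVhcwuvj"]

# Tid 50003's update lands on the OSD: rgw_rados_operate returns 0.
assert osd_update_meta(remote, 329, entries) == 0
assert remote.ver == 330

# Meanwhile tid 50002's read_meta has advanced the local copy to 330, so
# tid 50003's apply_update (objv=329) is treated as canceled locally...
assert apply_update(330, 329) == -125
# ...yet the entries are durably journaled, with no owner left to process them.
assert remote.journal == entries
```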

Race condition 2:

This race condition happens under slightly different circumstances from the one above. Please see the following log lines from different rgw instances.
The result is the following journal entry stuck in the metadata forever.
journal: {{{2=op: Op::set_head, part_num: 2, part_tag: }}

dev.1

2022-11-12T11:52:29.481-0500 7f39e0c79700  1 -- 10.74.4.158:0/2973121943 --> [v2:10.74.0.11:6800/1386235,v1:10.74.0.11:6801/1386235] -- osd_op(unknown.0.0:1900859 3.e 3:73563833:::data_log.13:head [call fifo.get_meta in=19b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e367) v8 -- 0x560da6844c00 con 0x560d8d23c800

...
2022-11-12T11:52:29.492-0500 7f3be372c700  1 -- 10.74.4.158:0/2973121943 <== osd.2 v2:10.74.0.11:6800/1386235 303168 ==== osd_op_reply(1900859 data_log.13 [call out=244b] v0'0 uv28771 ondisk = 0) v8 ==== 155+0+244 (crc 0 0 0) 0x560d98212240 con 0x560d8d23c800

2022-11-12T11:52:29.492-0500 7f39e0c79700  0 req 17546264727929927449 0.147999316s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1325 tid=1777 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{8}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{2=op: Op::set_head, part_num: 2, part_tag: }})

dev.1

2022-11-12T11:52:29.481-0500 7f380c0d0700  1 -- 10.74.4.158:0/2973121943 --> [v2:10.74.0.11:6800/1386235,v1:10.74.0.11:6801/1386235] -- osd_op(unknown.0.0:1900861 3.e 3:73563833:::data_log.13:head [call fifo.get_meta in=19b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e367) v8 -- 0x560d8de12800 con 0x560d8d23c800
...

2022-11-12T11:52:29.492-0500 7f3be372c700  1 -- 10.74.4.158:0/2973121943 <== osd.2 v2:10.74.0.11:6800/1386235 303169 ==== osd_op_reply(1900861 data_log.13 [call out=244b] v0'0 uv28771 ondisk = 0) v8 ==== 155+0+244 (crc 0 0 0) 0x560d98212240 con 0x560d8d23c800

2022-11-12T11:52:29.492-0500 7f380c0d0700  0 req 4817637653451614143 0.079999633s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1325 tid=1778 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{8}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{2=op: Op::set_head, part_num: 2, part_tag: }})

dev.1  ????????????????????????
2022-11-12T11:52:29.465-0500 7f3b4bf4f700  0 req 14970605871587688978 0.035999831s s3:put_obj RGWDataChangesLog::add_entry() bucket.name=testwkr15-4h-2k-thisisbcstestload0012252 shard_id=0 now=2022-11-12T11:52:29.467084-0500 cur_expiration=1969-12-31T19:00:00.000000-0500
2022-11-12T11:52:29.465-0500 7f3b4bf4f700  0 req 14970605871587688978 0.035999831s s3:put_obj RGWDataChangesLog::add_entry() be->push. index=13 key=testwkr15-4h-2k-thisisbcstestload0012252:e1575adc-fc97-4d2d-86dd-b958130b04f8.6324.7466:0
2022-11-12T11:52:29.465-0500 7f3b4bf4f700  0 req 14970605871587688978 0.035999831s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1412 tid=1779 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{5}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 1, min_push_part_num: 0, max_push_part_num: 1, head_tag: 7Ih1xnRVUR87yuQL, tags: {{1=7Ih1xnRVUR87yuQL}}, journal: {{{}})
2022-11-12T11:52:29.465-0500 7f3b4bf4f700  0 req 14970605871587688978 0.035999831s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1470 prepared push: remaining=0 batch=1 retries=0 batch_len=174 tid=1779
2022-11-12T11:52:29.465-0500 7f3b4bf4f700  0 req 14970605871587688978 0.035999831s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1475 tid=1779 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{5}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 1, min_push_part_num: 0, max_push_part_num: 1, head_tag: 7Ih1xnRVUR87yuQL, tags: {{1=7Ih1xnRVUR87yuQL}}, journal: {{{}})
2022-11-12T11:52:29.465-0500 7f3b4bf4f700  1 -- 10.74.4.158:0/2973121943 --> [v2:10.74.0.198:6808/1390707,v1:10.74.0.198:6809/1390707] -- osd_op(unknown.0.0:1900799 3.18 3:1ca3d4bd:::data_log.13.5Ix6oXD%2fTeT.1:head [call fifo.push_part in=171b] snapc 0=[] ondisk+write+known_if_redirected+returnvec+supports_pool_eio e367) v8 -- 0x560d9bb03c00 con 0x560d8cfaa000

2022-11-12T11:52:29.472-0500 7f3be2f2b700  1 -- 10.74.4.158:0/2973121943 <== osd.3 v2:10.74.0.198:6808/1390707 415651 ==== osd_op_reply(1900799 data_log.13.5Ix6oXD/TeT.1 [call] v367'52591 uv52585 ondisk = -34 ((34) Numerical result out of range)) v8 ==== 169+0+0 (crc 0 0 0) 0x560d95ce1b00 con 0x560d8cfaa000

2022-11-12T11:52:29.472-0500 7f3b4bf4f700 -1 req 14970605871587688978 0.042999800s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=1779
2022-11-12T11:52:29.472-0500 7f3b4bf4f700 -1 req 14970605871587688978 0.042999800s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1189 push_part failed: r=-34 tid=1779
2022-11-12T11:52:29.472-0500 7f3b4bf4f700  0 req 14970605871587688978 0.042999800s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1482 need new head tid=1779
2022-11-12T11:52:29.472-0500 7f3b4bf4f700  0 req 14970605871587688978 0.042999800s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):870 tid=1779 new_head_part_num=2 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{7}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{}})
2022-11-12T11:52:29.472-0500 7f3b4bf4f700  0 req 14970605871587688978 0.042999800s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):905 tid=1779 jentry=(op: Op::set_head, part_num: 2, part_tag: )
2022-11-12T11:52:29.472-0500 7f3b4bf4f700  0 req 14970605871587688978 0.042999800s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):912 updating metadata: i=0 tid=1779
2022-11-12T11:52:29.472-0500 7f3b4bf4f700  0 req 14970605871587688978 0.042999800s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):474 tid=1779 version=4OkaLaZsVoIVzNgB{7} (*****update=(journal_entries_add: {[op: Op::set_head, part_num: 2, part_tag: ]})*****) info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{7}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{}})
2022-11-12T11:52:29.472-0500 7f3b4bf4f700  1 -- 10.74.4.158:0/2973121943 --> [v2:10.74.0.11:6800/1386235,v1:10.74.0.11:6801/1386235] -- osd_op(unknown.0.0:1900831 3.e 3:73563833:::data_log.13:head [call fifo.update_meta in=89b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e367) v8 -- 0x560db3d04c00 con 0x560d8d23c800

2022-11-12T11:52:29.492-0500 7f3be372c700  1 -- 10.74.4.158:0/2973121943 <== osd.2 v2:10.74.0.11:6800/1386235 303167 ==== osd_op_reply(1900831 data_log.13 [call] v367'28771 uv28771 ondisk = 0) v8 ==== 155+0+0 (crc 0 0 0) 0x560d98212240 con 0x560d8d23c800
2022-11-12T11:52:29.492-0500 7f3b4bf4f700  0 req 14970605871587688978 0.062999710s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):483 rgw_rados_operate r=0. -ECANCELED=-125
2022-11-12T11:52:29.492-0500 7f3b68f89700  1 -- 10.74.4.158:0/2973121943 --> [v2:10.74.0.11:6808/1387815,v1:10.74.0.11:6809/1387815] -- osd_op(unknown.0.0:1900879 8.9 8:925fb5de:::e1575adc-fc97-4d2d-86dd-b958130b04f8.6092.15188_myobjects30000250:head [cmpxattr user.rgw.idtag (62) op 1 mode 1 in=76b,create,call rgw.obj_remove in=40b,setxattr user.rgw.idtag (61) in=75b,setxattr user.rgw.tail_tag (61) in=78b,writefull 0~2000 in=2000b,setxattr user.rgw.manifest (392) in=409b,setxattr user.rgw.acl (177) in=189b,setxattr user.rgw.content_type (25) in=46b,setxattr user.rgw.etag (32) in=45b,call rgw.obj_store_pg_ver in=44b,setxattr user.rgw.source_zone (4) in=24b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e367) v8 -- 0x560da04ee400 con 0x560d96c62400
2022-11-12T11:52:29.492-0500 7f3b4bf4f700  0 req 14970605871587688978 0.062999710s s3:put_obj int rgw::cls::fifo::FIFO::apply_update(const DoutPrefixProvider*, rados::cls::fifo::info*, const rados::cls::fifo::objv&, const rados::cls::fifo::update&, uint64_t):442 tid=1779 objv=4OkaLaZsVoIVzNgB{7} *info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{8}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{2=op: Op::set_head, part_num: 2, part_tag: }})
2022-11-12T11:52:29.492-0500 7f3b4bf4f700 -1 req 14970605871587688978 0.062999710s s3:put_obj int rgw::cls::fifo::FIFO::apply_update(const DoutPrefixProvider*, rados::cls::fifo::info*, const rados::cls::fifo::objv&, const rados::cls::fifo::update&, uint64_t):447 version mismatch, canceling: tid=1779
2022-11-12T11:52:29.492-0500 7f3b4bf4f700  0 req 14970605871587688978 0.062999710s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):490 apply_update r=-125
2022-11-12T11:52:29.492-0500 7f3b4bf4f700  0 req 14970605871587688978 0.062999710s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):495 operation has been canceled.

2022-11-12T11:52:29.492-0500 7f3b4bf4f700  1 -- 10.74.4.158:0/2973121943 --> [v2:10.74.0.11:6800/1386235,v1:10.74.0.11:6801/1386235] -- osd_op(unknown.0.0:1900880 3.e 3:73563833:::data_log.13:head [call fifo.get_meta in=19b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e367) v8 -- 0x560d9d71b000 con 0x560d8d23c800
...

2022-11-12T11:52:29.493-0500 7f3be372c700  1 -- 10.74.4.158:0/2973121943 <== osd.2 v2:10.74.0.11:6800/1386235 303171 ==== osd_op_reply(1900880 data_log.13 [call out=244b] v0'0 uv28771 ondisk = 0) v8 ==== 155+0+244 (crc 0 0 0) 0x560d98212240 con 0x560d8d23c800

2022-11-12T11:52:29.494-0500 7f3b4bf4f700  0 req 14970605871587688978 0.064999700s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1325 tid=1779 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{8}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{2=op: Op::set_head, part_num: 2, part_tag: }})
2022-11-12T11:52:29.494-0500 7f3b4bf4f700  0 req 14970605871587688978 0.064999700s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):498 read_meta r=0
2022-11-12T11:52:29.494-0500 7f3b4bf4f700  0 req 14970605871587688978 0.064999700s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):505 canceled: tid=1779
2022-11-12T11:52:29.494-0500 7f3b4bf4f700  0 req 14970605871587688978 0.064999700s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):922 raced, but journaled and processed: i=0 tid=1779
2022-11-12T11:52:29.494-0500 7f3b4bf4f700  0 req 14970605871587688978 0.064999700s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1470 prepared push: remaining=0 batch=1 retries=1 batch_len=174 tid=1779
2022-11-12T11:52:29.494-0500 7f3b4bf4f700  0 req 14970605871587688978 0.064999700s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1475 tid=1779 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{8}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{2=op: Op::set_head, part_num: 2, part_tag: }})
2022-11-12T11:52:29.494-0500 7f3b4bf4f700  1 -- 10.74.4.158:0/2973121943 --> [v2:10.74.0.11:6816/1389295,v1:10.74.0.11:6817/1389295] -- osd_op(unknown.0.0:1900885 3.9 3:92a478d3:::data_log.13.5Ix6oXD%2fTeT.2:head [call fifo.push_part in=171b] snapc 0=[] ondisk+write+known_if_redirected+returnvec+supports_pool_eio e367) v8 -- 0x560d8ed64000 con 0x560d93204000

2022-11-12T11:52:29.548-0500 7f3be372c700  1 -- 10.74.4.158:0/2973121943 <== osd.8 v2:10.74.0.11:6816/1389295 148603 ==== osd_op_reply(1900885 data_log.13.5Ix6oXD/TeT.2 [call] v367'39610 uv39610 ondisk = 1) v8 ==== 169+0+0 (crc 0 0 0) 0x560d98212240 con 0x560d93204000

2022-11-12T11:52:29.549-0500 7f3b4bf4f700  1 ====== req done req=0x7f37acdfe700 op status=0 http_status=200 latency=0.119999439s ======
2022-11-12T11:52:29.549-0500 7f3b4bf4f700  1 beast: 0x7f37acdfe700: 10.74.0.15 - cosbench-001 [12/Nov/2022:11:52:29.429 -0500] "PUT /testwkr15-4h-2k-thisisbcstestload0012252/myobjects30000252 HTTP/1.1" 200 2000 - "aws-sdk-java/1.10.76 Linux/3.10.0-1160.59.1.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.172-b11/1.8.0_172" - latency=0.119999439s

dev.1
2022-11-12T11:52:29.506-0500 7f3a1ecf5700  1 -- 10.74.4.158:0/2973121943 --> [v2:10.74.0.11:6800/1386235,v1:10.74.0.11:6801/1386235] -- osd_op(unknown.0.0:1900913 3.e 3:73563833:::data_log.13:head [call fifo.get_meta in=19b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e367) v8 -- 0x560da3c55000 con 0x560d8d23c800
...

2022-11-12T11:52:29.507-0500 7f3be372c700  1 -- 10.74.4.158:0/2973121943 <== osd.2 v2:10.74.0.11:6800/1386235 303176 ==== osd_op_reply(1900913 data_log.13 [call out=244b] v0'0 uv28771 ondisk = 0) v8 ==== 155+0+244 (crc 0 0 0) 0x560d98212240 con 0x560d8d23c800

2022-11-12T11:52:29.507-0500 7f3a1ecf5700  0 req 13818150964604892463 0.104999512s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1325 tid=1780 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{8}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{2=op: Op::set_head, part_num: 2, part_tag: }})

dev.2   ****************

2022-11-12T11:52:29.440-0500 7f2dfc2d8700  0 req 885948629711350777 0.179999173s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):707 postprocessing: i=0 tid=1758
2022-11-12T11:52:29.440-0500 7f2dfc2d8700  0 req 885948629711350777 0.179999173s s3:put_obj int rgw::cls::fifo::FIFO::_update_meta(const DoutPrefixProvider*, const rados::cls::fifo::update&, rados::cls::fifo::objv, bool*, uint64_t, optional_yield):474 tid=1758 version=4OkaLaZsVoIVzNgB{6} update=(head_part_num: 2, max_push_part_num: 2, journal_entries_rm: {[op: Op::create, part_num: 2, part_tag: 7LedASIbULCce4HS,op: Op::set_head, part_num: 2, part_tag: 7LedASIbULCce4HS]}) info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{6}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 1, min_push_part_num: 0, max_push_part_num: 1, head_tag: 7Ih1xnRVUR87yuQL, tags: {{1=7Ih1xnRVUR87yuQL,2=7LedASIbULCce4HS}}, journal: {{{2=op: Op::create, part_num: 2, part_tag: 7LedASIbULCce4HS,2=op: Op::set_head, part_num: 2, part_tag: 7LedASIbULCce4HS}})
2022-11-12T11:52:29.440-0500 7f2dfc2d8700  1 -- 10.74.4.158:0/329199693 --> [v2:10.74.0.11:6800/1386235,v1:10.74.0.11:6801/1386235] -- osd_op(unknown.0.0:1900630 3.e 3:73563833:::data_log.13:head [call fifo.update_meta in=159b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e367) v8 -- 0x559ba15fc000 con 0x559b8ad4d400
...

2022-11-12T11:52:29.468-0500 7f2f6c466700  1 -- 10.74.4.158:0/329199693 <== osd.2 v2:10.74.0.11:6800/1386235 303481 ==== osd_op_reply(1900630 data_log.13 [call] v367'28767 uv28767 ondisk = 0) v8 ==== 155+0+0 (crc 0 0 0) 0x559ba30bdb00 con 0x559b8ad4d400

dev.2

2022-11-12T11:52:29.486-0500 7f2dc9272700  1 -- 10.74.4.158:0/329199693 --> [v2:10.74.0.11:6800/1386235,v1:10.74.0.11:6801/1386235] -- osd_op(unknown.0.0:1900741 3.e 3:73563833:::data_log.13:head [call fifo.get_meta in=19b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e367) v8 -- 0x559b9f514000 con 0x559b8ad4d400

...

2022-11-12T11:52:29.491-0500 7f2f6c466700  1 -- 10.74.4.158:0/329199693 <== osd.2 v2:10.74.0.11:6800/1386235 303485 ==== osd_op_reply(1900741 data_log.13 [call out=244b] v0'0 uv28771 ondisk = 0) v8 ==== 155+0+244 (crc 0 0 0) 0x559ba30bdb00 con 0x559b8ad4d400

2022-11-12T11:52:29.491-0500 7f2dc9272700  0 req 16397087694771293664 0.158999264s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1325 tid=1759 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{8}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{2=op: Op::set_head, part_num: 2, part_tag: }})

--------------------------------------------
0-161,

dev.1

2022-11-12T11:52:29.521-0500 7f14a5d1f700  1 -- 10.74.0.161:0/2247237324 --> [v2:10.74.0.11:6800/1386235,v1:10.74.0.11:6801/1386235] -- osd_op(unknown.0.0:1997625 3.e 3:73563833:::data_log.13:head [call fifo.get_meta in=19b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e367) v8 -- 0x556205149c00 con 0x5561fef76000
...

2022-11-12T11:52:29.522-0500 7f17d622d700  1 -- 10.74.0.161:0/2247237324 <== osd.2 v2:10.74.0.11:6800/1386235 314446 ==== osd_op_reply(1997625 data_log.13 [call out=244b] v0'0 uv28771 ondisk = 0) v8 ==== 155+0+244 (crc 0 0 0) 0x556212dad8c0 con 0x5561fef76000

2022-11-12T11:52:29.522-0500 7f14a5d1f700  0 req 17620437727120733541 0.098001137s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1325 tid=1787 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{8}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{2=op: Op::set_head, part_num: 2, part_tag: }})

dev.2

2022-11-12T11:52:29.777-0500 7f6cb97e7700  1 -- 10.74.0.161:0/2569115418 --> [v2:10.74.0.11:6800/1386235,v1:10.74.0.11:6801/1386235] -- osd_op(unknown.0.0:1870919 3.e 3:73563833:::data_log.13:head [call fifo.get_meta in=19b] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e367) v8 -- 0x55cbc1bb6800 con 0x55cbc15fd800
...

2022-11-12T11:52:29.778-0500 7f6cfdf1e700  1 -- 10.74.0.161:0/2569115418 <== osd.2 v2:10.74.0.11:6800/1386235 300304 ==== osd_op_reply(1870919 data_log.13 [call out=244b] v0'0 uv28771 ondisk = 0) v8 ==== 155+0+244 (crc 0 0 0) 0x55cbbfd53440 con 0x55cbc15fd80
2022-11-12T11:52:29.779-0500 7f6cb97e7700  0 req 12834601832349424490 0.101001166s s3:put_obj int rgw::cls::fifo::FIFO::read_meta(const DoutPrefixProvider*, uint64_t, optional_yield):1325 tid=1779 info=(id: data_log.13, version: 4OkaLaZsVoIVzNgB{8}, oid_prefix: data_log.13.5Ix6oXD/TeT, params: {max_part_size: 1048576, max_entry_size: 32768, full_size_threshold: 1015750}, tail_part_num: 0, head_part_num: 2, min_push_part_num: 0, max_push_part_num: 2, head_tag: 7LedASIbULCce4HS, tags: {{2=7LedASIbULCce4HS}}, journal: {{{2=op: Op::set_head, part_num: 2, part_tag: }})

#45 Updated by Jane Zhu 3 months ago

Adam, I submitted a PR on top of yours. My 8-hour test passed with this PR; the same test usually failed on earlier versions within an hour or two. We are doing more tests now to see whether more race conditions manifest.

https://github.com/adamemerson/ceph/pull/1

#46 Updated by Adam Emerson 3 months ago

I have pushed a new commit. It disables all use of the part tags. I believe this should eliminate any remaining difficulties.

It is also my suspicion that with this change we will no longer need Juan Zhu's commit 'rgw/fifo: fix some race conditions on FIFO journal entries' as the situation where a client gets the wrong (or no) head tag can no longer occur.

(I haven't removed it from the PR, but if you can test with and without it and see if it still makes a difference I'd appreciate it.)

#47 Updated by Adam Emerson 3 months ago

Hold off for now, I've introduced one problem I need to debug.

#48 Updated by Oguzhan Ozmen 3 months ago

Adam Emerson wrote:

Hold off for now, I've introduced one problem I need to debug.

Ack.

We are currently testing the HEAD at https://github.com/ceph/ceph/commit/74848f8b0a7b4fa13e286351bb7da91d11b8cbaa.
Let us know when your new commit (cls/fifo: Remove all use of 'tags') is ready.

#49 Updated by Krunal Chheda 3 months ago

Hey Adam,
So after all of our testing and debugging of the issue, the remaining race condition is that the same part is created twice with different versions, resulting in a failure to write to the datalog, so the secondary is not able to pull the changes.
More importantly, the duplicate part creation is happening on the SAME RGW INSTANCE across multiple threads. The multi-instance race is already resolved by the changes inside cls_fifo_types, but multiple threads within a single instance can still race on new head/part creation.
The reason is that while creating a new head inside cls_legacy_fifo.cc, we acquire and release locks intermittently inside the individual functions (_prepare_new_head --> _prepare_new_part --> update_meta --> apply_update --> process_journal --> create_part --> update_meta --> apply_update). Because of these intermittent locks, another thread often sneaks in and succeeds in creating a new part with a different version even after all the checks.

So what I just tested was locking the whole prepare_new_head sequence under a SINGLE lock, so the intermittent-lock issue goes away and at any given time only one thread can create a new part.
As part of testing, I stress-tested aggressively by reducing the part size to 1M and the datalog shards to 32; this way the parts fill up faster and new-head creation is called more often, so the race condition gets exercised heavily. I was able to successfully match the object counts across zones after running the aggressive test for 4 hours.
Here is the link to the PR. The PR is based on the latest changes in your current branch, so you will only see the relevant changes.
Also, holding the lock for a longer time might have some performance implications; however, in the real world new-head creation will not happen every few seconds, so there shouldn't be a significant performance impact.
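
The single-lock approach described above can be sketched as follows. This is a minimal illustration only: `MiniFifo`, `prepare_new_head`, and the in-memory state are hypothetical stand-ins, not Ceph's actual FIFO code.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <set>

// Hypothetical sketch of the locking change: rather than taking and
// dropping a lock inside each step of the chain (_prepare_new_head ->
// _prepare_new_part -> update_meta -> ...), one lock is held across the
// whole new-head sequence, and head_part_num is re-checked under it, so
// a caller that lost the race becomes a no-op instead of creating the
// same part again with a different version.
struct MiniFifo {
  std::mutex head_lock;        // single lock guarding the whole sequence
  int64_t head_part_num = 0;
  std::set<int64_t> parts{0};  // part numbers that currently exist

  // Returns true only for the caller that actually created the new head.
  bool prepare_new_head(int64_t new_head) {
    std::lock_guard<std::mutex> g(head_lock);
    if (head_part_num >= new_head)
      return false;            // another caller already advanced the head
    parts.insert(new_head);    // "create part"
    head_part_num = new_head;  // "set head"
    return true;
  }
};
```

With the lock held across the whole sequence, the loser of the race observes the already-advanced head under the same lock and backs off.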

Take a look at the PR and let me know your suggestions.
Also, along with the lock changes I have another minor change on version increment. Earlier, the version increment for the OSD header happened inside cls_fifo.cc (write_header()) and for RGW it happened inside fifo::apply_update; I moved the version increment into cls_fifo_types.h (apply_update), so both the OSD and RGW increment the version in the same place.
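
The version-increment change described above might look roughly like this. It is a hedged sketch with made-up types (`FifoInfo` is not Ceph's); the real code in cls_fifo_types.h is more involved.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of centralizing the version bump: apply_update compares the
// caller's expected version, applies the change, and increments the
// version itself, so the OSD-side write path and the RGW-side client
// path can no longer advance it in two different places.
struct FifoInfo {
  uint64_t ver = 0;
  int64_t head_part_num = 0;

  // Returns 0 on success, -125 (ECANCELED) on a version mismatch so the
  // caller knows to read_meta and retry.
  int apply_update(uint64_t expected_ver, int64_t new_head) {
    if (expected_ver != ver)
      return -125;
    head_part_num = new_head;
    ++ver;  // the single point where the version advances
    return 0;
  }
};
```

A caller holding a stale version gets ECANCELED and must refresh before retrying, which matches the `version mismatch, canceling` lines in the logs above.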

#50 Updated by Krunal Chheda 2 months ago

Hi Adam,
Wanted to give you an update on the testing we did over the weekend.
We took 2 PRs: one with the updated `latest - 1` PR (without the tags changes), and a second one with the changes proposed in note #49 (again without the tag changes, but holding a global lock while creating a new head).

We ran the tests continuously for more than 24 hours (writing around 5 million S3 objects spread across 10000 buckets with 10 S3 users). We saw an object mismatch for the `latest - 1` PR, but for the proposed PR the object counts MATCHED.

For the object mismatch scenario we have yet to analyze the logs; however, we suspect it could still be due to a race condition between multiple threads trying to create the head for the same part. We are going to continue the load test on our proposed PR, since it kept the object count intact in the 24-hour run.
Whenever you get a chance do take a look at the proposed changes and let us know your feedback.

Also, based on our previous test analysis, we think removal of the tags will still not completely solve the race condition, as there were cases where, even with the same tag, multiple heads were created for the same part.

#51 Updated by Jane Zhu 2 months ago

Did some investigation on the latest failure of the tests on the `latest - 1` PR (without the tags changes), and it looks to be caused by a different issue. Some logs are here:
On osd:

2022-11-18T12:20:28.173-0500 7f1789383700  0 <cls> /root/rpmbuild/BUILD/ceph-18.0.0-0.bb.pr5h/src/cls/fifo/cls_fifo.cc:112: ERROR: int rados::cls::fifo::{anonymous}::read_part_header(cls_method_context_t, rados::cls::fifo::part_header*): failed decoding part header
...
2022-11-18T12:20:28.173-0500 7f1789383700  0 <cls> /root/rpmbuild/BUILD/ceph-18.0.0-0.bb.pr5h/src/cls/fifo/cls_fifo.cc:487: int rados::cls::fifo::{anonymous}::push_part(cls_method_context_t, ceph::buffer::v15_2_0::list*, ceph::buffer::v15_2_0::list*): failed to read part header
...

on rgw:

...
2022-11-18T12:20:28.164-0500 7f8ce6bad700  0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
2022-11-18T12:20:28.181-0500 7f8b8a246700 -1 req 18099188981836744004 0.054000098s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque<ceph::buffer::v15_2_0::list>, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-5 tid=1159
2022-11-18T12:20:28.181-0500 7f8b8a246700 -1 req 18099188981836744004 0.054000098s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1146 push_part failed: r=-5 tid=1159
2022-11-18T12:20:28.181-0500 7f8b8a246700 -1 req 18099188981836744004 0.054000098s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1446 push_entries failed: r=-5 tid=1159
2022-11-18T12:20:28.181-0500 7f8b8a246700 -1 req 18099188981836744004 0.054000098s s3:put_obj virtual int RGWDataChangesFIFO::push(const DoutPrefixProvider*, int, ceph::real_time, const string&, ceph::buffer::v15_2_0::list&&): unable to push to FIFO: data_log.69: (5) Input/output error
2022-11-18T12:20:28.181-0500 7f8b8a246700 -1 req 18099188981836744004 0.054000098s s3:put_obj ERROR: failed writing data log
...

It looks like something is messed up in the part header.
The issue is scattered across multiple datalog shards. We will continue to investigate the potential cause.

#52 Updated by Jane Zhu 2 months ago

Hi Adam, another question regarding your changes to remove the use of "tags". I'd like to understand your opinion/plan on this. Is it going to be part of the fix for this replication issue, or a code cleanup/improvement? If the former, any update on this front? If the latter, should it be in a separate PR?

#53 Updated by Adam Emerson 2 months ago

Also, based on our previous test analysis, we think removal of the tags will still not completely solve the race condition, as there were cases where, even with the same tag, multiple heads were created for the same part.

Can you explain this to me? Create and set-head should, if we remove the tag, be completely idempotent. (That is, having two nodes running set-head might cause one of them to end up with a different tag than the other, which can screw things up.)

Set-head should do nothing but set the current head to a given part (which should be fine if it's the same part and we aren't using part-tags)

And create should simply return success if the part already exists.

Which bit of this is failing? If it's one of those three assumptions going wrong I'd like to fix that and restore the idempotence we should have.
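
The idempotence described above can be sketched as follows. This is a minimal illustration with hypothetical names (`TaglessFifo` is not Ceph's API): once tags are gone, replaying create or set-head from a racing node is harmless.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Sketch of the two idempotent operations: create_part returns success
// even if the part already exists, and set_head only moves the head
// pointer forward, so two nodes running either operation cannot
// corrupt state by racing.
struct TaglessFifo {
  std::set<int64_t> parts;
  int64_t head_part_num = -1;

  int create_part(int64_t n) {
    parts.insert(n);  // creating an already-existing part is still success
    return 0;
  }
  int set_head(int64_t n) {
    if (n > head_part_num)
      head_part_num = n;  // setting the head to the same part is a no-op
    return 0;
  }
};
```

With part tags, by contrast, the second creator could attach a different tag to the same part, which is exactly the "different version" failure seen in the logs.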

#54 Updated by Adam Emerson 2 months ago

Also the goal is to remove tags as part of the fix.

We had some confusion over where a regression is and it's confirmed to be part of the current tags commit, so I'm going to make a more minimal one that should touch less.

#55 Updated by Krunal Chheda 2 months ago

So currently, without the tag changes, we see an issue as mentioned here in the comment.
We are analyzing the logs. So far, what we see is that 2 RGW instances spread across different hosts have different INFO structures, and one of them is pushing to a different part header (one is pushing to 74 and the other to 75) and getting EIO (failed to decode the part header while reading it), and hence it fails to push the entry to the datalog and the sync fails.
As part of a fix, we reload the metadata when we get EIO, and with that we are able to get past this.
Currently, inside FIFO::push(), for any error other than ERANGE we do not retry the push, so the datalog entry is not written, resulting in a mismatch.
So we just added code to read the metadata when we receive EIO, after which we are able to push the entry to the datalog.
We are still analyzing why we are getting EIO.
But we still feel that because today we do not retry on anything other than ERANGE, this will result in an object mismatch.
So we should retry; worst case, if the error is not recoverable, the retry will still fail, but it will be attempted 10 times and eventually give up.
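
The retry policy proposed here can be sketched as follows. This is a hedged illustration: the `push_part`/`read_meta` callbacks and the error-code constants are stand-ins, not the real FIFO interface.

```cpp
#include <cassert>
#include <functional>

namespace sketch {
// Stand-in error codes and retry bound (errno values for EIO/ERANGE).
constexpr int kEIO = 5, kERANGE = 34, kMaxRetries = 10;

// Today only ERANGE leads to a new-head retry; the proposal is to also
// refresh the metadata and retry on EIO (stale part), bounded at 10
// attempts so an unrecoverable error still gives up eventually.
int push_with_retry(const std::function<int()>& push_part,
                    const std::function<void()>& read_meta) {
  int r = 0;
  for (int attempt = 0; attempt < kMaxRetries; ++attempt) {
    r = push_part();
    if (r >= 0) return r;              // pushed successfully
    if (-r != kERANGE && -r != kEIO)
      return r;                        // unrecoverable: fail immediately
    read_meta();                       // refresh stale info, then retry
  }
  return r;                            // retries exhausted
}
}  // namespace sketch
```

An unrecoverable error still fails on the first attempt, while EIO and ERANGE trigger a metadata refresh and a bounded retry.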

#56 Updated by Jane Zhu 2 months ago

Adam Emerson wrote:

Also the goal is to remove tags as part of the fix.

We had some confusion over where a regression is and it's confirmed to be part of the current tags commit, so I'm going to make a more minimal one that should touch less.

Thanks for the update, Adam!
In that case, I wonder if you could expedite this work, since we are currently blocked on it.
With this change pending, all our tests/investigations based on the earlier PR could be wasted.

#57 Updated by Adam Emerson 2 months ago

Yes, I just pushed it. We're still testing for regression, but you're welcome to try it.

#58 Updated by Jane Zhu 2 months ago

Adam Emerson wrote:

Yes, I just pushed it. We're still testing for regression, but you're welcome to try it.

Thanks Adam!

#59 Updated by Adam Emerson 2 months ago

Can you point me to the PR for retrying on EIO?

#60 Updated by Krunal Chheda 2 months ago

Adam Emerson wrote:

Can you point me to the PR for retrying on EIO?

https://github.com/adamemerson/ceph/pull/4/commits/d23033228fb9cbffc044ad2eec1143397a99c0ff

#61 Updated by Krunal Chheda 2 months ago

So coming back to the EIO analysis: what we have found so far is that this issue happens when more than 1 RGW instance is running.
When the cosbench run is spread evenly across the 2 RGW instances, the info structure on both RGWs stays in sync with the OSD for each datalog shard, because of continuous I/O on both RGWs. When a part is full, both RGWs try to create the new part / set the head; only one of the instances succeeds, while the other gets a raced journal entry when trying to create the new part, receives ECANCELED from the OSD, and hence calls read_meta and updates its info struct to point to the new part num.

However, in a scenario where the I/O is not spread evenly across the RGW instances, one of the instances does not receive continuous load, so its info structure points to a STALE part_num, while the RGW with continuous load goes on to create a new part and update its info struct with the new part_num; the other RGW still points to the previous part_num. Now when the stale RGW tries to push to the part using the stale info struct, it gets an EIO. The EIO comes from the OSD while trying to decode the part header; the actual error, when printed, is: End of buffer [buffer:2]

  part_header part_header;
  int r = read_part_header(hctx, &part_header);
  if (r < 0) {
    CLS_ERR("%s: failed to read part header", __PRETTY_FUNCTION__);
    return r;
  }

Logs from Active RGW:

2022-11-29T14:51:48.894-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1431 preparing push: remaining=0 batch=8 retries=0 tid=249 oid= data_log.3./hjk7fsvjH2.2878
2022-11-29T14:51:48.894-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1159 entering: tid=249 oid= data_log.3./hjk7fsvjH2.2878
2022-11-29T14:51:48.897-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1462 erange, trying new head need new head tid=249 oid= data_log.3./hjk7fsvjH2.2878
2022-11-29T14:51:48.897-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):847 entering: tid=249 oid= data_log.3./hjk7fsvjH2.2879
2022-11-29T14:51:48.897-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):857 need new part: tid=249 oid= data_log.3./hjk7fsvjH2.2879
2022-11-29T14:51:48.897-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):761 entering: tid=249 new_part_num= 2879 is_head = 1 oid= data_log.3./hjk7fsvjH2.2879
2022-11-29T14:51:48.897-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):783 needs new head: tid=249 oid= data_log.3./hjk7fsvjH2.2879
2022-11-29T14:51:48.897-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):796 updating metadata: i=0 tid=249 oid= data_log.3./hjk7fsvjH2.2879
2022-11-29T14:51:48.900-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):632 entering: tid=249 oid= data_log.3./hjk7fsvjH2.2878
2022-11-29T14:51:48.900-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):646 processing entry: entry=op: Op::create, part_num: 2879, part_tag: 10vDISYrC17SOvP6 tid=249 oid= data_log.3./hjk7fsvjH2.2879
2022-11-29T14:51:48.900-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::create_part(const DoutPrefixProvider*, int64_t, std::string_view, uint64_t, optional_yield):590 entering: tid= 249 part_num = 2879 tag = << 10vDISYrC17SOvP6 oid= data_log.3./hjk7fsvjH2.2879
2022-11-29T14:51:48.910-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):646 processing entry: entry=op: Op::set_head, part_num: 2879, part_tag: 10vDISYrC17SOvP6 tid=249 oid= data_log.3./hjk7fsvjH2.2879
2022-11-29T14:51:48.910-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):690 postprocessing: i=0 tid=249 oid= data_log.3./hjk7fsvjH2.2879
2022-11-29T14:51:48.913-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1431 preparing push: remaining=0 batch=8 retries=1 tid=249 oid= data_log.3./hjk7fsvjH2.2879
2022-11-29T14:51:48.913-0500 7f2e9c79f700  1 rgw data changes log: int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1159 entering: tid=249 oid= data_log.3./hjk7fsvjH2.2879

Logs from Stale RGW:

2022-11-29T14:51:40.166-0500 7fc3780d5700  1 rgw data changes log: int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1431 preparing push: remaining=14 batch=0 retries=0 tid=391 oid= data_log.3./hjk7fsvjH2.2878
2022-11-29T14:51:40.166-0500 7fc3780d5700  1 rgw data changes log: int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1159 entering: tid=391 oid= data_log.3./hjk7fsvjH2.2878
2022-11-29T14:54:44.423-0500 7fbf657ff700  1 req 2237089939312786455 0.020000139s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1431 preparing push: remaining=1 batch=0 retries=0 tid=394 oid= data_log.3./hjk7fsvjH2.2878
2022-11-29T14:54:44.423-0500 7fbf657ff700  1 req 2237089939312786455 0.020000139s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque<ceph::buffer::v15_2_0::list>&, uint64_t, optional_yield):1159 entering: tid=394 oid= data_log.3./hjk7fsvjH2.2878
2022-11-29T14:54:44.431-0500 7fbf657ff700  1 req 2237089939312786455 0.028000196s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1479 push_entries failed: r=-5 tid = 394 oid= data_log.3./hjk7fsvjH2.2878

If we look at the timestamps, part_num 2879 for data_log.3 was created at 2022-11-29T14:51:48.913-0500 by the active RGW instance. Since the stale RGW had not done any I/O to that particular data_log.3, it still held the previous part_num 2878. The stale RGW then received a push request at 2022-11-29T14:54:44.423-0500 (after the new part_num had already been created by the active RGW), and it received EIO (r=-5).
Since the current code path only retries on ERANGE, the datalog is not updated on EIO and the objects are therefore not synced to the secondary.
We could not determine why reading the header should fail with the error "End of buffer [buffer:2]" while reading the stale part_header.
However, to fix the issue we retry after re-reading the metadata on EIO: read_meta updates the stale RGW's info structure with the new part_num, and on retry the entry is pushed to the datalog.
Here is the PR that does read_meta on EIO and then retries pushing.
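The refresh-and-retry behavior described above can be illustrated with a toy model. Everything here (FifoClient, the hard-coded part numbers, push_part returning -EIO for a non-head part) is an illustrative sketch of the idea, not the actual Ceph FIFO code:

```cpp
#include <cassert>
#include <cerrno>

// Toy model: a client caches the head part number; pushing to a part
// that is no longer the head fails with -EIO (as seen in the logs),
// and the fix refreshes the cached metadata and retries once.
struct FifoClient {
  int cached_head = 2878;   // stale part number held by this RGW
  int real_head = 2879;     // head already advanced by the active RGW
  int read_meta_calls = 0;

  // Simulated OSD-side push: only the current head accepts entries.
  int push_part(int part) { return part == real_head ? 0 : -EIO; }

  // Simulated read_meta(): refresh the cached info structure.
  void read_meta() { cached_head = real_head; ++read_meta_calls; }

  // Push with one retry after refreshing metadata on EIO.
  int push() {
    int r = push_part(cached_head);
    if (r == -EIO) {   // previously a hard failure; now refresh + retry
      read_meta();
      r = push_part(cached_head);
    }
    return r;
  }
};
```

With the retry in place, the push that used to fail permanently with r=-5 now succeeds on the second attempt.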

#62 Updated by Adam Emerson 2 months ago

This is very useful, thank you, and might explain why it's happening.

So from your logs, does it seem like the part is getting trimmed/deleted by the active RGW before the stale RGW tries to push to it?

#63 Updated by Adam Emerson 2 months ago

I've pushed a commit that uses assert_exists, then fetches metadata on -ENOENT.

#64 Updated by Krunal Chheda 2 months ago

Adam Emerson wrote:

I've pushed a commit that uses assert_exists, then fetches metadata on -ENOENT.

Thanks for the quick response; we will need to verify that the part was trimmed.
Regarding the ENOENT: with assert_exists in place now, will read_part_header on the trimmed part return ENOENT?
I do not see any changes in cls_fifo.cc -> read_part_header(); it still returns EIO, so where will the ENOENT come from?

#65 Updated by Adam Emerson 2 months ago

Krunal Chheda wrote:

Regarding the ENOENT: with assert_exists in place now, will read_part_header on the trimmed part return ENOENT?
I do not see any changes in cls_fifo.cc -> read_part_header(); it still returns EIO, so where will the ENOENT come from?

I added an assert_exists() to the compound operation containing the push_part CLS call.

This should fail if the part has been deleted, returning -ENOENT, and keep the CLS code for push_part from ever executing (and calling read_part_header()) in the first place.

#66 Updated by Adam Emerson 2 months ago

Specifically in cls_fifo_legacy.cc/push_part()

#67 Updated by Krunal Chheda 2 months ago

Adam Emerson wrote:

Krunal Chheda wrote:

Regarding the ENOENT: with assert_exists in place now, will read_part_header on the trimmed part return ENOENT?
I do not see any changes in cls_fifo.cc -> read_part_header(); it still returns EIO, so where will the ENOENT come from?

I added an assert_exists() to the compound operation containing the push_part CLS call.

This should fail if the part has been deleted, returning -ENOENT, and keep the CLS code for push_part from ever executing (and calling read_part_header()) in the first place.

Thanks once again for the explanation; I was trying to figure out how assert_exists() works without taking any parameters.
But the ENOENT makes complete sense: the code will never go into CLS for the push, which is also better for performance.
Also, I just looked up the datalog part_num for which read_header was failing (immediately after I got the EIO) and I do not see that part_num, so most likely it had been trimmed.

#68 Updated by Adam Emerson 2 months ago

Krunal Chheda wrote:

Thanks once again for the explanation; I was trying to figure out how assert_exists() works without taking any parameters.

`ObjectReadOperation` and `ObjectWriteOperation` are both vectors of ops. The ops themselves don't take an object as a parameter, the object name is sent along with the vector as an argument to the operate function.

(Under the hood `assert_exists()` is just `stat()` but we throw away the size and modification time.)
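The "vector of ops applied to one object, where a failing op aborts the rest" model can be sketched with a small simulation. MockOsd, Op, and operate below are hypothetical stand-ins, not librados types (real ops are RADOS operations, not C++ callables), but they show why an assert_exists placed before the CLS call short-circuits with -ENOENT on a trimmed part:

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <set>
#include <string>
#include <vector>

// An op receives the object name only when the compound operation runs.
using Op = std::function<int(const std::string& oid)>;

struct MockOsd {
  std::set<std::string> objects;  // objects that currently exist

  // Apply ops in order against one object; the first failure aborts
  // the rest, so later ops (e.g. the CLS push_part call) never run.
  int operate(const std::string& oid, const std::vector<Op>& ops) {
    for (const auto& op : ops) {
      int r = op(oid);
      if (r < 0) return r;
    }
    return 0;
  }

  // Modeled as a stat whose results are discarded: fail with -ENOENT
  // if the object (a trimmed FIFO part, say) is gone.
  Op assert_exists() {
    return [this](const std::string& oid) {
      return objects.count(oid) ? 0 : -ENOENT;
    };
  }
};
```

Run against a trimmed part, the compound op returns -ENOENT and the CLS op is never invoked; against an existing part, it proceeds normally.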

#69 Updated by Krunal Chheda 2 months ago

Adam Emerson wrote:

Krunal Chheda wrote:

Thanks once again for the explanation; I was trying to figure out how assert_exists() works without taking any parameters.

`ObjectReadOperation` and `ObjectWriteOperation` are both vectors of ops. The ops themselves don't take an object as a parameter, the object name is sent along with the vector as an argument to the operate function.

(Under the hood `assert_exists()` is just `stat()` but we throw away the size and modification time.)

Yeah, I saw the stat part, but was confused about how it works. From what I understand, it registers a stat handler to check for the object's existence before doing any operations on it. So before calling cls->push_part it checks the existence and returns ENOENT?
Thanks once again for the quick response.
One last thing: in the PR that I posted, I also made a trivial change with respect to the increment of info.version.ver. Currently we increment it separately for the RGW instance and the OSD (CLS) info instance; I made the increment central to both RGW and OSD (inside cls->apply_update()). That consolidates all changes to the info structure into a single function; if we could merge that, it would be cleaner.

#70 Updated by Adam Emerson 2 months ago

I have a more thoroughly cleaned-up and refactored fix at:

Apart from other changes, it collapses identical journal operations rather than rejecting them.

https://github.com/ceph/ceph/pull/49179
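The "collapse identical journal operations rather than rejecting them" idea could be sketched as follows, under the assumption that a journal entry is identified by its (op, part_num) pair. Journal and JournalOp are illustrative names, not the actual Ceph types:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical journal operations, modeled on the ops seen in the logs.
enum class JournalOp { create_part, set_head, remove_part };

struct Journal {
  // The journal is keyed by (op, part_num), so an entry that is
  // already present collapses into the existing one instead of being
  // rejected as a duplicate.
  std::set<std::pair<JournalOp, std::int64_t>> entries;

  void add(JournalOp op, std::int64_t part_num) {
    entries.insert({op, part_num});  // duplicate insert is a no-op
  }
};
```

Two RGW instances racing to journal the same create_part/set_head for the same part thus leave a single entry each, rather than an error.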

#71 Updated by Krunal Chheda 2 months ago

Thanks Adam, we looked at the new PR and see that you are using a vector instead of a multimap, and then doing a find on the vector for the same operation (set_head or create_part) and the same part_num. Could we instead use a hash map with the (operation, part_num) pair as the key, so the find would be faster?
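The pair-keyed map suggested here could be sketched like this. All names are illustrative; note that std::pair has no default std::hash, so a small hash combiner is needed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical journal operations.
enum class JournalOp { create_part, set_head };

// std::unordered_map needs a hash for the (op, part_num) pair key.
struct PairHash {
  std::size_t operator()(const std::pair<JournalOp, std::int64_t>& k) const {
    auto h1 = std::hash<int>()(static_cast<int>(k.first));
    auto h2 = std::hash<std::int64_t>()(k.second);
    return h1 ^ (h2 << 1);  // simple hash combine
  }
};

// Entry stands in for whatever payload a journal entry carries.
struct Entry { std::string part_tag; };

using JournalMap =
    std::unordered_map<std::pair<JournalOp, std::int64_t>, Entry, PairHash>;
```

Lookups by (operation, part_num) are then average O(1), versus a linear scan over a vector.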
Second, I also see you have now removed the TAG completely; you had mentioned earlier that it introduced a regression? Was the regression related to reading an existing RADOS object with the new code, i.e. the decode was failing because the tag was not read as part of the new code? And in this new code you have removed the tag but, for compatibility, still encode/decode a dummy tag? Or was there another regression?

Just an update on testing: we tested the original multimap PR for 2 consecutive days and we do not see any issues w.r.t. object mismatch.
We are going to initiate another run over the weekend with the new changes as well.

#72 Updated by Krunal Chheda about 2 months ago

Hey Adam,
A quick update on the testing done on both of the latest PR commits (the multimap one and the vector one): we ran a 48-hour continuous load test on the Ceph cluster with both fixes and we did NOT see any issue w.r.t. object count. The object count matched after the workload completed; we uploaded more than 100 million S3 objects across the clusters, and all the objects replicated correctly to the secondary site.

We feel confident about the PRs from the test-verification point of view. Let us know when the PR review will complete and when the PRs will merge to the main and multisite-stabilization branches.

#73 Updated by Krunal Chheda about 2 months ago

Hey Adam,
A quick correction to the testing update for the 2nd PR (the one using a vector): we do see a race condition with info.version, resulting in object mismatch.
In the new code, cls_fifo_types.h::apply_update() does not return any value when multiple RGW instances try to insert a new part and set the head.
As a result, both RGW instances apply_update and then increment info.version inside write_header(). This results in multiple increments of the version value for the same part_num across different RGW instances,
and an eventual failure in read_header, where there is a version check, which in turn stops the RGW from writing to the datalog.

  if (objv && !(info->version == *objv)) {
    auto s1 = info->version.to_str();
    auto s2 = objv->to_str();
    CLS_ERR("%s: version mismatch (header=%s, req=%s), canceled operation",
            __PRETTY_FUNCTION__, s1.c_str(), s2.c_str());
    return -ECANCELED;
  }

2022-12-05T16:22:47.672-0500 7f1347d7d700 0 <cls> /root/rpmbuild/BUILD/ceph-18.0.0-0.bb.rgw_multisite_reshard_reef_pr6_quick/src/cls/fifo/cls_fifo.cc:200: int rados::cls::fifo::{anonymous}::read_header(cls_method_context_t, std::optional<rados::cls::fifo::objv>, rados::cls::fifo::info*, bool): version mismatch (header=cUrwB1iNquuapjCh{1362}, req=cUrwB1iNquuapjCh{1361}), canceled operation

2022-12-05T16:22:47.848-0500 7f725e625700 -1 req 3492424974408702267 3.102004051s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):709 canceled too many times, giving up: tid=35330
2022-12-05T16:22:47.848-0500 7f725e625700 -1 req 3492424974408702267 3.102004051s s3:put_obj int rgw::cls::fifo::FIFO::process_journal(const DoutPrefixProvider*, uint64_t, optional_yield):714 failed, r=: -125 tid=35330
2022-12-05T16:22:47.848-0500 7f725e625700 -1 req 3492424974408702267 3.102004051s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_part(const DoutPrefixProvider*, int64_t, bool, uint64_t, optional_yield):738 process_journal failed: r=-125 tid=35330
2022-12-05T16:22:47.848-0500 7f725e625700 -1 req 3492424974408702267 3.102004051s s3:put_obj int rgw::cls::fifo::FIFO::_prepare_new_head(const DoutPrefixProvider*, int64_t, uint64_t, optional_yield):818 _prepare_new_part failed: r=-125 tid=35330
2022-12-05T16:22:47.848-0500 7f725e625700 -1 req 3492424974408702267 3.102004051s s3:put_obj int rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1410 prepare_new_head failed: r=-125 tid=35330
2022-12-05T16:22:47.848-0500 7f725e625700 -1 req 3492424974408702267 3.102004051s s3:put_obj virtual int RGWDataChangesFIFO::push(const DoutPrefixProvider*, int, ceph::real_time, const string&, ceph::buffer::v15_2_0::list&&): unable to push to FIFO: data_log.15: (125) Operation canceled

So the code needs to be updated to handle multiple set_head/create_part operations like the previous PR did: by returning ECANCELED and then not calling write_header().

#74 Updated by Adam Emerson about 2 months ago

Krunal Chheda wrote:

and an eventual failure in read_header, where there is a version check, which in turn stops the RGW from writing to the datalog.

Where does the client receive that error from read_header? We don't seem to be sending a version to get_version except on open.

#75 Updated by Krunal Chheda about 2 months ago

Adam Emerson wrote:

Krunal Chheda wrote:

and an eventual failure in read_header, where there is a version check, which in turn stops the RGW from writing to the datalog.

Where does the client receive that error from read_header? We don't seem to be sending a version to get_version except on open.

cls_fifo_legacy.cc calls update_meta(), which invokes update_meta() in cls_fifo.cc; inside update_meta() it calls read_header(), which checks the version.
RGW passes the version info when calling update_meta().

#76 Updated by Adam Emerson about 2 months ago

All right, but that's supposed to happen.

Why isn't the site with the wrong version simply calling read_info and progressing? Are we missing the retry logic somewhere?

#77 Updated by Krunal Chheda about 2 months ago

Adam Emerson wrote:

All right, but that's supposed to happen.

Why isn't the site with the wrong version simply calling read_info and progressing? Are we missing the retry logic somewhere?

It is retrying, because a version mismatch returns ECANCELED and there is logic to read_meta (and retry) in the event of ECANCELED, but somehow it exceeds the retry limit and then gives up. If you check the log, we are getting

 canceled too many times, giving up:

From what I can understand, there are too many RGW threads/instances updating the version at the same time; every time one re-reads the meta and then retries, a new version has already been pushed by another RGW thread/instance, and after 10 retries it gives up.
That is why I was pointing out that once the update has been made by one thread, the other updates to the same part from other threads/instances are just incrementing the version count; if they instead just re-read the metadata, this issue would not happen.
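The retry behavior described above might look roughly like this toy sketch. The names and the limit of 10 come from the discussion in this thread, not from the actual source:

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical retry limit, per the "after 10 retries it gives up"
// observation above.
constexpr int MAX_RACE_RETRIES = 10;

// On -ECANCELED (version mismatch), re-read the metadata and retry;
// after too many races, give up and surface the error.
int update_with_retries(const std::function<int()>& try_update,
                        const std::function<void()>& read_meta) {
  int r = -ECANCELED;
  for (int i = 0; i < MAX_RACE_RETRIES; ++i) {
    r = try_update();
    if (r != -ECANCELED)
      return r;
    read_meta();  // refresh the cached info, then retry
  }
  return r;  // "canceled too many times, giving up"
}
```

Under heavy contention every retry can lose the race again, which is exactly the "canceled too many times" failure seen in the logs.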

#78 Updated by Adam Emerson about 2 months ago

Please try the new version.

If possible, first without the last commit (`cls/fifo: Add more retries with random backoff`) and if that fails try running with it.

#79 Updated by Krunal Chheda about 2 months ago

Adam Emerson wrote:

Please try the new version.

If possible, first without the last commit (`cls/fifo: Add more retries with random backoff`) and if that fails try running with it.

We already ran the workload without the last commit (more retries with random backoff), and that is where we saw the issue (canceled too many times).
So what do we test without the backoff?

#80 Updated by Adam Emerson about 2 months ago

I added a commit up the line, "cls/fifo: Move version inc into `apply_update()`"

#81 Updated by Krunal Chheda about 2 months ago

Adam Emerson wrote:

I added a commit up the line, "cls/fifo: Move version inc into `apply_update()`"

Oh OK, I see, so you made the change I was suggesting back in my PR (to have the version incremented centrally in one place).
However, I also made a comment on the PR: we should make an additional check in the if condition,

if (update.max_push_part_num() &&
    *update.max_push_part_num() != max_push_part_num) {
  max_push_part_num = *update.max_push_part_num();
  changed = true;
}
and do the same for all of the if conditions that update the `changed` variable.

This ensures that apply_update only reports a change when there actually is one, and otherwise does not write the header. This would be a more bulletproof check, IMO.
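Putting the two suggestions together (a changed flag in apply_update plus a single, conditional version increment) could look roughly like this toy sketch; all names are illustrative stand-ins for the Ceph types, not the actual code:

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <optional>

struct Info {
  std::int64_t max_push_part_num = -1;
  std::uint64_t ver = 0;  // stands in for info.version.ver
};

struct Update {
  std::optional<std::int64_t> max_push_part_num;
};

// Only reports true when something actually changed.
bool apply_update(Info& info, const Update& u) {
  bool changed = false;
  if (u.max_push_part_num &&
      *u.max_push_part_num != info.max_push_part_num) {
    info.max_push_part_num = *u.max_push_part_num;
    changed = true;
  }
  return changed;
}

int update_meta(Info& info, const Update& u) {
  if (!apply_update(info, u))
    return -ECANCELED;  // no-op update: the caller re-reads metadata
  ++info.ver;           // the version is bumped in exactly one place
  // write_header(info) would be issued here
  return 0;
}
```

A racing duplicate of an already-applied update then collapses into ECANCELED instead of inflating info.version, avoiding the version-mismatch spiral described above.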

#82 Updated by Krunal Chheda about 1 month ago

Hey Adam,
An update on the latest testing done on the cleanup PR:
we ran a continuous stress test and everything went well; the object count matched and we did not see any error where the push_part failed with a non-retryable error.
And we did the testing WITHOUT the random back-off changes, so we can get rid of those changes as discussed earlier!
Also, we did the testing on top of the reef multisite-stabilization branch; hopefully the current fix is independent of the main/stabilization branches.

#83 Updated by Adam Emerson 22 days ago

#84 Updated by Adam Emerson 22 days ago

#85 Updated by Adam Emerson 13 days ago

  • Status changed from Fix Under Review to Pending Backport

#86 Updated by Backport Bot 13 days ago

  • Tags changed from multisite sync to multisite sync backport_processed
