Bug #55465
opentest/rgw-multi: test_datalog_autotrim fails.
Description
In my teuthology run of the rgw/multisite tasks (a three-zone test), I found that test_datalog_autotrim fails.
2022-04-21T13:23:34.933 INFO:tasks.rgw_multisite_tests:FAIL: tasks.rgw_multi.tests.test_datalog_autotrim
2022-04-21T13:23:34.934 INFO:tasks.rgw_multisite_tests:----------------------------------------------------------------------
2022-04-21T13:23:34.934 INFO:tasks.rgw_multisite_tests:Traceback (most recent call last):
2022-04-21T13:23:34.934 INFO:tasks.rgw_multisite_tests: File "/home/teuthology/src/teuthology-master/virtualenv/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
2022-04-21T13:23:34.934 INFO:tasks.rgw_multisite_tests: self.test(self.arg)
2022-04-21T13:23:34.934 INFO:tasks.rgw_multisite_tests: File "/home/teuthology/src/ceph/qa/tasks/rgw_multi/tests.py", line 998, in test_datalog_autotrim
2022-04-21T13:23:34.934 INFO:tasks.rgw_multisite_tests: assert before_trim < after_trim, "any datalog entries must be newer than trim"
2022-04-21T13:23:34.934 INFO:tasks.rgw_multisite_tests:AssertionError: any datalog entries must be newer than trim
So I set debug_rgw=20 to reproduce it.

Brief log:
2022-04-24T19:39:21.853 DEBUG:teuthology.orchestra.run.plana002:> adjust-ulimits ceph-coverage /mnt/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster c1 datalog status --cluster c1 --debug-rgw 0 --debug-ms 0 --rgw-cache-enabled false
2022-04-24T19:39:22.327 INFO:teuthology.orchestra.run.plana002.stdout:[{"marker":"","last_update":"0.000000"},{"marker":"","last_update":"0.000000"},{"marker":"1_1650800358.088348_685.1","last_update":"2022-04-24 11:39:18.088348Z"},{"marker":"","last_update":"0.000000"}]
2022-04-24T19:39:22.328 INFO:tasks.util.rgw:rgwadmin: c1.client.0 : ['datalog', 'autotrim', '--cluster', 'c1', '--debug-rgw', '20', '--debug-ms', '0']
2022-04-24T19:39:22.328 DEBUG:tasks.util.rgw:rgwadmin: cmd=['adjust-ulimits', 'ceph-coverage', '/mnt/cephtest/archive/coverage', 'radosgw-admin', '--log-to-stderr', '--format', 'json', '-n', 'client.0', '--cluster', 'c1', 'datalog', 'autotrim', '--cluster', 'c1', '--debug-rgw', '20', '--debug-ms', '0']
2022-04-24T19:39:22.328 DEBUG:teuthology.orchestra.run.plana002:> adjust-ulimits ceph-coverage /mnt/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster c1 datalog autotrim --cluster c1 --debug-rgw 20 --debug-ms 0
2022-04-24T19:39:22.751 INFO:teuthology.orchestra.run.plana002.stderr:2022-04-24 19:39:22.741246 7f9a1a09ef80 10 data trim: trimming log shards
2022-04-24T19:39:22.751 INFO:teuthology.orchestra.run.plana002.stderr:2022-04-24 19:39:22.741248 7f9a1a09ef80 10 data trim: trimming log shard 2 at marker=1_1650800351.661341_677.1 last_trim=
2022-04-24T19:39:22.793 DEBUG:teuthology.orchestra.run.plana002:> adjust-ulimits ceph-coverage /mnt/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster c1 datalog list --shard-id 2 --max-entries 1 --cluster c1 --debug-rgw 0 --debug-ms 0 --rgw-cache-enabled false
2022-04-24T19:39:23.261 INFO:teuthology.orchestra.run.plana002.stdout:[{"entity_type":"bucket","key":"wshhld-47:af2b0b0d-db02-449c-83d3-071cd0e2d80d.6525.4","timestamp":"2022-04-24 11:39:18.088348Z"}]
2022-04-24T19:39:23.262 INFO:tasks.rgw_multisite_tests:tasks.rgw_multi.tests.test_datalog_autotrim ... FAIL.
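The mismatch is visible in the log above: shard 2 is trimmed at marker 1_1650800351.661341_677.1, while the newest entry on that shard has timestamp 2022-04-24 11:39:18.088348Z (epoch 1650800358.088348, per the datalog status output). A minimal sketch of that comparison, assuming the second underscore-separated field of a v1 datalog marker is an epoch timestamp (an inference from comparing the markers and timestamps in this log, not a documented format guarantee):

```python
# Hypothetical helper: extract the epoch seconds embedded in a v1
# datalog marker like "1_1650800351.661341_677.1". The assumption that
# the second field is a Unix timestamp comes from matching markers
# against entry timestamps in the log above.
def marker_epoch(marker):
    return float(marker.split('_')[1])

trim_marker = '1_1650800351.661341_677.1'  # from "trimming log shard 2"
newest_entry = 1650800358.088348           # 2022-04-24 11:39:18.088348Z

# The trim point is older than the newest entry, so the trim did not
# reach the latest position.
print(marker_epoch(trim_marker) < newest_entry)  # True
```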
I found that the trim is not based on the latest position. So I added some code to verify my guess.
Code (my debugging additions are marked with "# added"):

    def test_datalog_autotrim():
        zonegroup = realm.master_zonegroup()
        zonegroup_conns = ZonegroupConns(zonegroup)
        buckets, zone_bucket = create_bucket_per_zone(zonegroup_conns)

        # upload an object to each zone to generate a datalog entry
        for zone, bucket in zone_bucket:
            k = new_key(zone, bucket.name, 'key')
            k.set_contents_from_string('body')

        # wait for metadata and data sync to catch up
        zonegroup_meta_checkpoint(zonegroup)
        zonegroup_data_checkpoint(zonegroup_conns)

        # trim each datalog
        for zone, _ in zone_bucket:
            # read max markers for each shard
            status = datalog_status(zone.zone)

            datalog_autotrim(zone.zone)

            for shard_id, shard_status in enumerate(status):
                try:
                    before_trim = dateutil.parser.isoparse(shard_status['last_update'])
                except:  # empty timestamps look like "0.000000" and will fail here
                    continue
                entries = datalog_list(zone.zone, ['--shard-id', str(shard_id), '--max-entries', '1'])
                if not len(entries):
                    continue
                after_trim = dateutil.parser.isoparse(entries[0]['timestamp'])
                if before_trim >= after_trim:                       # added
                    log.info('zone=%s needs data sync', zone.name)  # added
                    zonegroup_data_checkpoint(zonegroup_conns)      # added
                    datalog_autotrim(zone.zone)                     # added
                    datalog_list(zone.zone, ['--shard-id', str(shard_id), '--max-entries', '1'])  # added
                assert before_trim < after_trim, "any datalog entries must be newer than trim"
See the attached 1.log.

Result:
Executing the first zonegroup_data_checkpoint does not guarantee that all zones have finished syncing by the time it returns.

As shown in the attached screenshot: zonegroup_data_checkpoint starts at stage 1, where it finds the B and C datalogs caught up, then waits at stage 2 for the A datalog to catch up. But all of the datalogs are only caught up together at stage 3.
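In other words, because zones are checked one at a time, a zone that was "caught up" early in the pass can receive new entries before the pass finishes. A hypothetical sketch of one mitigation (this is not the real zonegroup_data_checkpoint, and is_caught_up is an invented stand-in for the per-zone check): keep polling until a single pass sees every zone caught up at once, which is what re-running the checkpoint before retrying the trim effectively achieves.

```python
import time

# Hedged sketch: succeed only when one full pass over all zones reports
# "caught up" for every zone, rather than accepting per-zone results
# gathered at different times. Even a single pass is only a snapshot,
# so callers may still need to re-run it before trimming.
def checkpoint_all(zones, is_caught_up, timeout=300, interval=1):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(is_caught_up(z) for z in zones):
            return True  # every zone reported caught up in this pass
        time.sleep(interval)
    return False  # timed out before a fully caught-up pass
```

For example, checkpoint_all(['A', 'B', 'C'], lambda z: True) returns True on the first pass, while a zone that never catches up makes it time out and return False.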
Files
Updated by ren weiguo about 2 years ago
https://github.com/ceph/ceph/pull/46048
Sorry, I can't modify the format, so it looks bad.