Bug #17371
RGW loses realm/period/zonegroup/zone data: period overwritten if somewhere in the cluster is still running Hammer
Status: Closed
Description
Something in RGW is causing the active realm/zonegroup/zone data to be silently lost or overwritten. I thought I was losing my mind, but then I checked scrollback, and found that it DID show the period data set correctly.
Here is the output of period update --commit, from 2016-09-22/00:35:22.020358 UTC.
root@peon5752:/home/rjohnson/dho-config/congress/tmp# radosgw-admin period update --commit
2016-09-22 00:35:16.649594 7f3d250a9900  0 RGWZoneParams::create(): error creating default zone params: (17) File exists
2016-09-22 00:35:18.660110 7f3d250a9900  0 error read_lastest_epoch .rgw.root:periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.latest_epoch
2016-09-22 00:35:22.020358 7f3d250a9900  1 Set the period's master zonegroup default as the default
{
  "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
  "epoch": 1,
  "predecessor_uuid": "fb6c314c-9e34-4273-8779-d0ab16043532",
  "sync_status": ["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
                  "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
                  "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
                  "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""],
  "period_map": {
    "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
    "zonegroups": [
      {
        "id": "default",
        "name": "default",
        "api_name": "us-west-1",
        "is_master": "true",
        "endpoints": ["https:\/\/objects-us-west-1.dream.io", "https:\/\/objects.dreamhost.com"],
        "hostnames": ["objects-us-west-1.dream.io", "objects.dreamhost.com"],
        "hostnames_s3website": ["objects-website-us-west-1.dream.io"],
        "master_zone": "default",
        "zones": [
          {
            "id": "default",
            "name": "default",
            "endpoints": [],
            "log_meta": "true",
            "log_data": "false",
            "bucket_index_max_shards": 31,
            "read_only": "false"
          }
        ],
        "placement_targets": [{"name": "default-placement", "tags": []}],
        "default_placement": "default-placement",
        "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310"
      }
    ],
    "short_zone_ids": [{"key": "default", "val": 2610307010}]
  },
  "master_zonegroup": "default",
  "master_zone": "default",
  "period_config": {
    "bucket_quota": {"enabled": false, "max_size_kb": -1, "max_objects": -1},
    "user_quota": {"enabled": false, "max_size_kb": -1, "max_objects": -1}
  },
  "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310",
  "realm_name": "gold",
  "realm_epoch": 2
}
Here is the output from period list & period get as of 2016-09-22/00:56:13.064602 UTC, still good.
rjohnson@peon5752:~$ sudo radosgw-admin period list
{
  "periods": [
    "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
    "6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging",
    "9e123a4f-5546-4645-b11d-6b21244a5b67",
    "fb6c314c-9e34-4273-8779-d0ab16043532"
  ]
}
rjohnson@peon5752:~$ sudo radosgw-admin period get
{
  "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
  "epoch": 1,
  "predecessor_uuid": "fb6c314c-9e34-4273-8779-d0ab16043532",
  "sync_status": ["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
                  "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
                  "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
                  "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""],
  "period_map": {
    "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
    "zonegroups": [
      {
        "id": "default",
        "name": "default",
        "api_name": "us-west-1",
        "is_master": "true",
        "endpoints": ["https:\/\/objects-us-west-1.dream.io", "https:\/\/objects.dreamhost.com"],
        "hostnames": ["objects-us-west-1.dream.io", "objects.dreamhost.com"],
        "hostnames_s3website": ["objects-website-us-west-1.dream.io"],
        "master_zone": "default",
        "zones": [
          {
            "id": "default",
            "name": "default",
            "endpoints": [],
            "log_meta": "true",
            "log_data": "false",
            "bucket_index_max_shards": 31,
            "read_only": "false"
          }
        ],
        "placement_targets": [{"name": "default-placement", "tags": []}],
        "default_placement": "default-placement",
        "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310"
      }
    ],
    "short_zone_ids": [{"key": "default", "val": 2610307010}]
  },
  "master_zonegroup": "default",
  "master_zone": "default",
  "period_config": {
    "bucket_quota": {"enabled": false, "max_size_kb": -1, "max_objects": -1},
    "user_quota": {"enabled": false, "max_size_kb": -1, "max_objects": -1}
  },
  "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310",
  "realm_name": "gold",
  "realm_epoch": 2
}
Here is the output of the same commands as of 2016-09-22/10:34:08 the same day; now the data is GONE.
root@peon5752:/home/rjohnson# date -u
Thu Sep 22 10:34:08 UTC 2016
root@peon5752:/home/rjohnson# radosgw-admin period list
{
  "periods": [
    "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
    "6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging",
    "9e123a4f-5546-4645-b11d-6b21244a5b67",
    "fb6c314c-9e34-4273-8779-d0ab16043532"
  ]
}
root@peon5752:/home/rjohnson# radosgw-admin period get
{
  "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
  "epoch": 1,
  "predecessor_uuid": "fb6c314c-9e34-4273-8779-d0ab16043532",
  "sync_status": ["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
                  "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
                  "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
                  "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""],
  "period_map": {
    "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
    "zonegroups": [
      {
        "id": "default",
        "name": "default",
        "api_name": "",
        "is_master": "true",
        "endpoints": [],
        "hostnames": [],
        "hostnames_s3website": [],
        "master_zone": "",
        "zones": [
          {
            "id": "default",
            "name": "default",
            "endpoints": [],
            "log_meta": "false",
            "log_data": "false",
            "bucket_index_max_shards": 0,
            "read_only": "false"
          }
        ],
        "placement_targets": [{"name": "default-placement", "tags": []}],
        "default_placement": "default-placement",
        "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310"
      }
    ],
    "short_zone_ids": [{"key": "default", "val": 2610307010}]
  },
  "master_zonegroup": "default",
  "master_zone": "default",
  "period_config": {
    "bucket_quota": {"enabled": false, "max_size_kb": -1, "max_objects": -1},
    "user_quota": {"enabled": false, "max_size_kb": -1, "max_objects": -1}
  },
  "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310",
  "realm_name": "gold",
  "realm_epoch": 2
}
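Comparing the two `period get` dumps makes the damage concrete: the zonegroup and zone fields were reset to empty/zero defaults. A minimal sketch of such a comparison (field values abbreviated from the two dumps; `diff_fields` is just an illustrative helper, not a radosgw-admin feature):

```python
def diff_fields(before, after):
    """Return {key: (old, new)} for keys whose values differ."""
    return {k: (before[k], after[k])
            for k in before if before[k] != after[k]}

# Field values abbreviated from the good (00:56) and bad (10:34) dumps above.
before = {
    "api_name": "us-west-1",
    "hostnames": ["objects-us-west-1.dream.io", "objects.dreamhost.com"],
    "master_zone": "default",
    "log_meta": "true",
    "bucket_index_max_shards": 31,
}
after = {
    "api_name": "",
    "hostnames": [],
    "master_zone": "",
    "log_meta": "false",
    "bucket_index_max_shards": 0,
}

wiped = diff_fields(before, after)
# Every one of these fields was reset to an empty/zero default.
```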
Updated by Robin Johnson over 7 years ago
Running stat on the contents of the .rgw.root pool shows some interesting mtimes that correspond to the times above.
.rgw.root/realms.6ed37b4b-66ba-4a6a-a464-80d7490cb310.control mtime 2016-09-22 00:33:48.000000, size 0
.rgw.root/periods.fb6c314c-9e34-4273-8779-d0ab16043532.latest_epoch mtime 2016-09-22 00:33:48.000000, size 10
.rgw.root/periods.fb6c314c-9e34-4273-8779-d0ab16043532.1 mtime 2016-09-22 00:33:48.000000, size 228
.rgw.root/default.realm mtime 2016-09-22 00:33:48.000000, size 46
.rgw.root/realms_names.gold mtime 2016-09-22 00:33:48.000000, size 46
.rgw.root/periods.9e123a4f-5546-4645-b11d-6b21244a5b67.latest_epoch mtime 2016-09-22 00:34:05.000000, size 10
.rgw.root/periods.9e123a4f-5546-4645-b11d-6b21244a5b67.1 mtime 2016-09-22 00:34:05.000000, size 228
.rgw.root/default.zone.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 00:34:39.000000, size 17
.rgw.root/periods.6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging.latest_epoch mtime 2016-09-22 00:35:02.000000, size 10
.rgw.root/periods.6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging mtime 2016-09-22 00:35:16.000000, size 736
.rgw.root/periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.latest_epoch mtime 2016-09-22 00:35:18.000000, size 10
.rgw.root/realms.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 00:35:18.000000, size 104
.rgw.root/default.zonegroup.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 10:22:00.000000, size 17
.rgw.root/zonegroups_names.default mtime 2016-09-22 10:22:00.000000, size 17
.rgw.root/zone_names.default mtime 2016-09-22 10:22:00.000000, size 17
.rgw.root/periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.1 mtime 2016-09-22 10:22:00.000000, size 804
.rgw.root/zone_info.default mtime 2016-09-22 10:22:26.000000, size 713
.rgw.root/zonegroup_info.default mtime 2016-09-22 10:22:36.000000, size 244
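For listings like this, the raw `rados stat` lines can be parsed and sorted by mtime so the most recently rewritten objects stand out. A minimal sketch, assuming the `<pool>/<object> mtime <timestamp>, size <n>` line format shown above (the exact format may vary between Ceph releases):

```python
import re

# Matches lines like:
#   .rgw.root/zone_info.default mtime 2016-09-22 10:22:26.000000, size 713
STAT_LINE = re.compile(
    r"^(?P<obj>\S+) mtime "
    r"(?P<mtime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+), "
    r"size (?P<size>\d+)$"
)

def parse_stat(lines):
    """Parse `rados stat` output lines into (object, mtime, size) tuples."""
    rows = []
    for line in lines:
        m = STAT_LINE.match(line.strip())
        if m:
            rows.append((m["obj"], m["mtime"], int(m["size"])))
    return rows

def newest_first(rows):
    # ISO-style timestamps sort correctly as plain strings.
    return sorted(rows, key=lambda r: r[1], reverse=True)
```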
Also, while the RGW nodes are running 10.2.3, some of the other nodes are running 10.2.2, so I wonder if this is related to #16627 or #17051.
Updated by Robin Johnson over 7 years ago
Also, I took a backup of .rgw.root after the last time the data was set correctly; here are its mtimes to compare.
# rados -p .rgw.root.backup-20160922T003618Z-WORKING ls | xargs -n1 rados -p .rgw.root.backup-20160922T003618Z-WORKING stat | sort -k +4 -n
.rgw.root.backup-20160922T003618Z-WORKING/default.realm mtime 2016-09-22 00:36:58.000000, size 46
.rgw.root.backup-20160922T003618Z-WORKING/default.zone.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 00:36:44.000000, size 17
.rgw.root.backup-20160922T003618Z-WORKING/default.zonegroup.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 00:36:48.000000, size 17
.rgw.root.backup-20160922T003618Z-WORKING/periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.1 mtime 2016-09-22 00:36:48.000000, size 984
.rgw.root.backup-20160922T003618Z-WORKING/periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.latest_epoch mtime 2016-09-22 00:36:59.000000, size 10
.rgw.root.backup-20160922T003618Z-WORKING/periods.6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging.latest_epoch mtime 2016-09-22 00:36:48.000000, size 10
.rgw.root.backup-20160922T003618Z-WORKING/periods.6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging mtime 2016-09-22 00:36:44.000000, size 736
.rgw.root.backup-20160922T003618Z-WORKING/periods.9e123a4f-5546-4645-b11d-6b21244a5b67.1 mtime 2016-09-22 00:36:48.000000, size 228
.rgw.root.backup-20160922T003618Z-WORKING/periods.9e123a4f-5546-4645-b11d-6b21244a5b67.latest_epoch mtime 2016-09-22 00:36:59.000000, size 10
.rgw.root.backup-20160922T003618Z-WORKING/periods.fb6c314c-9e34-4273-8779-d0ab16043532.1 mtime 2016-09-22 00:36:58.000000, size 228
.rgw.root.backup-20160922T003618Z-WORKING/periods.fb6c314c-9e34-4273-8779-d0ab16043532.latest_epoch mtime 2016-09-22 00:36:59.000000, size 10
.rgw.root.backup-20160922T003618Z-WORKING/realms.6ed37b4b-66ba-4a6a-a464-80d7490cb310.control mtime 2016-09-22 00:36:59.000000, size 0
.rgw.root.backup-20160922T003618Z-WORKING/realms.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 00:36:42.000000, size 104
.rgw.root.backup-20160922T003618Z-WORKING/realms_names.gold mtime 2016-09-22 00:36:59.000000, size 46
.rgw.root.backup-20160922T003618Z-WORKING/zonegroup_info.default mtime 2016-09-22 00:36:44.000000, size 417
.rgw.root.backup-20160922T003618Z-WORKING/zonegroups_names.default mtime 2016-09-22 00:36:59.000000, size 17
.rgw.root.backup-20160922T003618Z-WORKING/zone_info.default mtime 2016-09-22 00:36:59.000000, size 885
.rgw.root.backup-20160922T003618Z-WORKING/zone_names.default mtime 2016-09-22 00:36:59.000000, size 17
Something is coming along and rewriting the period more often than it should, trashing the old data in the process. It's also very hard to reset back to normal; for now I'm copying it back from the backup copy of the pool.
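There is no single built-in command for the pool-level restore described here; one approach is to copy each object back with `rados get` / `rados put`. A sketch that only builds the command argv lists (actually executing them, e.g. via `subprocess.run`, requires a live cluster and is left out; the pool name follows the backup listing above, and `restore_cmds` is an invented helper):

```python
def restore_cmds(backup_pool, live_pool, object_names, tmp_dir="/tmp"):
    """Build rados command argv lists that copy objects from a backup
    pool back into the live pool, via one temp file per object."""
    cmds = []
    for name in object_names:
        tmp = f"{tmp_dir}/{name.replace('/', '_')}"
        cmds.append(["rados", "-p", backup_pool, "get", name, tmp])
        cmds.append(["rados", "-p", live_pool, "put", name, tmp])
    return cmds

cmds = restore_cmds(".rgw.root.backup-20160922T003618Z-WORKING", ".rgw.root",
                    ["periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.1"])
# Two commands per object: one get from the backup, one put to the live pool.
```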
Updated by Robin Johnson over 7 years ago
- Subject changed from RGW loses realm/period/zonegroup/zone data: period overwritten? to RGW loses realm/period/zonegroup/zone data: period overwritten if somewhere in the cluster is still running Hammer
- Priority changed from Urgent to Normal
- Severity changed from 1 - critical to 2 - major
- Release set to jewel
OK, I traced where it is coming from, and I'd like to explicitly thank Orit for the suggestion.
It comes from two different but related places; what they have in common is that both are Hammer-era copies of radosgw-admin.
New runs of radosgw-admin, as well as long-running instances of it (in our case, a user rm that had been running for more than 30 days), can cause the region objects to be written into the .rgw.root pool again; something else then comes along and converts them, overwriting the new period data in the process.
I think the Jewel RGW runs should have IGNORED the region objects when they detect that a conversion has already taken place, and issued a non-fatal warning about the region data being present.
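The suggested behaviour can be sketched as decision logic. This is purely illustrative Python, not RGW's actual code (which is C++); the function and flag names here are invented for the sketch:

```python
def region_conversion_action(conversion_done, region_objects_present, warn):
    """Decide what to do with Hammer-era region objects in .rgw.root.

    If the realm/period conversion has already happened, stray region
    objects (e.g. rewritten by an old Hammer radosgw-admin) should be
    ignored with a non-fatal warning, not converted over the new
    period data.
    """
    if not region_objects_present:
        return "nothing-to-do"
    if conversion_done:
        warn("Hammer-era region objects present after conversion; ignoring")
        return "ignore"
    return "convert"
```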
Updated by Orit Wasserman over 7 years ago
- Status changed from New to Pending Backport
Updated by Loïc Dachary over 7 years ago
- Copied to Backport #17576: jewel: RGW loses realm/period/zonegroup/zone data: period overwritten if somewhere in the cluster is still running Hammer added
Updated by Loïc Dachary over 7 years ago
Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to Resolved