Bug #17371

RGW loses realm/period/zonegroup/zone data: period overwritten if somewhere in the cluster is still running Hammer

Added by Robin Johnson over 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Target version: -
Start date: 09/22/2016
Due date:
% Done: 0%
Source: Community (dev)
Tags:
Backport: jewel
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Something in RGW is causing the active realm/zonegroup/zone data to be silently lost or overwritten. I thought I was losing my mind, but then I checked scrollback, and found that it DID show the period data set correctly.

Here is the output of period update --commit, from 2016-09-22/00:35:22.020358 UTC.

root@peon5752:/home/rjohnson/dho-config/congress/tmp# radosgw-admin period update --commit
2016-09-22 00:35:16.649594 7f3d250a9900  0 RGWZoneParams::create(): error creating default zone params: (17) File exists
2016-09-22 00:35:18.660110 7f3d250a9900  0 error read_lastest_epoch .rgw.root:periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.latest_epoch
2016-09-22 00:35:22.020358 7f3d250a9900  1 Set the period's master zonegroup default as the default
{
    "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
    "epoch": 1,
    "predecessor_uuid": "fb6c314c-9e34-4273-8779-d0ab16043532",
    "sync_status": [
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "" 
    ],
    "period_map": {
        "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
        "zonegroups": [
            {
                "id": "default",
                "name": "default",
                "api_name": "us-west-1",
                "is_master": "true",
                "endpoints": [
                    "https:\/\/objects-us-west-1.dream.io",
                    "https:\/\/objects.dreamhost.com" 
                ],
                "hostnames": [
                    "objects-us-west-1.dream.io",
                    "objects.dreamhost.com" 
                ],
                "hostnames_s3website": [
                    "objects-website-us-west-1.dream.io" 
                ],
                "master_zone": "default",
                "zones": [
                    {
                        "id": "default",
                        "name": "default",
                        "endpoints": [],
                        "log_meta": "true",
                        "log_data": "false",
                        "bucket_index_max_shards": 31,
                        "read_only": "false" 
                    }
                ],
                "placement_targets": [
                    {
                        "name": "default-placement",
                        "tags": []
                    }
                ],
                "default_placement": "default-placement",
                "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310" 
            }
        ],
        "short_zone_ids": [
            {
                "key": "default",
                "val": 2610307010
            }
        ]
    },
    "master_zonegroup": "default",
    "master_zone": "default",
    "period_config": {
        "bucket_quota": {
            "enabled": false,
            "max_size_kb": -1,
            "max_objects": -1
        },
        "user_quota": {
            "enabled": false,
            "max_size_kb": -1,
            "max_objects": -1
        }
    },
    "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310",
    "realm_name": "gold",
    "realm_epoch": 2
}

Here is the output from period list & period get as of 2016-09-22/00:56:13.064602 UTC, still good.

rjohnson@peon5752:~$ sudo  radosgw-admin period list
{
    "periods": [
        "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
        "6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging",
        "9e123a4f-5546-4645-b11d-6b21244a5b67",
        "fb6c314c-9e34-4273-8779-d0ab16043532" 
    ]
}

rjohnson@peon5752:~$ sudo  radosgw-admin period get
{
    "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
    "epoch": 1,
    "predecessor_uuid": "fb6c314c-9e34-4273-8779-d0ab16043532",
    "sync_status": [
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "" 
    ],
    "period_map": {
        "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
        "zonegroups": [
            {
                "id": "default",
                "name": "default",
                "api_name": "us-west-1",
                "is_master": "true",
                "endpoints": [
                    "https:\/\/objects-us-west-1.dream.io",
                    "https:\/\/objects.dreamhost.com" 
                ],
                "hostnames": [
                    "objects-us-west-1.dream.io",
                    "objects.dreamhost.com" 
                ],
                "hostnames_s3website": [
                    "objects-website-us-west-1.dream.io" 
                ],
                "master_zone": "default",
                "zones": [
                    {
                        "id": "default",
                        "name": "default",
                        "endpoints": [],
                        "log_meta": "true",
                        "log_data": "false",
                        "bucket_index_max_shards": 31,
                        "read_only": "false" 
                    }
                ],
                "placement_targets": [
                    {
                        "name": "default-placement",
                        "tags": []
                    }
                ],
                "default_placement": "default-placement",
                "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310" 
            }
        ],
        "short_zone_ids": [
            {
                "key": "default",
                "val": 2610307010
            }
        ]
    },
    "master_zonegroup": "default",
    "master_zone": "default",
    "period_config": {
        "bucket_quota": {
            "enabled": false,
            "max_size_kb": -1,
            "max_objects": -1
        },
        "user_quota": {
            "enabled": false,
            "max_size_kb": -1,
            "max_objects": -1
        }
    },
    "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310",
    "realm_name": "gold",
    "realm_epoch": 2
}

Followed by the output of the same commands at 2016-09-22/10:34:08 UTC the same day: now it's GONE.

root@peon5752:/home/rjohnson#  date -u
Thu Sep 22 10:34:08 UTC 2016
root@peon5752:/home/rjohnson# radosgw-admin period list
{
    "periods": [
        "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
        "6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging",
        "9e123a4f-5546-4645-b11d-6b21244a5b67",
        "fb6c314c-9e34-4273-8779-d0ab16043532" 
    ]
}

root@peon5752:/home/rjohnson# radosgw-admin period get
{
    "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
    "epoch": 1,
    "predecessor_uuid": "fb6c314c-9e34-4273-8779-d0ab16043532",
    "sync_status": [
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "" 
    ],
    "period_map": {
        "id": "68a89bcc-6769-498e-9da1-91fe9bfb72e5",
        "zonegroups": [
            {
                "id": "default",
                "name": "default",
                "api_name": "",
                "is_master": "true",
                "endpoints": [],
                "hostnames": [],
                "hostnames_s3website": [],
                "master_zone": "",
                "zones": [
                    {
                        "id": "default",
                        "name": "default",
                        "endpoints": [],
                        "log_meta": "false",
                        "log_data": "false",
                        "bucket_index_max_shards": 0,
                        "read_only": "false" 
                    }
                ],
                "placement_targets": [
                    {
                        "name": "default-placement",
                        "tags": []
                    }
                ],
                "default_placement": "default-placement",
                "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310" 
            }
        ],
        "short_zone_ids": [
            {
                "key": "default",
                "val": 2610307010
            }
        ]
    },
    "master_zonegroup": "default",
    "master_zone": "default",
    "period_config": {
        "bucket_quota": {
            "enabled": false,
            "max_size_kb": -1,
            "max_objects": -1
        },
        "user_quota": {
            "enabled": false,
            "max_size_kb": -1,
            "max_objects": -1
        }
    },
    "realm_id": "6ed37b4b-66ba-4a6a-a464-80d7490cb310",
    "realm_name": "gold",
    "realm_epoch": 2
}
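
To make the difference between the good and the bad `period get` output easier to see, here is a minimal Python sketch that walks two JSON documents and reports the leaves that differ. The `diff_json` helper and the `GOOD`/`BAD` names are illustrative, not part of any Ceph tool, and the sample data is a trimmed copy of the zonegroup entries from the outputs above.

```python
def diff_json(before, after, path=""):
    """Yield (path, before_value, after_value) for every value that differs."""
    if isinstance(before, dict) and isinstance(after, dict):
        for key in sorted(set(before) | set(after)):
            child = f"{path}.{key}" if path else key
            yield from diff_json(before.get(key), after.get(key), child)
    elif (isinstance(before, list) and isinstance(after, list)
          and len(before) == len(after)):
        for i, (b, a) in enumerate(zip(before, after)):
            yield from diff_json(b, a, f"{path}[{i}]")
    elif before != after:
        # Scalars, or lists of different length, are reported as a whole.
        yield path, before, after


# Trimmed zonegroup entry from the 00:56 UTC output (still good).
GOOD = {
    "api_name": "us-west-1",
    "endpoints": ["https://objects-us-west-1.dream.io",
                  "https://objects.dreamhost.com"],
    "hostnames": ["objects-us-west-1.dream.io", "objects.dreamhost.com"],
    "hostnames_s3website": ["objects-website-us-west-1.dream.io"],
    "master_zone": "default",
    "zones": [{"id": "default", "log_meta": "true", "log_data": "false",
               "bucket_index_max_shards": 31}],
}

# The same entry from the 10:34 UTC output (data lost).
BAD = {
    "api_name": "",
    "endpoints": [],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "",
    "zones": [{"id": "default", "log_meta": "false", "log_data": "false",
               "bucket_index_max_shards": 0}],
}

changes = list(diff_json(GOOD, BAD))
for path, before, after in changes:
    print(f"{path}: {before!r} -> {after!r}")
```

In practice the two inputs would be loaded with `json.load` from saved copies of the `radosgw-admin period get` output rather than typed in by hand.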

Related issues

Copied to rgw - Backport #17576: jewel: RGW loses realm/period/zonegroup/zone data: period overwritten if somewhere in the cluster is still running Hammer Resolved

History

#1 Updated by Robin Johnson over 2 years ago

stat on the contents of the .rgw.root pool shows some interesting mtimes that correspond with the times above.

.rgw.root/realms.6ed37b4b-66ba-4a6a-a464-80d7490cb310.control mtime 2016-09-22 00:33:48.000000, size 0
.rgw.root/periods.fb6c314c-9e34-4273-8779-d0ab16043532.latest_epoch mtime 2016-09-22 00:33:48.000000, size 10
.rgw.root/periods.fb6c314c-9e34-4273-8779-d0ab16043532.1 mtime 2016-09-22 00:33:48.000000, size 228
.rgw.root/default.realm mtime 2016-09-22 00:33:48.000000, size 46
.rgw.root/realms_names.gold mtime 2016-09-22 00:33:48.000000, size 46
.rgw.root/periods.9e123a4f-5546-4645-b11d-6b21244a5b67.latest_epoch mtime 2016-09-22 00:34:05.000000, size 10
.rgw.root/periods.9e123a4f-5546-4645-b11d-6b21244a5b67.1 mtime 2016-09-22 00:34:05.000000, size 228
.rgw.root/default.zone.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 00:34:39.000000, size 17
.rgw.root/periods.6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging.latest_epoch mtime 2016-09-22 00:35:02.000000, size 10
.rgw.root/periods.6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging mtime 2016-09-22 00:35:16.000000, size 736
.rgw.root/periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.latest_epoch mtime 2016-09-22 00:35:18.000000, size 10
.rgw.root/realms.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 00:35:18.000000, size 104
.rgw.root/default.zonegroup.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 10:22:00.000000, size 17
.rgw.root/zonegroups_names.default mtime 2016-09-22 10:22:00.000000, size 17
.rgw.root/zone_names.default mtime 2016-09-22 10:22:00.000000, size 17
.rgw.root/periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.1 mtime 2016-09-22 10:22:00.000000, size 804
.rgw.root/zone_info.default mtime 2016-09-22 10:22:26.000000, size 713
.rgw.root/zonegroup_info.default mtime 2016-09-22 10:22:36.000000, size 244

Also, while the RGW nodes are running 10.2.3, some of the other nodes are running 10.2.2, so I wonder if this is related to #16627 or #17051.

#2 Updated by Robin Johnson over 2 years ago

Also, I took a backup of .rgw.root after the last time it was set, and here are the mtimes to compare.

# rados -p .rgw.root.backup-20160922T003618Z-WORKING ls |xargs -n1 rados -p .rgw.root.backup-20160922T003618Z-WORKING stat |sort -k +4 -n
.rgw.root.backup-20160922T003618Z-WORKING/default.realm mtime 2016-09-22 00:36:58.000000, size 46
.rgw.root.backup-20160922T003618Z-WORKING/default.zone.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 00:36:44.000000, size 17
.rgw.root.backup-20160922T003618Z-WORKING/default.zonegroup.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 00:36:48.000000, size 17
.rgw.root.backup-20160922T003618Z-WORKING/periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.1 mtime 2016-09-22 00:36:48.000000, size 984
.rgw.root.backup-20160922T003618Z-WORKING/periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.latest_epoch mtime 2016-09-22 00:36:59.000000, size 10
.rgw.root.backup-20160922T003618Z-WORKING/periods.6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging.latest_epoch mtime 2016-09-22 00:36:48.000000, size 10
.rgw.root.backup-20160922T003618Z-WORKING/periods.6ed37b4b-66ba-4a6a-a464-80d7490cb310:staging mtime 2016-09-22 00:36:44.000000, size 736
.rgw.root.backup-20160922T003618Z-WORKING/periods.9e123a4f-5546-4645-b11d-6b21244a5b67.1 mtime 2016-09-22 00:36:48.000000, size 228
.rgw.root.backup-20160922T003618Z-WORKING/periods.9e123a4f-5546-4645-b11d-6b21244a5b67.latest_epoch mtime 2016-09-22 00:36:59.000000, size 10
.rgw.root.backup-20160922T003618Z-WORKING/periods.fb6c314c-9e34-4273-8779-d0ab16043532.1 mtime 2016-09-22 00:36:58.000000, size 228
.rgw.root.backup-20160922T003618Z-WORKING/periods.fb6c314c-9e34-4273-8779-d0ab16043532.latest_epoch mtime 2016-09-22 00:36:59.000000, size 10
.rgw.root.backup-20160922T003618Z-WORKING/realms.6ed37b4b-66ba-4a6a-a464-80d7490cb310.control mtime 2016-09-22 00:36:59.000000, size 0
.rgw.root.backup-20160922T003618Z-WORKING/realms.6ed37b4b-66ba-4a6a-a464-80d7490cb310 mtime 2016-09-22 00:36:42.000000, size 104
.rgw.root.backup-20160922T003618Z-WORKING/realms_names.gold mtime 2016-09-22 00:36:59.000000, size 46
.rgw.root.backup-20160922T003618Z-WORKING/zonegroup_info.default mtime 2016-09-22 00:36:44.000000, size 417
.rgw.root.backup-20160922T003618Z-WORKING/zonegroups_names.default mtime 2016-09-22 00:36:59.000000, size 17
.rgw.root.backup-20160922T003618Z-WORKING/zone_info.default mtime 2016-09-22 00:36:59.000000, size 885
.rgw.root.backup-20160922T003618Z-WORKING/zone_names.default mtime 2016-09-22 00:36:59.000000, size 17

Something is coming along and rewriting the period more often than it should, trashing the old data. It's also very hard to reset it back normally. For now I'm copying it back from the backup copy of the pool.
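
The size changes between the backup and the live pool can be pulled out mechanically. The following is a rough Python sketch, assuming the `rados stat` line format shown in the listings above; `parse_stat` and `size_changes` are hypothetical helper names, and the sample text is a subset of the lines already posted in this ticket.

```python
import re

# Matches lines such as:
#   .rgw.root/zone_info.default mtime 2016-09-22 10:22:26.000000, size 713
STAT_LINE = re.compile(
    r"^(?P<pool>[^/]+)/(?P<name>\S+) mtime (?P<mtime>\S+ \S+), size (?P<size>\d+)$"
)


def parse_stat(text):
    """Return {object_name: size} for every stat line found in text."""
    sizes = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            sizes[match.group("name")] = int(match.group("size"))
    return sizes


def size_changes(backup, live):
    """Return {name: (backup_size, live_size)} for objects whose size differs."""
    common = sorted(backup.keys() & live.keys())
    return {n: (backup[n], live[n]) for n in common if backup[n] != live[n]}


BACKUP = """
.rgw.root.backup-20160922T003618Z-WORKING/periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.1 mtime 2016-09-22 00:36:48.000000, size 984
.rgw.root.backup-20160922T003618Z-WORKING/zonegroup_info.default mtime 2016-09-22 00:36:44.000000, size 417
.rgw.root.backup-20160922T003618Z-WORKING/zone_info.default mtime 2016-09-22 00:36:59.000000, size 885
.rgw.root.backup-20160922T003618Z-WORKING/zone_names.default mtime 2016-09-22 00:36:59.000000, size 17
"""

LIVE = """
.rgw.root/zone_names.default mtime 2016-09-22 10:22:00.000000, size 17
.rgw.root/periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.1 mtime 2016-09-22 10:22:00.000000, size 804
.rgw.root/zone_info.default mtime 2016-09-22 10:22:26.000000, size 713
.rgw.root/zonegroup_info.default mtime 2016-09-22 10:22:36.000000, size 244
"""

changed = size_changes(parse_stat(BACKUP), parse_stat(LIVE))
for name, (old, new) in changed.items():
    print(f"{name}: {old} -> {new} bytes")
```

The three objects that shrank are the same ones whose mtimes moved to around 10:22 in the live pool.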

#3 Updated by Robin Johnson over 2 years ago

  • Subject changed from RGW loses realm/period/zonegroup/zone data: period overwritten? to RGW loses realm/period/zonegroup/zone data: period overwritten if somewhere in the cluster is still running Hammer
  • Priority changed from Urgent to Normal
  • Severity changed from 1 - critical to 2 - major
  • Release set to jewel

Ok, I traced where it is coming from; and I'd like to explicitly thank Orit for the suggestion.

It comes from two different but related places; what they have in common is that both are Hammer-era copies of radosgw-admin.
New runs of radosgw-admin, as well as long-running instances of it (in our case a user rm running for more than 30 days), can cause the region objects to be written into the .rgw.root pool again. Something else then comes along and converts them, overwriting the new period data in the process.

I think the Jewel RGW should have IGNORED the region objects when it detects that a conversion has already taken place, and issued a non-fatal warning about region objects being present.
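
The behaviour suggested above could look roughly like the following Python sketch. It only illustrates the decision logic; it is not the RGW code and not the fix that was eventually merged. The legacy object name prefixes (`region_info.`, `region_map`, `default.region`) are an assumption about how Hammer named its region objects, and `plan_conversion` is a hypothetical helper.

```python
# Assumption: Hammer-era region objects can be recognised by these names.
LEGACY_PREFIXES = ("region_info.", "region_map", "default.region")
# Realm and period objects, as seen in the .rgw.root listings in this ticket.
CONVERTED_PREFIXES = ("realms.", "periods.")


def plan_conversion(object_names):
    """Decide whether a region-to-zonegroup conversion should run.

    Returns (should_convert, warnings). If realm/period objects already
    exist, legacy region objects are ignored and only warned about.
    """
    legacy = sorted(n for n in object_names if n.startswith(LEGACY_PREFIXES))
    converted = any(n.startswith(CONVERTED_PREFIXES) for n in object_names)
    if legacy and converted:
        warnings = [
            f"ignoring legacy region object {n!r}: realm/period data already exists"
            for n in legacy
        ]
        return False, warnings
    return bool(legacy), []


# A pool that was already converted, then touched by a Hammer radosgw-admin.
mixed_pool = [
    "realms.6ed37b4b-66ba-4a6a-a464-80d7490cb310",
    "periods.68a89bcc-6769-498e-9da1-91fe9bfb72e5.1",
    "region_info.default",
    "zone_info.default",
]
convert, warnings = plan_conversion(mixed_pool)
print(convert, warnings)

# A pool that has only ever seen Hammer: conversion is still wanted.
hammer_pool = ["region_info.default", "region_map", "zone_info.default"]
print(plan_conversion(hammer_pool))
```

The point of the sketch is that the presence of realm/period objects takes precedence: once they exist, region objects never trigger another conversion.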

#4 Updated by Yehuda Sadeh over 2 years ago

  • Assignee set to Orit Wasserman

#5 Updated by Orit Wasserman over 2 years ago

  • Backport set to jewel

#6 Updated by Orit Wasserman over 2 years ago

  • Status changed from New to Pending Backport

#7 Updated by Loic Dachary over 2 years ago

  • Copied to Backport #17576: jewel: RGW loses realm/period/zonegroup/zone data: period overwritten if somewhere in the cluster is still running Hammer added

#9 Updated by Loic Dachary about 2 years ago

  • Status changed from Pending Backport to Resolved
