Bug #64255

closed

rgw-multisite: "Unable to load site config" error during multisite setup

Added by Shilpa MJ 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Unable to initialize site config.
2024-01-30T14:52:17.593-0500 7fbd95bf6c00 0 ERROR: zonegroup d466c415-af90-455c-b294-b66028ffb998 does not contain zone id dd64937f-db8a-4872-ac1e-8a6f74ae7017

Steps to reproduce:
1. create realm
2. create default zonegroup
3. create master zone
4. run period update --commit

the error occurs when running "period update --commit" after the master zone has been created.

the local_zonegroup does not contain the zone id, so the lookup fails here:

https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_zone.cc#L1298
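
for illustration, the failing check amounts to looking up the local zone's id among the zones listed in a zonegroup. the following is only a minimal sketch of that kind of lookup, not the code in rgw_zone.cc: the Zone and ZoneGroup structs, the find_zone() helper, and the -ENOENT return value are all hypothetical stand-ins.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>

// Hypothetical stand-ins for illustration; not Ceph's real types.
struct Zone {
  std::string id;
  std::string name;
};

struct ZoneGroup {
  std::string id;
  std::map<std::string, Zone> zones;  // keyed by zone id
};

// Look up zone_id in the zonegroup's zone map.
// Returns 0 and points *out at the zone on success.
// Returns -ENOENT (an assumed error code) when the zonegroup does not
// list the zone, which is the situation the ERROR line above reports.
int find_zone(const ZoneGroup& zg, const std::string& zone_id,
              const Zone** out) {
  assert(out != nullptr);
  auto it = zg.zones.find(zone_id);
  if (it == zg.zones.end()) {
    return -ENOENT;
  }
  *out = &it->second;
  return 0;
}
```

in this model, a zone that was created but never added to the zonegroup being searched makes find_zone() return the error, and a caller that treats that as fatal would stop before doing any further work.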

Actions #1

Updated by Casey Bodley 3 months ago

i added 'set -x' to test-rgw-multisite.sh to see which commands it runs:

diff --git a/src/test/rgw/test-rgw-multisite.sh b/src/test/rgw/test-rgw-multisite.sh
index a005b19e3da..68f7afae9cf 100755
--- a/src/test/rgw/test-rgw-multisite.sh
+++ b/src/test/rgw/test-rgw-multisite.sh
@@ -1,4 +1,5 @@
 #!/usr/bin/env bash
+set -x

 [ $# -lt 1 ] && echo "usage: $0 <num-clusters> [rgw parameters...]" && exit 1


running that with:
~/ceph/build $ MON=1 OSD=1 RGW=0 MDS=0 MGR=0 ../src/test/rgw/test-rgw-multisite.sh 2

shows that 'user create' is responsible for the error:
++ /home/cbodley/ceph/src/mrun c1 radosgw-admin user create --uid=zone.user --display-name=ZoneUser --access-key=1234567890 --secret=pencil --system
2024-01-31T13:30:35.018-0500 7f8210975ec0 -1 WARNING: all dangerous and experimental features are enabled.
2024-01-31T13:30:35.023-0500 7f8210975ec0 -1 WARNING: all dangerous and experimental features are enabled.
2024-01-31T13:30:35.043-0500 7f8210975ec0 -1 WARNING: all dangerous and experimental features are enabled.
Unable to initialize site config.
2024-01-31T13:30:35.060-0500 7f8210975ec0  0 ERROR: current period e60d3352-098d-411c-a7df-58bebb79aee1 does not contain zone id b2c2d10e-f946-4256-991d-e4e45ba044a1

Actions #2

Updated by Casey Bodley 3 months ago

with the cluster in this state, i commented out the call to SiteConfig::load() to see what RGWSI_Zone::do_start() does differently:

diff --git a/src/rgw/rgw_admin.cc b/src/rgw/rgw_admin.cc
index 8265852973f..fabe49d287e 100644
--- a/src/rgw/rgw_admin.cc
+++ b/src/rgw/rgw_admin.cc
@@ -4270,12 +4270,13 @@ int main(int argc, const char **argv)
                                              cfg, context_pool, *site);
     } else {
       site = std::make_unique<rgw::SiteConfig>();
+#if 0
       auto r = site->load(dpp(), null_yield, cfgstore.get(), localzonegroup_op);
       if (r < 0) {
        std::cerr << "Unable to initialize site config." << std::endl;
        exit(1);
       }
-
+#endif
       driver = DriverManager::get_storage(dpp(),
                                        g_ceph_context,
                                        cfg,

the same 'user create' command succeeds, with this log output from RGWSI_Zone:
2024-01-31T13:56:08.282-0500 7f553c69fec0  0 period (e60d3352-098d-411c-a7df-58bebb79aee1 does not have zone b2c2d10e-f946-4256-991d-e4e45ba044a1 configured
2024-01-31T13:56:08.282-0500 7f553c69fec0 20 searching for the correct realm
...
2024-01-31T13:56:08.294-0500 7f553c69fec0 20 zone zg1-1 found
2024-01-31T13:56:08.294-0500 7f553c69fec0  4 Realm:     earth                (cef9d447-8d78-4e2f-ba81-206bc52be7b5)
2024-01-31T13:56:08.294-0500 7f553c69fec0  4 ZoneGroup: zg1                  (0fc4a100-867c-4c3b-909d-1c8510bbb2b2)
2024-01-31T13:56:08.294-0500 7f553c69fec0  4 Zone:      zg1-1                (b2c2d10e-f946-4256-991d-e4e45ba044a1)
2024-01-31T13:56:08.294-0500 7f553c69fec0 10 cannot find current period zonegroup using local zonegroup configuration
2024-01-31T13:56:08.294-0500 7f553c69fec0 20 zonegroup zg1

Actions #3

Updated by Casey Bodley 3 months ago

  • Status changed from New to Fix Under Review
  • Assignee set to Casey Bodley
  • Pull request ID set to 55406
Actions #4

Updated by Shilpa MJ 3 months ago

thanks @Casey Bodley. with the fix installed, I'm still confused by the different outcomes between running the test-rgw-multisite.sh script and configuring multisite by hand.

when configuring by hand, the 'user create' command still fails:

smanjara:build$ ../src/mrun c1 radosgw-admin user create --uid=zone.user --display-name=ZoneUser --access-key 1234567890 --secret pencil --system
2024-02-01T09:52:51.582-0500 7f61cf1d9c00 -1 WARNING: all dangerous and experimental features are enabled.
2024-02-01T09:52:51.590-0500 7f61cf1d9c00 -1 WARNING: all dangerous and experimental features are enabled.
2024-02-01T09:52:51.610-0500 7f61cf1d9c00 -1 WARNING: all dangerous and experimental features are enabled.
2024-02-01T09:52:51.629-0500 7f61cf1d9c00 0 ERROR: current period 4d666f0c-ff2c-4f82-ac37-c23b7c2bc8e8 does not contain zone id 373a25b6-db7a-4c3c-9d0f-a93ab4036739
Unable to initialize site config.
2024-02-01T09:52:51.630-0500 7f61cf1d9c00 0 ERROR: zonegroup 1732f50f-679f-4295-b597-fae7b4a06753 does not contain zone id 373a25b6-db7a-4c3c-9d0f-a93ab4036739
smanjara:build$

in the run above, we exit without ever calling RGWSI_Zone::do_start(). but when I run the script, I can see the call to do_start(), which logs a different error (second line below), and the command succeeds:

2024-01-31T17:08:07.871-0500 7ff16066ac00 0 ERROR: current period eb2401f7-87de-4ddf-bf48-eb04928b0c5c does not contain zone id fe5daf03-777e-434d-93f8-ecbc619043a6
2024-01-31T17:08:07.914-0500 7ff16066ac00 0 period (eb2401f7-87de-4ddf-bf48-eb04928b0c5c does not have zone fe5daf03-777e-434d-93f8-ecbc619043a6 configured
"email": "",
"suspended": 0,
"max_buckets": 1000,
"subusers": [],
"keys": [ {
"user": "zone.user",
"access_key": "1234567890",
"secret_key": "pencil"
}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"system": true,
"default_placement": "",
"default_storage_class": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"temp_url_keys": [],
"type": "rgw",
"mfa_ids": []
}

Actions #5

Updated by Casey Bodley 3 months ago

ok, the fallback to SiteConfig::load_local_zonegroup() was just busted. it shouldn't call read_or_create_default_zonegroup() when we're in a realm.

even when this "user create" command succeeded, it was because read_or_create_default_zonegroup() created a new zonegroup named "default" and put our zone in it:

2024-02-02T15:06:50.906-0500 7f31c6ca4ec0  0 ERROR: current period 26c57655-b0da-4348-84eb-a67416ad9617 does not contain zone id a2d05b9e-34ef-421c-8c46-0d9f316eee37
2024-02-02T15:06:50.906-0500 7f31c6ca4ec0 10 cannot find current period zonegroup, using local zonegroup configuration
2024-02-02T15:06:50.906-0500 7f31c6ca4ec0  1 -- 192.168.245.130:0/2137288179 --> [v2:192.168.245.130:6800/3461689424,v1:192.168.245.130:6801/3461689424] -- osd_op(unknown.0.0:7 1.0 1:f4c53578:::zonegroups_names.default:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e9) v8 -- 0x55efc1700ab0 con 0x55efc16f9d70
2024-02-02T15:06:50.906-0500 7f31bdee86c0  1 -- 192.168.245.130:0/2137288179 <== osd.0 v2:192.168.245.130:6800/3461689424 7 ==== osd_op_reply(7 zonegroups_names.default [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 168+0+0 (crc 0 0 0) 0x7f31b40083b0 con 0x55efc16f9d70
2024-02-02T15:06:50.906-0500 7f31c6ca4ec0  1 -- 192.168.245.130:0/2137288179 --> [v2:192.168.245.130:6800/3461689424,v1:192.168.245.130:6801/3461689424] -- osd_op(unknown.0.0:8 1.0 1:6afb4555:::zonegroup_info.1bac1071-6238-46e9-a3a4-50260ff4b6a0:head [create,call version.set in=58b,writefull 0~392 in=392b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e9) v8 -- 0x55efc1703e40 con 0x55efc16f9d70
2024-02-02T15:06:50.909-0500 7f31bdee86c0  1 -- 192.168.245.130:0/2137288179 <== osd.0 v2:192.168.245.130:6800/3461689424 8 ==== osd_op_reply(8 zonegroup_info.1bac1071-6238-46e9-a3a4-50260ff4b6a0 [create,call,writefull 0~392] v9'18 uv18 ondisk = 0) v8 ==== 279+0+0 (crc 0 0 0) 0x7f31b40083b0 con 0x55efc16f9d70
2024-02-02T15:06:50.909-0500 7f31c6ca4ec0  1 -- 192.168.245.130:0/2137288179 --> [v2:192.168.245.130:6800/3461689424,v1:192.168.245.130:6801/3461689424] -- osd_op(unknown.0.0:9 1.0 1:f4c53578:::zonegroups_names.default:head [create,call version.set in=58b,writefull 0~46 in=46b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e9) v8 -- 0x55efc1704210 con 0x55efc16f9d70
2024-02-02T15:06:50.911-0500 7f31bdee86c0  1 -- 192.168.245.130:0/2137288179 <== osd.0 v2:192.168.245.130:6800/3461689424 9 ==== osd_op_reply(9 zonegroups_names.default [create,call,writefull 0~46] v9'19 uv19 ondisk = 0) v8 ==== 252+0+0 (crc 0 0 0) 0x7f31b40083b0 con 0x55efc16f9d70

if that "default" zonegroup already exists, the "user create" command fails because it doesn't find our zone in it:

2024-02-02T15:10:15.936-0500 7fde1a461ec0  0 ERROR: current period 820dfae7-ca82-4aeb-98ac-2efd8df5e226 does not contain zone id 0cbbcf95-8c57-4d2f-9008-1f071b7c9469
2024-02-02T15:10:15.936-0500 7fde1a461ec0 10 cannot find current period zonegroup, using local zonegroup configuration
2024-02-02T15:10:15.936-0500 7fde1a461ec0  1 -- 192.168.245.130:0/2947114708 --> [v2:192.168.245.130:6800/3819185730,v1:192.168.245.130:6801/3819185730] -- osd_op(unknown.0.0:7 1.0 1:f4c53578:::zonegroups_names.default:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e13) v8 -- 0x55a4fb928100 con 0x55a4fb9215d0
2024-02-02T15:10:15.937-0500 7fde111566c0  1 -- 192.168.245.130:0/2947114708 <== osd.0 v2:192.168.245.130:6800/3819185730 7 ==== osd_op_reply(7 zonegroups_names.default [read 0~46 out=46b] v0'0 uv5 ondisk = 0) v8 ==== 168+0+46 (crc 0 0 0) 0x7fddfc0083b0 con 0x55a4fb9215d0
2024-02-02T15:10:15.937-0500 7fde1a461ec0  1 -- 192.168.245.130:0/2947114708 --> [v2:192.168.245.130:6800/3819185730,v1:192.168.245.130:6801/3819185730] -- osd_op(unknown.0.0:8 1.0 1:5a5e6545:::zonegroup_info.0ffd7d18-1ff0-4c40-8557-a024f6de9b60:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e13) v8 -- 0x55a4fb929470 con 0x55a4fb9215d0
2024-02-02T15:10:15.937-0500 7fde111566c0  1 -- 192.168.245.130:0/2947114708 <== osd.0 v2:192.168.245.130:6800/3819185730 8 ==== osd_op_reply(8 zonegroup_info.0ffd7d18-1ff0-4c40-8557-a024f6de9b60 [call out=48b,read 0~398 out=398b] v0'0 uv4 ondisk = 0) v8 ==== 237+0+446 (crc 0 0 0) 0x7fddfc0083b0 con 0x55a4fb9215d0
2024-02-02T15:10:15.937-0500 7fde1a461ec0  0 ERROR: zonegroup 0ffd7d18-1ff0-4c40-8557-a024f6de9b60 does not contain zone id 0cbbcf95-8c57-4d2f-9008-1f071b7c9469
Unable to initialize site config.

instead of calling read_or_create_default_zonegroup() there, we need to call cfgstore->read_default_zonegroup(), which loads whatever zonegroup was created with the --default option:

2024-02-02T15:01:28.158-0500 7fe022856ec0  0 ERROR: current period 3533f90d-06b7-4a26-ab3a-f8789a6946f7 does not contain zone id 83d92c47-2f73-4aa1-9b24-66b4012c8158
2024-02-02T15:01:28.158-0500 7fe022856ec0 10 cannot find current period zonegroup, using local zonegroup configuration
2024-02-02T15:01:28.158-0500 7fe022856ec0  1 -- 192.168.245.130:0/4032087907 --> [v2:192.168.245.130:6800/2192218050,v1:192.168.245.130:6801/2192218050] -- osd_op(unknown.0.0:7 1.0 1:e49e4530:::default.zonegroup.c3670afd-2dd0-4f43-855e-cd7fd9c85bcc:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e9) v8 -- 0x56221d51e9c0 con 0x56221d517c50
2024-02-02T15:01:28.159-0500 7fe0194f16c0  1 -- 192.168.245.130:0/4032087907 <== osd.0 v2:192.168.245.130:6800/2192218050 7 ==== osd_op_reply(7 default.zonegroup.c3670afd-2dd0-4f43-855e-cd7fd9c85bcc [read 0~46 out=46b] v0'0 uv13 ondisk = 0) v8 ==== 198+0+46 (crc 0 0 0) 0x7fe0040083b0 con 0x56221d517c50
2024-02-02T15:01:28.159-0500 7fe022856ec0  1 -- 192.168.245.130:0/4032087907 --> [v2:192.168.245.130:6800/2192218050,v1:192.168.245.130:6801/2192218050] -- osd_op(unknown.0.0:8 1.0 1:46e31d43:::zonegroup_info.b988c038-da81-4012-9de5-5cdac2211379:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e9) v8 -- 0x56221d51fd30 con 0x56221d517c50
2024-02-02T15:01:28.160-0500 7fe0194f16c0  1 -- 192.168.245.130:0/4032087907 <== osd.0 v2:192.168.245.130:6800/2192218050 8 ==== osd_op_reply(8 zonegroup_info.b988c038-da81-4012-9de5-5cdac2211379 [call out=48b,read 0~444 out=444b] v0'0 uv17 ondisk = 0) v8 ==== 237+0+492 (crc 0 0 0) 0x7fe0040083b0 con 0x56221d517c50

i updated https://github.com/ceph/ceph/pull/55406 with this fix
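
to make the difference between the two fallbacks concrete, here is a toy model of the behaviour described in this comment. it is a sketch under stated assumptions, not the code from the pull request: ToyConfigStore, its three maps, and both member functions are hypothetical. they only mimic the object names seen in the logs above (zonegroup_info.<id>, zonegroups_names.<name>, default.zonegroup.<realm id>).

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for illustration; not Ceph's real types.
struct ZoneGroup {
  std::string id;
  std::string name;
  std::set<std::string> zone_ids;
};

struct ToyConfigStore {
  std::map<std::string, ZoneGroup> by_id;               // zonegroup_info.<id>
  std::map<std::string, std::string> id_by_name;        // zonegroups_names.<name>
  std::map<std::string, std::string> default_by_realm;  // default.zonegroup.<realm id>
  int next_id = 0;

  // Old fallback, as described above: read the zonegroup literally named
  // "default"; if it does not exist, create it and put the local zone in it.
  ZoneGroup read_or_create_default(const std::string& zone_id) {
    assert(!zone_id.empty());
    auto it = id_by_name.find("default");
    if (it != id_by_name.end()) {
      return by_id.at(it->second);  // may not contain zone_id at all
    }
    ZoneGroup zg{"generated-" + std::to_string(next_id++), "default", {zone_id}};
    by_id[zg.id] = zg;
    id_by_name["default"] = zg.id;
    return zg;
  }

  // Fixed fallback, as described above: follow the realm's default pointer,
  // i.e. whatever zonegroup was created with --default. Never creates anything.
  // Returns nullptr when the realm has no default zonegroup recorded.
  const ZoneGroup* read_default(const std::string& realm_id) const {
    auto it = default_by_realm.find(realm_id);
    if (it == default_by_realm.end()) {
      return nullptr;
    }
    auto zg = by_id.find(it->second);
    return zg == by_id.end() ? nullptr : &zg->second;
  }
};

// True when the zonegroup lists the given zone id.
bool contains_zone(const ZoneGroup& zg, const std::string& zone_id) {
  return zg.zone_ids.count(zone_id) > 0;
}
```

in this model the old fallback either writes a stray "default" zonegroup (the case where 'user create' appeared to succeed) or returns an existing "default" zonegroup that lacks the zone (the case where it failed), while the fixed fallback only ever reads the zonegroup the realm marks as its default.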

Actions #6

Updated by Casey Bodley 3 months ago

  • Status changed from Fix Under Review to Resolved