Bug #64255
rgw-multisite: "Unable to load site config" error during multisite setup
Description
Unable to initialize site config.
2024-01-30T14:52:17.593-0500 7fbd95bf6c00 0 ERROR: zonegroup d466c415-af90-455c-b294-b66028ffb998 does not contain zone id dd64937f-db8a-4872-ac1e-8a6f74ae7017
Steps to reproduce:
1. create realm
2. create default zonegroup
3. create master zone
4. run period update --commit
The error happens while running "period update --commit" after the master zone is created.
The local_zonegroup does not contain the zone id, so the lookup fails here:
https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_zone.cc#L1298
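To illustrate the kind of lookup that fails, here is a minimal sketch. It is not the code at the link above: `Zone`, `ZoneGroup`, and `find_zone` are hypothetical stand-ins, assuming only that a zonegroup holds its zones keyed by zone id and that a miss is what produces the "does not contain zone id" error.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified types for illustration only.
// They are not the real RGW classes.
struct Zone {
  std::string id;
  std::string name;
};

struct ZoneGroup {
  std::string id;
  std::map<std::string, Zone> zones;  // keyed by zone id
};

// Returns the zone if the zonegroup lists it, or std::nullopt otherwise.
// The nullopt case corresponds to the "zonegroup ... does not contain
// zone id ..." error in the log above.
inline std::optional<Zone> find_zone(const ZoneGroup& zg,
                                     const std::string& zone_id) {
  auto it = zg.zones.find(zone_id);
  if (it == zg.zones.end()) {
    return std::nullopt;
  }
  return it->second;
}
```

In this model, a zone that was created but never added to the zonegroup's map is simply not found, regardless of whether the zone object itself exists.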
Updated by Casey Bodley 3 months ago
i added -x to test-rgw-multisite.sh to see which commands it was running
diff --git a/src/test/rgw/test-rgw-multisite.sh b/src/test/rgw/test-rgw-multisite.sh
index a005b19e3da..68f7afae9cf 100755
--- a/src/test/rgw/test-rgw-multisite.sh
+++ b/src/test/rgw/test-rgw-multisite.sh
@@ -1,4 +1,5 @@
#!/usr/bin/env bash
+set -x
[ $# -lt 1 ] && echo "usage: $0 <num-clusters> [rgw parameters...]" && exit 1
running that with:
~/ceph/build $ MON=1 OSD=1 RGW=0 MDS=0 MGR=0 ../src/test/rgw/test-rgw-multisite.sh 2
shows that 'user create' is responsible for the error
++ /home/cbodley/ceph/src/mrun c1 radosgw-admin user create --uid=zone.user --display-name=ZoneUser --access-key=1234567890 --secret=pencil --system
2024-01-31T13:30:35.018-0500 7f8210975ec0 -1 WARNING: all dangerous and experimental features are enabled.
2024-01-31T13:30:35.023-0500 7f8210975ec0 -1 WARNING: all dangerous and experimental features are enabled.
2024-01-31T13:30:35.043-0500 7f8210975ec0 -1 WARNING: all dangerous and experimental features are enabled.
Unable to initialize site config.
2024-01-31T13:30:35.060-0500 7f8210975ec0 0 ERROR: current period e60d3352-098d-411c-a7df-58bebb79aee1 does not contain zone id b2c2d10e-f946-4256-991d-e4e45ba044a1
Updated by Casey Bodley 3 months ago
with the cluster in this state, i commented out the call to SiteConfig::load() to see what RGWSI_Zone::do_start() did differently
diff --git a/src/rgw/rgw_admin.cc b/src/rgw/rgw_admin.cc
index 8265852973f..fabe49d287e 100644
--- a/src/rgw/rgw_admin.cc
+++ b/src/rgw/rgw_admin.cc
@@ -4270,12 +4270,13 @@ int main(int argc, const char **argv)
cfg, context_pool, *site);
} else {
site = std::make_unique<rgw::SiteConfig>();
+#if 0
auto r = site->load(dpp(), null_yield, cfgstore.get(), localzonegroup_op);
if (r < 0) {
std::cerr << "Unable to initialize site config." << std::endl;
exit(1);
}
-
+#endif
driver = DriverManager::get_storage(dpp(),
g_ceph_context,
cfg,
the same 'user create' command succeeds, with this log output from RGWSI_Zone:
2024-01-31T13:56:08.282-0500 7f553c69fec0 0 period (e60d3352-098d-411c-a7df-58bebb79aee1 does not have zone b2c2d10e-f946-4256-991d-e4e45ba044a1 configured
2024-01-31T13:56:08.282-0500 7f553c69fec0 20 searching for the correct realm ...
2024-01-31T13:56:08.294-0500 7f553c69fec0 20 zone zg1-1 found
2024-01-31T13:56:08.294-0500 7f553c69fec0 4 Realm: earth (cef9d447-8d78-4e2f-ba81-206bc52be7b5)
2024-01-31T13:56:08.294-0500 7f553c69fec0 4 ZoneGroup: zg1 (0fc4a100-867c-4c3b-909d-1c8510bbb2b2)
2024-01-31T13:56:08.294-0500 7f553c69fec0 4 Zone: zg1-1 (b2c2d10e-f946-4256-991d-e4e45ba044a1)
2024-01-31T13:56:08.294-0500 7f553c69fec0 10 cannot find current period zonegroup using local zonegroup configuration
2024-01-31T13:56:08.294-0500 7f553c69fec0 20 zonegroup zg1
Updated by Casey Bodley 3 months ago
- Status changed from New to Fix Under Review
- Assignee set to Casey Bodley
- Pull request ID set to 55406
Updated by Shilpa MJ 3 months ago
thanks @Casey Bodley. But I'm quite confused by the different outcomes from running the test-rgw-multisite.sh script and configuring multisite by hand with the fix installed.
running user create command still fails here:
smanjara:build$ ../src/mrun c1 radosgw-admin user create --uid=zone.user --display-name=ZoneUser --access-key 1234567890 --secret pencil --system
2024-02-01T09:52:51.582-0500 7f61cf1d9c00 -1 WARNING: all dangerous and experimental features are enabled.
2024-02-01T09:52:51.590-0500 7f61cf1d9c00 -1 WARNING: all dangerous and experimental features are enabled.
2024-02-01T09:52:51.610-0500 7f61cf1d9c00 -1 WARNING: all dangerous and experimental features are enabled.
2024-02-01T09:52:51.629-0500 7f61cf1d9c00 0 ERROR: current period 4d666f0c-ff2c-4f82-ac37-c23b7c2bc8e8 does not contain zone id 373a25b6-db7a-4c3c-9d0f-a93ab4036739
Unable to initialize site config.
2024-02-01T09:52:51.630-0500 7f61cf1d9c00 0 ERROR: zonegroup 1732f50f-679f-4295-b597-fae7b4a06753 does not contain zone id 373a25b6-db7a-4c3c-9d0f-a93ab4036739
smanjara:build$
above, we don't reach RGWSI_Zone::do_start() at all; the command exits first. but when I run the script, I can see the call to do_start(), which logs the different error below, and the command succeeds:
2024-01-31T17:08:07.871-0500 7ff16066ac00 0 ERROR: current period eb2401f7-87de-4ddf-bf48-eb04928b0c5c does not contain zone id fe5daf03-777e-434d-93f8-ecbc619043a6
2024-01-31T17:08:07.914-0500 7ff16066ac00 0 period (eb2401f7-87de-4ddf-bf48-eb04928b0c5c does not have zone fe5daf03-777e-434d-93f8-ecbc619043a6 configured
"email": "",
"suspended": 0,
"max_buckets": 1000,
"subusers": [],
"keys": [
{
"user": "zone.user",
"access_key": "1234567890",
"secret_key": "pencil"
}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"system": true,
"default_placement": "",
"default_storage_class": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"temp_url_keys": [],
"type": "rgw",
"mfa_ids": []
}
Updated by Casey Bodley 3 months ago
ok, the fallback to SiteConfig::load_local_zonegroup() was just busted. it shouldn't call read_or_create_default_zonegroup() when we're in a realm
even when this "user create" command succeeded, it was because read_or_create_default_zonegroup() created a new zonegroup named "default" and put our zone in it:
2024-02-02T15:06:50.906-0500 7f31c6ca4ec0 0 ERROR: current period 26c57655-b0da-4348-84eb-a67416ad9617 does not contain zone id a2d05b9e-34ef-421c-8c46-0d9f316eee37
2024-02-02T15:06:50.906-0500 7f31c6ca4ec0 10 cannot find current period zonegroup, using local zonegroup configuration
2024-02-02T15:06:50.906-0500 7f31c6ca4ec0 1 -- 192.168.245.130:0/2137288179 --> [v2:192.168.245.130:6800/3461689424,v1:192.168.245.130:6801/3461689424] -- osd_op(unknown.0.0:7 1.0 1:f4c53578:::zonegroups_names.default:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e9) v8 -- 0x55efc1700ab0 con 0x55efc16f9d70
2024-02-02T15:06:50.906-0500 7f31bdee86c0 1 -- 192.168.245.130:0/2137288179 <== osd.0 v2:192.168.245.130:6800/3461689424 7 ==== osd_op_reply(7 zonegroups_names.default [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 168+0+0 (crc 0 0 0) 0x7f31b40083b0 con 0x55efc16f9d70
2024-02-02T15:06:50.906-0500 7f31c6ca4ec0 1 -- 192.168.245.130:0/2137288179 --> [v2:192.168.245.130:6800/3461689424,v1:192.168.245.130:6801/3461689424] -- osd_op(unknown.0.0:8 1.0 1:6afb4555:::zonegroup_info.1bac1071-6238-46e9-a3a4-50260ff4b6a0:head [create,call version.set in=58b,writefull 0~392 in=392b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e9) v8 -- 0x55efc1703e40 con 0x55efc16f9d70
2024-02-02T15:06:50.909-0500 7f31bdee86c0 1 -- 192.168.245.130:0/2137288179 <== osd.0 v2:192.168.245.130:6800/3461689424 8 ==== osd_op_reply(8 zonegroup_info.1bac1071-6238-46e9-a3a4-50260ff4b6a0 [create,call,writefull 0~392] v9'18 uv18 ondisk = 0) v8 ==== 279+0+0 (crc 0 0 0) 0x7f31b40083b0 con 0x55efc16f9d70
2024-02-02T15:06:50.909-0500 7f31c6ca4ec0 1 -- 192.168.245.130:0/2137288179 --> [v2:192.168.245.130:6800/3461689424,v1:192.168.245.130:6801/3461689424] -- osd_op(unknown.0.0:9 1.0 1:f4c53578:::zonegroups_names.default:head [create,call version.set in=58b,writefull 0~46 in=46b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e9) v8 -- 0x55efc1704210 con 0x55efc16f9d70
2024-02-02T15:06:50.911-0500 7f31bdee86c0 1 -- 192.168.245.130:0/2137288179 <== osd.0 v2:192.168.245.130:6800/3461689424 9 ==== osd_op_reply(9 zonegroups_names.default [create,call,writefull 0~46] v9'19 uv19 ondisk = 0) v8 ==== 252+0+0 (crc 0 0 0) 0x7f31b40083b0 con 0x55efc16f9d70
if that "default" zonegroup already exists, the "user create" command fails because it doesn't find our zone in it:
2024-02-02T15:10:15.936-0500 7fde1a461ec0 0 ERROR: current period 820dfae7-ca82-4aeb-98ac-2efd8df5e226 does not contain zone id 0cbbcf95-8c57-4d2f-9008-1f071b7c9469
2024-02-02T15:10:15.936-0500 7fde1a461ec0 10 cannot find current period zonegroup, using local zonegroup configuration
2024-02-02T15:10:15.936-0500 7fde1a461ec0 1 -- 192.168.245.130:0/2947114708 --> [v2:192.168.245.130:6800/3819185730,v1:192.168.245.130:6801/3819185730] -- osd_op(unknown.0.0:7 1.0 1:f4c53578:::zonegroups_names.default:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e13) v8 -- 0x55a4fb928100 con 0x55a4fb9215d0
2024-02-02T15:10:15.937-0500 7fde111566c0 1 -- 192.168.245.130:0/2947114708 <== osd.0 v2:192.168.245.130:6800/3819185730 7 ==== osd_op_reply(7 zonegroups_names.default [read 0~46 out=46b] v0'0 uv5 ondisk = 0) v8 ==== 168+0+46 (crc 0 0 0) 0x7fddfc0083b0 con 0x55a4fb9215d0
2024-02-02T15:10:15.937-0500 7fde1a461ec0 1 -- 192.168.245.130:0/2947114708 --> [v2:192.168.245.130:6800/3819185730,v1:192.168.245.130:6801/3819185730] -- osd_op(unknown.0.0:8 1.0 1:5a5e6545:::zonegroup_info.0ffd7d18-1ff0-4c40-8557-a024f6de9b60:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e13) v8 -- 0x55a4fb929470 con 0x55a4fb9215d0
2024-02-02T15:10:15.937-0500 7fde111566c0 1 -- 192.168.245.130:0/2947114708 <== osd.0 v2:192.168.245.130:6800/3819185730 8 ==== osd_op_reply(8 zonegroup_info.0ffd7d18-1ff0-4c40-8557-a024f6de9b60 [call out=48b,read 0~398 out=398b] v0'0 uv4 ondisk = 0) v8 ==== 237+0+446 (crc 0 0 0) 0x7fddfc0083b0 con 0x55a4fb9215d0
2024-02-02T15:10:15.937-0500 7fde1a461ec0 0 ERROR: zonegroup 0ffd7d18-1ff0-4c40-8557-a024f6de9b60 does not contain zone id 0cbbcf95-8c57-4d2f-9008-1f071b7c9469
Unable to initialize site config.
instead of calling read_or_create_default_zonegroup() there, we need to call cfgstore->read_default_zonegroup() which loads whatever zonegroup was created with the --default option:
2024-02-02T15:01:28.158-0500 7fe022856ec0 0 ERROR: current period 3533f90d-06b7-4a26-ab3a-f8789a6946f7 does not contain zone id 83d92c47-2f73-4aa1-9b24-66b4012c8158
2024-02-02T15:01:28.158-0500 7fe022856ec0 10 cannot find current period zonegroup, using local zonegroup configuration
2024-02-02T15:01:28.158-0500 7fe022856ec0 1 -- 192.168.245.130:0/4032087907 --> [v2:192.168.245.130:6800/2192218050,v1:192.168.245.130:6801/2192218050] -- osd_op(unknown.0.0:7 1.0 1:e49e4530:::default.zonegroup.c3670afd-2dd0-4f43-855e-cd7fd9c85bcc:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e9) v8 -- 0x56221d51e9c0 con 0x56221d517c50
2024-02-02T15:01:28.159-0500 7fe0194f16c0 1 -- 192.168.245.130:0/4032087907 <== osd.0 v2:192.168.245.130:6800/2192218050 7 ==== osd_op_reply(7 default.zonegroup.c3670afd-2dd0-4f43-855e-cd7fd9c85bcc [read 0~46 out=46b] v0'0 uv13 ondisk = 0) v8 ==== 198+0+46 (crc 0 0 0) 0x7fe0040083b0 con 0x56221d517c50
2024-02-02T15:01:28.159-0500 7fe022856ec0 1 -- 192.168.245.130:0/4032087907 --> [v2:192.168.245.130:6800/2192218050,v1:192.168.245.130:6801/2192218050] -- osd_op(unknown.0.0:8 1.0 1:46e31d43:::zonegroup_info.b988c038-da81-4012-9de5-5cdac2211379:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e9) v8 -- 0x56221d51fd30 con 0x56221d517c50
2024-02-02T15:01:28.160-0500 7fe0194f16c0 1 -- 192.168.245.130:0/4032087907 <== osd.0 v2:192.168.245.130:6800/2192218050 8 ==== osd_op_reply(8 zonegroup_info.b988c038-da81-4012-9de5-5cdac2211379 [call out=48b,read 0~444 out=444b] v0'0 uv17 ondisk = 0) v8 ==== 237+0+492 (crc 0 0 0) 0x7fe0040083b0 con 0x56221d517c50
i updated https://github.com/ceph/ceph/pull/55406 with this fix
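The difference between the two fallbacks described in this comment can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not the code from the pull request: `FakeConfigStore`, `read_or_create_default`, `read_default`, and `contains_zone` are hypothetical names, and the behaviour is inferred from the logs above (one path looks up the zonegroup literally named "default" and creates it if missing, the other loads whichever zonegroup was marked with --default).

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <set>
#include <string>

// Hypothetical, simplified zonegroup: just an id, a name, and its zone ids.
struct ZoneGroup {
  std::string id;
  std::string name;
  std::set<std::string> zone_ids;
};

// Hypothetical in-memory stand-in for the config store.
struct FakeConfigStore {
  std::map<std::string, ZoneGroup> by_name;  // zonegroups keyed by name
  std::optional<std::string> default_name;   // set when created with --default

  // Models the broken fallback: look up the zonegroup literally named
  // "default"; if it does not exist, create it and put the local zone in it.
  // If it already exists, it is returned unchanged.
  ZoneGroup& read_or_create_default(const std::string& local_zone_id) {
    auto [it, created] =
        by_name.try_emplace("default", ZoneGroup{"generated-id", "default", {}});
    if (created) {
      it->second.zone_ids.insert(local_zone_id);
    }
    return it->second;
  }

  // Models the fixed fallback: load whichever zonegroup was marked default,
  // or return nullptr if none was.
  ZoneGroup* read_default() {
    if (!default_name) {
      return nullptr;
    }
    auto it = by_name.find(*default_name);
    return it == by_name.end() ? nullptr : &it->second;
  }
};

inline bool contains_zone(const ZoneGroup& zg, const std::string& zone_id) {
  return zg.zone_ids.count(zone_id) > 0;
}
```

In this model the first call to `read_or_create_default` "succeeds" only because it creates a stray "default" zonegroup, and a later call for a different zone fails the `contains_zone` check, which mirrors the two log excerpts above. `read_default` never creates anything and returns the zonegroup the operator actually configured.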
Updated by Casey Bodley 3 months ago
- Status changed from Fix Under Review to Resolved