Bug #61889
cephadm: cephadm module crashes trying to migrate simple rgw specs
Status: Closed
Description
Specifically, those without a "spec" field. So this spec

service_type: rgw
service_id: rgw.1
service_name: rgw.rgw.1
placement:
  label: rgw

would cause the issue, but this spec

service_type: rgw
service_id: rgw.1
service_name: rgw.rgw.1
placement:
  label: rgw
spec:
  rgw_frontend_port: 8000

would not.
The migration code assumes the "spec" field will always be present; when it isn't, we get an unhandled KeyError that brings the module down.
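As an illustration only (the function and field names below are invented and do not match the actual cephadm migration code), a defensive version of such a migration would treat the "spec" section as optional, for example via dict.get():

# Hypothetical sketch of the failure mode and the defensive fix; names are
# invented for illustration and are not the real cephadm migration code.
def migrate_rgw_spec(spec_dict: dict) -> dict:
    # Buggy pattern: assumes a 'spec' section is always present.
    #   port = spec_dict['spec'].get('rgw_frontend_port')  # KeyError on "simple" specs

    # Defensive pattern: tolerate specs that have no 'spec' section at all.
    inner = spec_dict.get('spec', {})
    if 'rgw_frontend_port' in inner:
        inner['rgw_frontend_port'] = int(inner['rgw_frontend_port'])
    spec_dict['spec'] = inner
    return spec_dict

# A "simple" rgw spec like the one in the description no longer raises:
simple = {'service_type': 'rgw', 'service_id': 'rgw.1', 'placement': {'label': 'rgw'}}
migrate_rgw_spec(simple)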
This is pretty devastating for anyone with one of these simple rgw specs who tries to upgrade to a version that includes this migration: the module crashes silently, and after that any orch command fails
[ceph: root@vm-00 /]# ceph orch ps
Error ENOENT: Module not found
and the upgrade is just stuck in this broken state.
This wasn't caught in testing because the rgw service used in our upgrade tests has the "spec" field present, and the unit tests around this migration only covered specs that included the field as well.
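For comparison, a regression test for this case only needs to feed the migration a spec with no "spec" field. A minimal sketch, reusing the hypothetical migrate_rgw_spec helper above (this is not the actual cephadm test suite):

# Hypothetical regression test: the migration must not raise on a spec
# that lacks a 'spec' section. Assumes the migrate_rgw_spec sketch above.
def test_migrate_simple_rgw_spec():
    simple = {
        'service_type': 'rgw',
        'service_id': 'rgw.1',
        'service_name': 'rgw.rgw.1',
        'placement': {'label': 'rgw'},
    }
    migrated = migrate_rgw_spec(simple)
    assert 'spec' in migrated

test_migrate_simple_rgw_spec()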
Updated by Vikhyat Umrao 10 months ago
We hit this one during Perf&Scale - Workload Release criteria testing.
Updated by Adam King 10 months ago
Testing of the proposed workaround, starting from a 17.2.6 cluster with a single "simple" rgw spec.
[ceph: root@vm-00 /]# ceph orch ps
NAME                  HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
alertmanager.vm-00    vm-00  *:9093,9094  running (5m)   2m ago     7m   23.8M    -        0.23.0   ba2b418f427c  569eb3e2b571
crash.vm-00           vm-00               running (7m)   2m ago     7m   7398k    -        17.2.6   cc5b7b143311  10643e9a8823
crash.vm-01           vm-01               running (6m)   12s ago    6m   7428k    -        17.2.6   cc5b7b143311  c434068e1c61
crash.vm-02           vm-02               running (6m)   12s ago    6m   7428k    -        17.2.6   cc5b7b143311  f8cf2fb899be
grafana.vm-00         vm-00  *:3000       running (5m)   2m ago     7m   50.3M    -        8.3.5    dad864ee21e9  b6b6e08fea56
mgr.vm-00.hwowhq      vm-00  *:9283       running (8m)   2m ago     8m   475M     -        17.2.6   cc5b7b143311  d61a4186386e
mgr.vm-02.yyfezr      vm-02  *:8443,9283  running (6m)   12s ago    6m   425M     -        17.2.6   cc5b7b143311  bf4f2ca11041
mon.vm-00             vm-00               running (8m)   2m ago     8m   58.2M    2048M    17.2.6   cc5b7b143311  c8ad2cf5b82c
mon.vm-01             vm-01               running (6m)   12s ago    6m   37.2M    2048M    17.2.6   cc5b7b143311  3c241fd0bb7c
mon.vm-02             vm-02               running (6m)   12s ago    6m   31.5M    2048M    17.2.6   cc5b7b143311  3da331bfd277
node-exporter.vm-00   vm-00  *:9100       running (7m)   2m ago     7m   20.4M    -        1.3.1    1dbe0e931976  7185cf749433
node-exporter.vm-01   vm-01  *:9100       running (5m)   12s ago    5m   21.1M    -        1.3.1    1dbe0e931976  e3ae1d374c62
node-exporter.vm-02   vm-02  *:9100       running (5m)   12s ago    5m   19.7M    -        1.3.1    1dbe0e931976  f8ff4e2cbccd
osd.0                 vm-01               running (4m)   12s ago    4m   70.5M    12.1G    17.2.6   cc5b7b143311  43b4e12598c0
osd.1                 vm-00               running (4m)   2m ago     4m   67.8M    9847M    17.2.6   cc5b7b143311  6137ed092707
osd.2                 vm-02               running (4m)   12s ago    4m   67.4M    10.6G    17.2.6   cc5b7b143311  dda11971f4f7
osd.3                 vm-00               running (4m)   2m ago     4m   70.6M    9847M    17.2.6   cc5b7b143311  a72259b3f1e6
osd.4                 vm-01               running (3m)   12s ago    3m   73.8M    12.1G    17.2.6   cc5b7b143311  ee045846d17b
osd.5                 vm-02               running (4m)   12s ago    4m   72.1M    10.6G    17.2.6   cc5b7b143311  d01ba357f402
prometheus.vm-01      vm-01  *:9095       running (5m)   12s ago    5m   61.8M    -        2.33.4   514e6a882f6e  705f64078890
rgw.foo.vm-01.tuessq  vm-01  *:80         running (17s)  12s ago    16s  19.3M    -        17.2.6   cc5b7b143311  930b911d0a48
rgw.foo.vm-02.bezobe  vm-02  *:80         running (15s)  12s ago    15s  16.9M    -        17.2.6   cc5b7b143311  3dea74fef129
[ceph: root@vm-00 /]# ceph orch ls rgw
NAME     PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
rgw.foo  ?:80       2/2  16s ago    22s  count:2
[ceph: root@vm-00 /]# ceph orch ls rgw --export
service_type: rgw
service_id: foo
service_name: rgw.foo
placement:
  count: 2
As you can see from the output of "ceph orch ls rgw --export", the rgw service has no "spec" section. We can add that in without changing how the actual service behaves at all by being explicit about the frontend type. In this case, we were using "beast" (the default)
[ceph: root@vm-00 /]# ceph config dump | grep rgw
client.rgw.foo.vm-01.tuessq  basic  rgw_frontends  beast port=80  *
client.rgw.foo.vm-02.bezobe  basic  rgw_frontends  beast port=80
so we can edit the rgw service spec to specify the "rgw_frontend_type" to be "beast"
[ceph: root@vm-00 /]# ceph orch ls rgw --export > rgw.yaml
[ceph: root@vm-00 /]# vi rgw.yaml
[ceph: root@vm-00 /]# cat rgw.yaml
service_type: rgw
service_id: foo
service_name: rgw.foo
placement:
  count: 2
spec:
  rgw_frontend_type: beast
[ceph: root@vm-00 /]# ceph orch apply -i rgw.yaml
Scheduled rgw.foo update...
And if we now run "ceph orch ls rgw --export"
[ceph: root@vm-00 /]# ceph orch ls rgw --export
service_type: rgw
service_id: foo
service_name: rgw.foo
placement:
  count: 2
spec:
  rgw_frontend_type: beast
we can see that there is now a "spec" section that includes the new setting we provided, "rgw_frontend_type: beast". Now we should be able to upgrade fine.
[ceph: root@vm-00 /]# ceph orch upgrade start quay.io/ceph/ceph:v18.1.2
Initiating upgrade to quay.io/ceph/ceph:v18.1.2
[ceph: root@vm-00 /]# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v18.1.2",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": "Doing first pull of quay.io/ceph/ceph:v18.1.2 image",
    "is_paused": false
}
After a while, the upgrade completed with no issue
[ceph: root@vm-00 /]# ceph orch upgrade status
{
    "target_image": null,
    "in_progress": false,
    "which": "<unknown>",
    "services_complete": [],
    "progress": null,
    "message": "",
    "is_paused": false
}
[ceph: root@vm-00 /]# ceph versions
{
    "mon": {
        "ceph version 18.1.2 (a5c951305c2409669162c235d81981bdc60dd9e7) reef (rc)": 3
    },
    "mgr": {
        "ceph version 18.1.2 (a5c951305c2409669162c235d81981bdc60dd9e7) reef (rc)": 2
    },
    "osd": {
        "ceph version 18.1.2 (a5c951305c2409669162c235d81981bdc60dd9e7) reef (rc)": 6
    },
    "rgw": {
        "ceph version 18.1.2 (a5c951305c2409669162c235d81981bdc60dd9e7) reef (rc)": 2
    },
    "overall": {
        "ceph version 18.1.2 (a5c951305c2409669162c235d81981bdc60dd9e7) reef (rc)": 13
    }
}
So the workaround was successful.
Updated by Backport Bot 10 months ago
- Copied to Backport #61938: reef: cephadm: cephadm module crashes trying to migrate simple rgw specs added
Updated by Backport Bot 10 months ago
- Copied to Backport #61939: quincy: cephadm: cephadm module crashes trying to migrate simple rgw specs added
Updated by Sayalee Raut 10 months ago
Upgrade using the suggested workaround (i.e. adding the "spec" field to the RGW spec file prior to upgrade) was successful.
Upgraded RHCS 6.1 (17.2.6-70) to Reef (18.0.0-4822-g1811b69f).
Updated by Adam King about 2 months ago
- Status changed from Pending Backport to Resolved