Bug #61889 (closed)

cephadm: cephadm module crashes trying to migrate simple rgw specs

Added by Adam King 10 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
backport_processed
Backport:
reef, quincy
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Specifically, this affects rgw specs without a "spec" field. So this spec

service_type: rgw
service_id: rgw.1
service_name: rgw.rgw.1
placement:
  label: rgw

would cause the issue, but this spec

service_type: rgw
service_id: rgw.1
service_name: rgw.rgw.1
placement:
  label: rgw
spec:
  rgw_frontend_port: 8000

would not.

The migration code assumes the "spec" field will always be present, and we get an unhandled KeyError that brings the module down if it isn't there.
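
For illustration only, here is a minimal sketch of that failure mode (the function names and exact dict handling are assumptions, not the real cephadm migration code):

# Hypothetical sketch of the failure mode; not the real cephadm migration code.
def migrate_rgw_spec(spec_dict: dict) -> dict:
    # Unguarded access: raises KeyError for "simple" specs with no 'spec' section.
    nested = spec_dict['spec']
    nested.setdefault('rgw_frontend_type', 'beast')
    return spec_dict

def migrate_rgw_spec_guarded(spec_dict: dict) -> dict:
    # Guarded access: treat a missing 'spec' section as empty instead of crashing.
    nested = spec_dict.setdefault('spec', {})
    nested.setdefault('rgw_frontend_type', 'beast')
    return spec_dict

simple_spec = {
    'service_type': 'rgw',
    'service_id': 'rgw.1',
    'service_name': 'rgw.rgw.1',
    'placement': {'label': 'rgw'},
}

try:
    migrate_rgw_spec(dict(simple_spec))
except KeyError as exc:
    # Left unhandled inside a mgr module, this is what takes the module down.
    print('unhandled KeyError:', exc)

print(migrate_rgw_spec_guarded(dict(simple_spec)))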

This is pretty devastating for anyone with one of these simple rgw specs who tries to upgrade to a version that includes this migration: the module crashes silently, and after that any orch command fails

[ceph: root@vm-00 /]# ceph orch ps
Error ENOENT: Module not found

and the upgrade is just stuck in this broken state.

This wasn't caught in testing because the rgw service used in our upgrade tests has the "spec" field present, and the unit tests around this migration likewise only used specs that included the field.
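
For reference, the gap comes down to one extra test case. Below is a pytest-style sketch of what that case might look like (the helper and test names are hypothetical, not from the actual cephadm test suite; the guarded migration helper is the sketch from above, inlined so the file runs standalone):

# Hypothetical pytest-style sketch of the missing test case.
import pytest

def migrate_rgw_spec_guarded(spec_dict: dict) -> dict:
    # Guarded 'spec' access, as in the sketch above.
    spec_dict.setdefault('spec', {})
    return spec_dict

@pytest.mark.parametrize('spec', [
    # The case the existing tests covered: a 'spec' section is present.
    {'service_type': 'rgw', 'service_id': 'rgw.1',
     'placement': {'label': 'rgw'}, 'spec': {'rgw_frontend_port': 8000}},
    # The case that was missed: no 'spec' section at all.
    {'service_type': 'rgw', 'service_id': 'rgw.1',
     'placement': {'label': 'rgw'}},
])
def test_rgw_migration_tolerates_missing_spec_section(spec):
    migrated = migrate_rgw_spec_guarded(spec)
    assert 'spec' in migrated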


Related issues 2 (0 open, 2 closed)

Copied to Orchestrator - Backport #61938: reef: cephadm: cephadm module crashes trying to migrate simple rgw specs (Resolved, Adam King)
Copied to Orchestrator - Backport #61939: quincy: cephadm: cephadm module crashes trying to migrate simple rgw specs (Resolved, Adam King)
#1

Updated by Adam King 10 months ago

  • Pull request ID set to 52301
#2

Updated by Vikhyat Umrao 10 months ago

We hit this one during Perf&Scale Workload Release criteria testing.

#3

Updated by Adam King 10 months ago

Testing of proposed workaround. Starting from a 17.2.6 cluster with a single "simple" rgw spec.

[ceph: root@vm-00 /]# ceph orch ps
NAME                  HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID  
alertmanager.vm-00    vm-00  *:9093,9094  running (5m)      2m ago   7m    23.8M        -  0.23.0   ba2b418f427c  569eb3e2b571  
crash.vm-00           vm-00               running (7m)      2m ago   7m    7398k        -  17.2.6   cc5b7b143311  10643e9a8823  
crash.vm-01           vm-01               running (6m)     12s ago   6m    7428k        -  17.2.6   cc5b7b143311  c434068e1c61  
crash.vm-02           vm-02               running (6m)     12s ago   6m    7428k        -  17.2.6   cc5b7b143311  f8cf2fb899be  
grafana.vm-00         vm-00  *:3000       running (5m)      2m ago   7m    50.3M        -  8.3.5    dad864ee21e9  b6b6e08fea56  
mgr.vm-00.hwowhq      vm-00  *:9283       running (8m)      2m ago   8m     475M        -  17.2.6   cc5b7b143311  d61a4186386e  
mgr.vm-02.yyfezr      vm-02  *:8443,9283  running (6m)     12s ago   6m     425M        -  17.2.6   cc5b7b143311  bf4f2ca11041  
mon.vm-00             vm-00               running (8m)      2m ago   8m    58.2M    2048M  17.2.6   cc5b7b143311  c8ad2cf5b82c  
mon.vm-01             vm-01               running (6m)     12s ago   6m    37.2M    2048M  17.2.6   cc5b7b143311  3c241fd0bb7c  
mon.vm-02             vm-02               running (6m)     12s ago   6m    31.5M    2048M  17.2.6   cc5b7b143311  3da331bfd277  
node-exporter.vm-00   vm-00  *:9100       running (7m)      2m ago   7m    20.4M        -  1.3.1    1dbe0e931976  7185cf749433  
node-exporter.vm-01   vm-01  *:9100       running (5m)     12s ago   5m    21.1M        -  1.3.1    1dbe0e931976  e3ae1d374c62  
node-exporter.vm-02   vm-02  *:9100       running (5m)     12s ago   5m    19.7M        -  1.3.1    1dbe0e931976  f8ff4e2cbccd  
osd.0                 vm-01               running (4m)     12s ago   4m    70.5M    12.1G  17.2.6   cc5b7b143311  43b4e12598c0  
osd.1                 vm-00               running (4m)      2m ago   4m    67.8M    9847M  17.2.6   cc5b7b143311  6137ed092707  
osd.2                 vm-02               running (4m)     12s ago   4m    67.4M    10.6G  17.2.6   cc5b7b143311  dda11971f4f7  
osd.3                 vm-00               running (4m)      2m ago   4m    70.6M    9847M  17.2.6   cc5b7b143311  a72259b3f1e6  
osd.4                 vm-01               running (3m)     12s ago   3m    73.8M    12.1G  17.2.6   cc5b7b143311  ee045846d17b  
osd.5                 vm-02               running (4m)     12s ago   4m    72.1M    10.6G  17.2.6   cc5b7b143311  d01ba357f402  
prometheus.vm-01      vm-01  *:9095       running (5m)     12s ago   5m    61.8M        -  2.33.4   514e6a882f6e  705f64078890  
rgw.foo.vm-01.tuessq  vm-01  *:80         running (17s)    12s ago  16s    19.3M        -  17.2.6   cc5b7b143311  930b911d0a48  
rgw.foo.vm-02.bezobe  vm-02  *:80         running (15s)    12s ago  15s    16.9M        -  17.2.6   cc5b7b143311  3dea74fef129  

[ceph: root@vm-00 /]# ceph orch ls rgw
NAME     PORTS  RUNNING  REFRESHED  AGE  PLACEMENT  
rgw.foo  ?:80       2/2  16s ago    22s  count:2    

[ceph: root@vm-00 /]# ceph orch ls rgw --export
service_type: rgw
service_id: foo
service_name: rgw.foo
placement:
  count: 2

As you can see from the output of "ceph orch ls rgw --export", the rgw service has no "spec" section. We can add one without changing how the actual service behaves at all by being explicit about the frontend type. In this case, we were using "beast" (the default):

[ceph: root@vm-00 /]# ceph config dump | grep rgw                                                                       
client.rgw.foo.vm-01.tuessq              basic     rgw_frontends                          beast port=80                                                                              * 
client.rgw.foo.vm-02.bezobe              basic     rgw_frontends                          beast port=80              

So we can edit the rgw service spec to set "rgw_frontend_type" to "beast":

[ceph: root@vm-00 /]# ceph orch ls rgw --export > rgw.yaml

[ceph: root@vm-00 /]# vi rgw.yaml 

[ceph: root@vm-00 /]# cat rgw.yaml 
service_type: rgw
service_id: foo
service_name: rgw.foo
placement:
  count: 2
spec:
  rgw_frontend_type: beast

[ceph: root@vm-00 /]# ceph orch apply -i rgw.yaml 
Scheduled rgw.foo update...

And if we now run "ceph orch ls rgw --export"

[ceph: root@vm-00 /]# ceph orch ls rgw --export
service_type: rgw
service_id: foo
service_name: rgw.foo
placement:
  count: 2
spec:
  rgw_frontend_type: beast

we can see that there is now a "spec" section that includes the new setting we provided, "rgw_frontend_type: beast". Now we should be able to upgrade without issue.

[ceph: root@vm-00 /]# ceph orch upgrade start quay.io/ceph/ceph:v18.1.2
Initiating upgrade to quay.io/ceph/ceph:v18.1.2

[ceph: root@vm-00 /]# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v18.1.2",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": "Doing first pull of quay.io/ceph/ceph:v18.1.2 image",
    "is_paused": false
}

After a while, the upgrade completed with no issues.

[ceph: root@vm-00 /]# ceph orch upgrade status
{
    "target_image": null,
    "in_progress": false,
    "which": "<unknown>",
    "services_complete": [],
    "progress": null,
    "message": "",
    "is_paused": false
}

[ceph: root@vm-00 /]# ceph versions
{
    "mon": {
        "ceph version 18.1.2 (a5c951305c2409669162c235d81981bdc60dd9e7) reef (rc)": 3
    },
    "mgr": {
        "ceph version 18.1.2 (a5c951305c2409669162c235d81981bdc60dd9e7) reef (rc)": 2
    },
    "osd": {
        "ceph version 18.1.2 (a5c951305c2409669162c235d81981bdc60dd9e7) reef (rc)": 6
    },
    "rgw": {
        "ceph version 18.1.2 (a5c951305c2409669162c235d81981bdc60dd9e7) reef (rc)": 2
    },
    "overall": {
        "ceph version 18.1.2 (a5c951305c2409669162c235d81981bdc60dd9e7) reef (rc)": 13
    }
}

So the workaround was successful.

#4

Updated by Adam King 10 months ago

  • Status changed from New to Pending Backport
#5

Updated by Backport Bot 10 months ago

  • Copied to Backport #61938: reef: cephadm: cephadm module crashes trying to migrate simple rgw specs added
#6

Updated by Backport Bot 10 months ago

  • Copied to Backport #61939: quincy: cephadm: cephadm module crashes trying to migrate simple rgw specs added
#7

Updated by Backport Bot 10 months ago

  • Tags set to backport_processed
#8

Updated by Sayalee Raut 10 months ago

Upgrade using the suggested workaround (i.e. adding a "spec" field to the RGW spec file prior to the upgrade) was successful.

Upgraded RHCS 6.1 (17.2.6-70) to Reef (18.0.0-4822-g1811b69f).

#9

Updated by Adam King about 2 months ago

  • Status changed from Pending Backport to Resolved