Bug #59529
cluster upgrade stuck with OSDs and MDSs not upgraded.
Description
/a/yuriw-2023-04-06_15:37:58-rados-wip-yuri3-testing-2023-04-04-0833-pacific-distro-default-smithi/7234310
2023-04-06T18:20:00.358 INFO:journalctl@ceph.mon.smithi043.smithi043.stdout:Apr 06 18:20:00 smithi043 ceph-f47f993e-d4a4-11ed-9aff-001a4aab830c-mon.smithi043[136038]: debug 2023-04-06T18:19:59.998+0000 7f028daa2700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR 1 filesystem with deprecated feature inline_data; 1 filesystem is offline; 1 filesystem is online with fewer MDS than max_mds
The job runs for a long time and eventually dies.
Updated by Venky Shankar 12 months ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Assignee set to Venky Shankar
- Target version set to v19.0.0
- Backport changed from pacific to reef,quincy,pacific
Updated by Laura Flores 12 months ago
- Related to Bug #59530: mgr-nfs-upgrade: mds.foofs has 0/2 added
Updated by Laura Flores 12 months ago
/a/yuriw-2023-04-26_20:20:05-rados-pacific-release-distro-default-smithi/7255328
Updated by Laura Flores 12 months ago
/a/yuriw-2023-04-25_18:56:08-rados-wip-yuri5-testing-2023-04-25-0837-pacific-distro-default-smithi/7252383
Updated by Laura Flores 12 months ago
/a/yuriw-2023-04-26_01:16:19-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7253764
Updated by Laura Flores 11 months ago
/a/yuriw-2023-05-17_19:39:18-rados-wip-yuri5-testing-2023-05-09-1324-pacific-distro-default-smithi/7276764
Updated by Venky Shankar 10 months ago
- Project changed from CephFS to Orchestrator
- Subject changed from mds_upgrade_sequence: overall HEALTH_ERR 1 filesystem with deprecated feature inline_data; 1 filesystem is offline; 1 filesystem is online with fewer MDS than max_mds to cluster upgrade stuck with OSDs and MDSs not upgraded.
- Category deleted (Correctness/Safety)
- Assignee changed from Venky Shankar to Adam King
Thanks for the bug report, Laura, and apologies for the delay in looking into it (I was on PTO for a while).
This looks like a stuck upgrade and is not related to CephFS.
The workunit (fsstress) runs alongside the cluster upgrade:
2023-05-17T20:13:18.836 INFO:tasks.workunit:Running workunits matching suites/fsstress.sh on client.1...
2023-05-17T20:13:18.837 INFO:tasks.workunit:Running workunit suites/fsstress.sh...
2023-05-17T20:13:18.838 DEBUG:teuthology.orchestra.run.smithi177:workunit test suites/fsstress.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && cd -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d5dc66d8141bd85d12cfcadf3f36a6dfd7a09823 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="1" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.1 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.1 CEPH_MNT=/home/ubuntu/cephtest/mnt.1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.1/qa/workunits/suites/fsstress.sh
...
2023-05-17T20:13:21.356 INFO:tasks.workunit:Running workunits matching suites/fsstress.sh on client.0...
2023-05-17T20:13:21.357 INFO:tasks.workunit:Running workunit suites/fsstress.sh...
2023-05-17T20:13:21.357 DEBUG:teuthology.orchestra.run.smithi072:workunit test suites/fsstress.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d5dc66d8141bd85d12cfcadf3f36a6dfd7a09823 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh
The workunit finishes:
2023-05-17T20:16:02.736 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.0...
...
2023-05-17T20:16:06.492 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.1...
while the cluster upgrade is still ongoing until the max job timeout is hit:
2023-05-18T07:52:44.303 INFO:teuthology.orchestra.run.smithi072.stderr:2023-05-18T07:52:44.299+0000 7fc0ea0c7700 1 -- 172.21.15.72:0/3817295648 --> [v2:172.21.15.72:6800/1564434583,v1:172.21.15.72:6801/1564434583] -- mgr_command(tid 0: {"prefix": "orch ps", "target": [ "mon-mgr", ""]}) v1 -- 0x7fc0e41010d0 con 0x7fc0cc060c20
2023-05-18T07:52:44.312 INFO:teuthology.orchestra.run.smithi072.stderr:2023-05-18T07:52:44.308+0000 7fc0cbfff700 1 -- 172.21.15.72:0/3817295648 <== mgr.14590 v2:172.21.15.72:6800/1564434583 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+3388 (secure 0 0 0) 0x7fc0e41010d0 con 0x7fc0cc060c20
2023-05-18T07:52:44.312 INFO:teuthology.orchestra.run.smithi072.stdout:NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
2023-05-18T07:52:44.312 INFO:teuthology.orchestra.run.smithi072.stdout:alertmanager.smithi072 smithi072 *:9093,9094 running (11h) 11h ago 11h 24.3M - 0.20.0 0881eb8f169f 0d5497bd990d
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:crash.smithi072 smithi072 running (11h) 11h ago 11h 7298k - 16.2.12-168-gd5dc66d8 2435984f4574 d7e40b28d101
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:crash.smithi177 smithi177 running (11h) 11h ago 11h 7306k - 16.2.12-168-gd5dc66d8 2435984f4574 1c8c105c2019
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:grafana.smithi072 smithi072 *:3000 running (11h) 11h ago 11h 35.8M - 6.7.4 557c83e11646 509d233815b4
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi072.ljmwsc smithi072 running (11h) 11h ago 11h 764M - 16.2.4 8d91d370c2b8 ddf349b60c1b
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi072.stbqfp smithi072 running (11h) 11h ago 11h 15.1M - 16.2.4 8d91d370c2b8 10a6a6f5900a
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi177.gtiugp smithi177 running (11h) 11h ago 11h 6259M - 16.2.4 8d91d370c2b8 c8d8f0b9c5d8
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi177.lumzel smithi177 running (11h) 11h ago 11h 13.9M - 16.2.4 8d91d370c2b8 ae7657c554c8
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mgr.smithi072.wcffso smithi072 *:8443,9283 running (11h) 11h ago 11h 439M - 16.2.12-168-gd5dc66d8 2435984f4574 624f26d33589
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mgr.smithi177.twjwnu smithi177 *:8443,9283 running (11h) 11h ago 11h 392M - 16.2.12-168-gd5dc66d8 2435984f4574 fd16cfc9584c
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mon.smithi072 smithi072 running (11h) 11h ago 11h 44.4M 2048M 16.2.12-168-gd5dc66d8 2435984f4574 d8effd11e256
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mon.smithi177 smithi177 running (11h) 11h ago 11h 32.2M 2048M 16.2.12-168-gd5dc66d8 2435984f4574 e55be6f59c54
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:node-exporter.smithi072 smithi072 *:9100 running (11h) 11h ago 11h 18.3M - 0.18.1 e5a616e4b9cf c4415650a7ab
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:node-exporter.smithi177 smithi177 *:9100 running (11h) 11h ago 11h 18.2M - 0.18.1 e5a616e4b9cf 34a2077f2cd4
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:osd.0 smithi072 starting - - - 4096M <unknown> <unknown> <unknown>
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.1 smithi072 running (11h) 11h ago 11h 729M 4096M 16.2.4 8d91d370c2b8 19c574e54f52
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.2 smithi072 running (11h) 11h ago 11h 619M 4096M 16.2.4 8d91d370c2b8 00f7b71cb444
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.3 smithi177 running (11h) 11h ago 11h 887M 4096M 16.2.4 8d91d370c2b8 e65ba5fe80fa
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.4 smithi177 running (11h) 11h ago 11h 744M 4096M 16.2.4 8d91d370c2b8 2d1dc9dd66c0
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.5 smithi177 running (11h) 11h ago 11h 713M 4096M 16.2.4 8d91d370c2b8 1106cd5b3335
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:prometheus.smithi072 smithi072 *:9095 running (11h) 11h ago 11h 54.3M - 2.18.1 de242295e225 bc39b66dbede
As can be seen, not all daemons are upgraded: the MDS and OSD upgrades haven't even started.
I'm moving this to the Orchestrator component for @adking to have a look.
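For triage of cases like this, the version skew is easier to spot mechanically than by eyeballing the `orch ps` table. Below is a minimal sketch that filters daemons still behind a target version from `ceph orch ps --format json`-style records; the field names (`daemon_type`, `daemon_id`, `version`) are assumed to follow cephadm's JSON output, and the sample data is a hand-made subset mirroring the table above, not the actual job output.

```python
# Sketch (assumed cephadm JSON field names): list daemons whose reported
# version differs from the upgrade target.
import json

def stale_daemons(orch_ps_json: str, target_version: str):
    """Return (name, version) pairs for daemons not yet on target_version."""
    stale = []
    for d in json.loads(orch_ps_json):
        name = f"{d['daemon_type']}.{d['daemon_id']}"
        # A daemon stuck in "starting" may report no version at all,
        # like osd.0 in the log above.
        version = d.get("version") or "<unknown>"
        if version != target_version:
            stale.append((name, version))
    return stale

# Hand-made sample mirroring the table: mon/mgr upgraded, MDS/OSD not.
sample = json.dumps([
    {"daemon_type": "mon", "daemon_id": "smithi072", "version": "16.2.12-168-gd5dc66d8"},
    {"daemon_type": "mgr", "daemon_id": "smithi072.wcffso", "version": "16.2.12-168-gd5dc66d8"},
    {"daemon_type": "mds", "daemon_id": "cephfs.smithi072.ljmwsc", "version": "16.2.4"},
    {"daemon_type": "osd", "daemon_id": "0", "version": None},
    {"daemon_type": "osd", "daemon_id": "1", "version": "16.2.4"},
])

for name, ver in stale_daemons(sample, "16.2.12-168-gd5dc66d8"):
    print(name, ver)
```

On the sample this reports only the MDS and OSD daemons, which matches the state the log shows at the job timeout.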
Updated by Laura Flores 10 months ago
Thanks for taking a look Venky!
Based on the log snippet you shared, it might be a duplicate of, or related to, https://tracker.ceph.com/issues/59604.
Updated by Laura Flores 10 months ago
- Related to Bug #59604: upgrade: unkown ceph version causes upgrade to get stuck added