Bug #59529
cluster upgrade stuck with OSDs and MDSs not upgraded.
Description
/a/yuriw-2023-04-06_15:37:58-rados-wip-yuri3-testing-2023-04-04-0833-pacific-distro-default-smithi/7234310
2023-04-06T18:20:00.358 INFO:journalctl@ceph.mon.smithi043.smithi043.stdout:Apr 06 18:20:00 smithi043 ceph-f47f993e-d4a4-11ed-9aff-001a4aab830c-mon.smithi043[136038]: debug 2023-04-06T18:19:59.998+0000 7f028daa2700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR 1 filesystem with deprecated feature inline_data; 1 filesystem is offline; 1 filesystem is online with fewer MDS than max_mds
The job runs for a long time and eventually dies.
Updated by Venky Shankar 12 months ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Assignee set to Venky Shankar
- Target version set to v19.0.0
- Backport changed from pacific to reef,quincy,pacific
Updated by Laura Flores 12 months ago
- Related to Bug #59530: mgr-nfs-upgrade: mds.foofs has 0/2 added
Updated by Laura Flores 12 months ago
/a/yuriw-2023-04-26_20:20:05-rados-pacific-release-distro-default-smithi/7255328
Updated by Laura Flores 12 months ago
/a/yuriw-2023-04-25_18:56:08-rados-wip-yuri5-testing-2023-04-25-0837-pacific-distro-default-smithi/7252383
Updated by Laura Flores 12 months ago
/a/yuriw-2023-04-26_01:16:19-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7253764
Updated by Laura Flores 11 months ago
/a/yuriw-2023-05-17_19:39:18-rados-wip-yuri5-testing-2023-05-09-1324-pacific-distro-default-smithi/7276764
Updated by Venky Shankar 10 months ago
- Project changed from CephFS to Orchestrator
- Subject changed from mds_upgrade_sequence: overall HEALTH_ERR 1 filesystem with deprecated feature inline_data; 1 filesystem is offline; 1 filesystem is online with fewer MDS than max_mds to cluster upgrade stuck with OSDs and MDSs not upgraded.
- Category deleted (Correctness/Safety)
- Assignee changed from Venky Shankar to Adam King
Thanks for the bug report, Laura, and apologies for the delay in looking into it (I was on PTO for a while).
This looks like a stuck upgrade and is not related to CephFS.
The workunit (fsstress) runs alongside the cluster upgrade:
2023-05-17T20:13:18.836 INFO:tasks.workunit:Running workunits matching suites/fsstress.sh on client.1...
2023-05-17T20:13:18.837 INFO:tasks.workunit:Running workunit suites/fsstress.sh...
2023-05-17T20:13:18.838 DEBUG:teuthology.orchestra.run.smithi177:workunit test suites/fsstress.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && cd -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d5dc66d8141bd85d12cfcadf3f36a6dfd7a09823 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="1" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.1 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.1 CEPH_MNT=/home/ubuntu/cephtest/mnt.1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.1/qa/workunits/suites/fsstress.sh
...
2023-05-17T20:13:21.356 INFO:tasks.workunit:Running workunits matching suites/fsstress.sh on client.0...
2023-05-17T20:13:21.357 INFO:tasks.workunit:Running workunit suites/fsstress.sh...
2023-05-17T20:13:21.357 DEBUG:teuthology.orchestra.run.smithi072:workunit test suites/fsstress.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d5dc66d8141bd85d12cfcadf3f36a6dfd7a09823 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh
The workunit finishes:
2023-05-17T20:16:02.736 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.0...
...
2023-05-17T20:16:06.492 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.1...
while the cluster upgrade is still ongoing until the max job timeout is hit:
2023-05-18T07:52:44.303 INFO:teuthology.orchestra.run.smithi072.stderr:2023-05-18T07:52:44.299+0000 7fc0ea0c7700 1 -- 172.21.15.72:0/3817295648 --> [v2:172.21.15.72:6800/1564434583,v1:172.21.15.72:6801/1564434583] -- mgr_command(tid 0: {"prefix": "orch ps", "target": [ "mon-mgr", ""]}) v1 -- 0x7fc0e41010d0 con 0x7fc0cc060c20
2023-05-18T07:52:44.312 INFO:teuthology.orchestra.run.smithi072.stderr:2023-05-18T07:52:44.308+0000 7fc0cbfff700 1 -- 172.21.15.72:0/3817295648 <== mgr.14590 v2:172.21.15.72:6800/1564434583 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+3388 (secure 0 0 0) 0x7fc0e41010d0 con 0x7fc0cc060c20
2023-05-18T07:52:44.312 INFO:teuthology.orchestra.run.smithi072.stdout:NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
2023-05-18T07:52:44.312 INFO:teuthology.orchestra.run.smithi072.stdout:alertmanager.smithi072 smithi072 *:9093,9094 running (11h) 11h ago 11h 24.3M - 0.20.0 0881eb8f169f 0d5497bd990d
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:crash.smithi072 smithi072 running (11h) 11h ago 11h 7298k - 16.2.12-168-gd5dc66d8 2435984f4574 d7e40b28d101
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:crash.smithi177 smithi177 running (11h) 11h ago 11h 7306k - 16.2.12-168-gd5dc66d8 2435984f4574 1c8c105c2019
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:grafana.smithi072 smithi072 *:3000 running (11h) 11h ago 11h 35.8M - 6.7.4 557c83e11646 509d233815b4
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi072.ljmwsc smithi072 running (11h) 11h ago 11h 764M - 16.2.4 8d91d370c2b8 ddf349b60c1b
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi072.stbqfp smithi072 running (11h) 11h ago 11h 15.1M - 16.2.4 8d91d370c2b8 10a6a6f5900a
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi177.gtiugp smithi177 running (11h) 11h ago 11h 6259M - 16.2.4 8d91d370c2b8 c8d8f0b9c5d8
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi177.lumzel smithi177 running (11h) 11h ago 11h 13.9M - 16.2.4 8d91d370c2b8 ae7657c554c8
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mgr.smithi072.wcffso smithi072 *:8443,9283 running (11h) 11h ago 11h 439M - 16.2.12-168-gd5dc66d8 2435984f4574 624f26d33589
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mgr.smithi177.twjwnu smithi177 *:8443,9283 running (11h) 11h ago 11h 392M - 16.2.12-168-gd5dc66d8 2435984f4574 fd16cfc9584c
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mon.smithi072 smithi072 running (11h) 11h ago 11h 44.4M 2048M 16.2.12-168-gd5dc66d8 2435984f4574 d8effd11e256
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mon.smithi177 smithi177 running (11h) 11h ago 11h 32.2M 2048M 16.2.12-168-gd5dc66d8 2435984f4574 e55be6f59c54
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:node-exporter.smithi072 smithi072 *:9100 running (11h) 11h ago 11h 18.3M - 0.18.1 e5a616e4b9cf c4415650a7ab
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:node-exporter.smithi177 smithi177 *:9100 running (11h) 11h ago 11h 18.2M - 0.18.1 e5a616e4b9cf 34a2077f2cd4
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:osd.0 smithi072 starting - - - 4096M <unknown> <unknown> <unknown>
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.1 smithi072 running (11h) 11h ago 11h 729M 4096M 16.2.4 8d91d370c2b8 19c574e54f52
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.2 smithi072 running (11h) 11h ago 11h 619M 4096M 16.2.4 8d91d370c2b8 00f7b71cb444
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.3 smithi177 running (11h) 11h ago 11h 887M 4096M 16.2.4 8d91d370c2b8 e65ba5fe80fa
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.4 smithi177 running (11h) 11h ago 11h 744M 4096M 16.2.4 8d91d370c2b8 2d1dc9dd66c0
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.5 smithi177 running (11h) 11h ago 11h 713M 4096M 16.2.4 8d91d370c2b8 1106cd5b3335
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:prometheus.smithi072 smithi072 *:9095 running (11h) 11h ago 11h 54.3M - 2.18.1 de242295e225 bc39b66dbede
As can be seen, not all daemons are upgraded: the MDS and OSD upgrades haven't even started.
I'm moving this to the Orchestrator component for @adking to have a look.
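For triage of cases like this, the version skew is easier to spot mechanically than by eyeballing the `orch ps` table. Below is a minimal sketch that filters daemons still behind a target version from `ceph orch ps --format json`-style records; the field names (`daemon_type`, `daemon_id`, `version`) are assumed to follow cephadm's JSON output, and the sample data is a hand-made subset mirroring the table above, not the actual job output.

```python
# Sketch (assumed cephadm JSON field names): list daemons whose reported
# version differs from the upgrade target.
import json

def stale_daemons(orch_ps_json: str, target_version: str):
    """Return (name, version) pairs for daemons not yet on target_version."""
    stale = []
    for d in json.loads(orch_ps_json):
        name = f"{d['daemon_type']}.{d['daemon_id']}"
        # A daemon stuck in "starting" may report no version at all,
        # like osd.0 in the log above.
        version = d.get("version") or "<unknown>"
        if version != target_version:
            stale.append((name, version))
    return stale

# Hand-made sample mirroring the table: mon/mgr upgraded, MDS/OSD not.
sample = json.dumps([
    {"daemon_type": "mon", "daemon_id": "smithi072", "version": "16.2.12-168-gd5dc66d8"},
    {"daemon_type": "mgr", "daemon_id": "smithi072.wcffso", "version": "16.2.12-168-gd5dc66d8"},
    {"daemon_type": "mds", "daemon_id": "cephfs.smithi072.ljmwsc", "version": "16.2.4"},
    {"daemon_type": "osd", "daemon_id": "0", "version": None},
    {"daemon_type": "osd", "daemon_id": "1", "version": "16.2.4"},
])

for name, ver in stale_daemons(sample, "16.2.12-168-gd5dc66d8"):
    print(name, ver)
```

On the sample this reports only the MDS and OSD daemons, which matches the state the log shows at the job timeout.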
Updated by Laura Flores 10 months ago
Thanks for taking a look Venky!
Based on the log snippet you shared, it might be a duplicate of, or related to, https://tracker.ceph.com/issues/59604.
Updated by Laura Flores 10 months ago
- Related to Bug #59604: upgrade: unkown ceph version causes upgrade to get stuck added