Bug #59529

open

cluster upgrade stuck with OSDs and MDSs not upgraded.

Added by Laura Flores 12 months ago. Updated 10 months ago.

Status: Triaged
Priority: Normal
Assignee:
Category: -
Target version:
% Done: 0%
Source:
Tags:
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2023-04-06_15:37:58-rados-wip-yuri3-testing-2023-04-04-0833-pacific-distro-default-smithi/7234310

2023-04-06T18:20:00.358 INFO:journalctl@ceph.mon.smithi043.smithi043.stdout:Apr 06 18:20:00 smithi043 ceph-f47f993e-d4a4-11ed-9aff-001a4aab830c-mon.smithi043[136038]: debug 2023-04-06T18:19:59.998+0000 7f028daa2700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR 1 filesystem with deprecated feature inline_data; 1 filesystem is offline; 1 filesystem is online with fewer MDS than max_mds

The job runs for a long time and eventually dies.


Related issues (2 open, 0 closed)

Related to CephFS - Bug #59530: mgr-nfs-upgrade: mds.foofs has 0/2 (Triaged, assigned to Venky Shankar)

Related to Orchestrator - Bug #59604: upgrade: unkown ceph version causes upgrade to get stuck (New, assigned to Adam King)

#1

Updated by Venky Shankar 12 months ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Venky Shankar
  • Target version set to v19.0.0
  • Backport changed from pacific to reef,quincy,pacific
#2

Updated by Laura Flores 12 months ago

  • Related to Bug #59530: mgr-nfs-upgrade: mds.foofs has 0/2 added
#3

Updated by Laura Flores 12 months ago

/a/yuriw-2023-04-26_20:20:05-rados-pacific-release-distro-default-smithi/7255328

#4

Updated by Laura Flores 12 months ago

/a/yuriw-2023-04-25_18:56:08-rados-wip-yuri5-testing-2023-04-25-0837-pacific-distro-default-smithi/7252383

#6

Updated by Laura Flores 12 months ago

/a/yuriw-2023-04-26_01:16:19-rados-wip-yuri11-testing-2023-04-25-1605-pacific-distro-default-smithi/7253764

#7

Updated by Laura Flores 11 months ago

/a/yuriw-2023-05-17_19:39:18-rados-wip-yuri5-testing-2023-05-09-1324-pacific-distro-default-smithi/7276764

#8

Updated by Venky Shankar 10 months ago

  • Project changed from CephFS to Orchestrator
  • Subject changed from mds_upgrade_sequence: overall HEALTH_ERR 1 filesystem with deprecated feature inline_data; 1 filesystem is offline; 1 filesystem is online with fewer MDS than max_mds to cluster upgrade stuck with OSDs and MDSs not upgraded.
  • Category deleted (Correctness/Safety)
  • Assignee changed from Venky Shankar to Adam King

Thanks for the bug report, Laura, and apologies for the delay in looking into it (was on PTO for a while).

This seems like a stuck upgrade and not related to CephFS.

The workunit (fsstress) runs alongside the cluster upgrade:

2023-05-17T20:13:18.836 INFO:tasks.workunit:Running workunits matching suites/fsstress.sh on client.1...
2023-05-17T20:13:18.837 INFO:tasks.workunit:Running workunit suites/fsstress.sh...
2023-05-17T20:13:18.838 DEBUG:teuthology.orchestra.run.smithi177:workunit test suites/fsstress.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && cd -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d5dc66d8141bd85d12cfcadf3f36a6dfd7a09823 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="1" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.1 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.1 CEPH_MNT=/home/ubuntu/cephtest/mnt.1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.1/qa/workunits/suites/fsstress.sh

...
...
...

2023-05-17T20:13:21.356 INFO:tasks.workunit:Running workunits matching suites/fsstress.sh on client.0...
2023-05-17T20:13:21.357 INFO:tasks.workunit:Running workunit suites/fsstress.sh...
2023-05-17T20:13:21.357 DEBUG:teuthology.orchestra.run.smithi072:workunit test suites/fsstress.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d5dc66d8141bd85d12cfcadf3f36a6dfd7a09823 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh

The workunit finishes:

2023-05-17T20:16:02.736 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.0...

...
...
...

2023-05-17T20:16:06.492 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.1...

while the cluster upgrade is still ongoing until the max job timeout is hit:

2023-05-18T07:52:44.303 INFO:teuthology.orchestra.run.smithi072.stderr:2023-05-18T07:52:44.299+0000 7fc0ea0c7700  1 -- 172.21.15.72:0/3817295648 --> [v2:172.21.15.72:6800/1564434583,v1:172.21.15.72:6801/1564434583] -- mgr_command(tid 0: {"prefix": "orch ps", "target": ["mon-mgr", ""]}) v1 -- 0x7fc0e41010d0 con 0x7fc0cc060c20
2023-05-18T07:52:44.312 INFO:teuthology.orchestra.run.smithi072.stderr:2023-05-18T07:52:44.308+0000 7fc0cbfff700  1 -- 172.21.15.72:0/3817295648 <== mgr.14590 v2:172.21.15.72:6800/1564434583 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+3388 (secure 0 0 0) 0x7fc0e41010d0 con 0x7fc0cc060c20
2023-05-18T07:52:44.312 INFO:teuthology.orchestra.run.smithi072.stdout:NAME                         HOST       PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID
2023-05-18T07:52:44.312 INFO:teuthology.orchestra.run.smithi072.stdout:alertmanager.smithi072       smithi072  *:9093,9094  running (11h)    11h ago  11h    24.3M        -  0.20.0                 0881eb8f169f  0d5497bd990d
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:crash.smithi072              smithi072               running (11h)    11h ago  11h    7298k        -  16.2.12-168-gd5dc66d8  2435984f4574  d7e40b28d101
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:crash.smithi177              smithi177               running (11h)    11h ago  11h    7306k        -  16.2.12-168-gd5dc66d8  2435984f4574  1c8c105c2019
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:grafana.smithi072            smithi072  *:3000       running (11h)    11h ago  11h    35.8M        -  6.7.4                  557c83e11646  509d233815b4
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi072.ljmwsc  smithi072               running (11h)    11h ago  11h     764M        -  16.2.4                 8d91d370c2b8  ddf349b60c1b
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi072.stbqfp  smithi072               running (11h)    11h ago  11h    15.1M        -  16.2.4                 8d91d370c2b8  10a6a6f5900a
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi177.gtiugp  smithi177               running (11h)    11h ago  11h    6259M        -  16.2.4                 8d91d370c2b8  c8d8f0b9c5d8
2023-05-18T07:52:44.313 INFO:teuthology.orchestra.run.smithi072.stdout:mds.cephfs.smithi177.lumzel  smithi177               running (11h)    11h ago  11h    13.9M        -  16.2.4                 8d91d370c2b8  ae7657c554c8
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mgr.smithi072.wcffso         smithi072  *:8443,9283  running (11h)    11h ago  11h     439M        -  16.2.12-168-gd5dc66d8  2435984f4574  624f26d33589
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mgr.smithi177.twjwnu         smithi177  *:8443,9283  running (11h)    11h ago  11h     392M        -  16.2.12-168-gd5dc66d8  2435984f4574  fd16cfc9584c
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mon.smithi072                smithi072               running (11h)    11h ago  11h    44.4M    2048M  16.2.12-168-gd5dc66d8  2435984f4574  d8effd11e256
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:mon.smithi177                smithi177               running (11h)    11h ago  11h    32.2M    2048M  16.2.12-168-gd5dc66d8  2435984f4574  e55be6f59c54
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:node-exporter.smithi072      smithi072  *:9100       running (11h)    11h ago  11h    18.3M        -  0.18.1                 e5a616e4b9cf  c4415650a7ab
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:node-exporter.smithi177      smithi177  *:9100       running (11h)    11h ago  11h    18.2M        -  0.18.1                 e5a616e4b9cf  34a2077f2cd4
2023-05-18T07:52:44.314 INFO:teuthology.orchestra.run.smithi072.stdout:osd.0                        smithi072               starting               -    -        -    4096M  <unknown>              <unknown>     <unknown>
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.1                        smithi072               running (11h)    11h ago  11h     729M    4096M  16.2.4                 8d91d370c2b8  19c574e54f52
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.2                        smithi072               running (11h)    11h ago  11h     619M    4096M  16.2.4                 8d91d370c2b8  00f7b71cb444
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.3                        smithi177               running (11h)    11h ago  11h     887M    4096M  16.2.4                 8d91d370c2b8  e65ba5fe80fa
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.4                        smithi177               running (11h)    11h ago  11h     744M    4096M  16.2.4                 8d91d370c2b8  2d1dc9dd66c0
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:osd.5                        smithi177               running (11h)    11h ago  11h     713M    4096M  16.2.4                 8d91d370c2b8  1106cd5b3335
2023-05-18T07:52:44.315 INFO:teuthology.orchestra.run.smithi072.stdout:prometheus.smithi072         smithi072  *:9095       running (11h)    11h ago  11h    54.3M        -  2.18.1                 de242295e225  bc39b66dbede

As can be seen, not all daemons are upgraded; the upgrades for the MDSs and OSDs haven't even started.
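One quick way to spot this kind of stalled upgrade is to group daemons by the VERSION column of the `orch ps` output above. The following is a hypothetical helper (not part of Ceph or the test suite) that parses the plain-text table, assuming the column layout seen in this log, where VERSION is the third field from the end:

```python
from collections import defaultdict

def versions_by_daemon(orch_ps_lines):
    """Map reported version -> daemon names from plain-text `ceph orch ps` output.

    Hypothetical helper: assumes the column layout shown in the log above,
    where the last three whitespace-separated fields are VERSION, IMAGE ID,
    and CONTAINER ID. Skips the header row.
    """
    by_version = defaultdict(list)
    for line in orch_ps_lines:
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue
        name, version = fields[0], fields[-3]
        by_version[version].append(name)
    return dict(by_version)

# Abbreviated sample rows taken from the `orch ps` output in this comment.
sample = [
    "NAME  HOST  PORTS  STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID  CONTAINER ID",
    "mds.cephfs.smithi072.ljmwsc  smithi072  running (11h)  11h ago  11h  764M  -  16.2.4  8d91d370c2b8  ddf349b60c1b",
    "mgr.smithi072.wcffso  smithi072  *:8443,9283  running (11h)  11h ago  11h  439M  -  16.2.12-168-gd5dc66d8  2435984f4574  624f26d33589",
    "osd.0  smithi072  starting  -  -  -  4096M  <unknown>  <unknown>  <unknown>",
]
print(versions_by_daemon(sample))
```

A daemon stuck on the old version (16.2.4 here) or reporting `<unknown>` (like osd.0 above) stands out immediately in the grouped result.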

I'm moving this to the Orchestrator component for @adking to take a look.

#9

Updated by Laura Flores 10 months ago

Thanks for taking a look, Venky!

Based on the log snippet you shared, it might be a dupe or related to https://tracker.ceph.com/issues/59604.

#10

Updated by Laura Flores 10 months ago

  • Related to Bug #59604: upgrade: unkown ceph version causes upgrade to get stuck added