Support #55486
cephfs degraded during upgrade from 16.2.5 -> 16.2.6
Description
Hello everyone. I've tried upgrading my Ceph cluster by a point release, following the instructions here: https://docs.ceph.com/en/latest/cephadm/upgrade/
Running `ceph orch upgrade` worked for most of the daemons, but it got stuck on the MDS servers. Here's how far it got after I eventually paused it:
```
→ ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph@sha256:5755c3a5c197ef186b8186212e023565f15b799f1ed411207f2c3fcd4a80ab45",
    "in_progress": true,
    "services_complete": [
        "osd",
        "mgr",
        "mon"
    ],
    "progress": "20/40 daemons upgraded",
    "message": "Upgrade paused"
}
```
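For reference, the upgrade itself was started (and later paused) with the standard cephadm commands, something like:

```
# Kick off the upgrade to the next point release.
ceph orch upgrade start --ceph-version 16.2.6

# Pause it once it stopped making progress on the MDS daemons.
ceph orch upgrade pause
```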
To investigate the issue, I dug deeper and found this error repeating in the output of `ceph -W cephadm`:

```
2022-04-28T13:45:46.650511-0500 mgr.athos6.strdnf [INF] Upgrade: It is NOT safe to stop mds.cephfs.aramis3.uefzus at this time: one or more filesystems is currently degraded
```
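To confirm which MDS daemons were actually left on the old version, the per-daemon versions can be checked with standard commands like:

```
# Per-daemon version summary for the whole cluster.
ceph versions

# List the daemons and what each one is currently running.
ceph orch ps | grep mds
```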
Before upgrading, my cluster was reporting `HEALTH_OK`, but now I'm seeing the following:
```
→ ceph -s
  cluster:
    id:     85361255-4989-4e27-bdb3-e017b9081911
    health: HEALTH_WARN
            1 filesystem is degraded
            1 filesystem has a failed mds daemon
```
with the MDS section reporting

```
mds: 4/5 daemons up (1 failed), 2 standby
```
and the data section as

```
  data:
    volumes: 0/1 healthy, 1 failed
    pools:   12 pools, 377 pgs
    objects: 3.63M objects, 7.6 TiB
    usage:   23 TiB used, 23 TiB / 45 TiB avail
    pgs:     376 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   0 B/s rd, 1.9 MiB/s wr, 3 op/s rd, 177 op/s wr

  progress:
    Upgrade to 16.2.6 (26m)
      [=============...............] (remaining: 29m)
```
Looking at `ceph fs status`, I'm seeing this:

```
→ ceph fs status
cephfs - 1 clients
======
RANK  STATE    MDS                    ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active   cephfs.aramis3.uefzus  Reqs: 0 /s   1313k  1313k  185k   1
 1    active   cephfs.athos5.nyvldi   Reqs: 0 /s   35.1k  34.6k  15.6k  391
 2    active   cephfs.aramis6.nxuuix  Reqs: 0 /s   139k   139k   14.5k  1
 3    active   cephfs.athos6.snzvao   Reqs: 0 /s   21.4k  21.4k  3106   7
 4    failed
POOL             TYPE      USED   AVAIL
cephfs_metadata  metadata  5333M  4677G
cephfs_data      data      3329G  4677G
STANDBY MDS
cephfs.athos4.vazlfc
cephfs.aramis2.lhowjr
VERSION                                                                          DAEMONS
ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)  cephfs.aramis3.uefzus, cephfs.athos5.nyvldi, cephfs.aramis6.nxuuix, cephfs.athos6.snzvao, cephfs.athos4.vazlfc
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)  cephfs.aramis2.lhowjr
```
It appears only one MDS daemon was upgraded, and it has subsequently failed: I can't get it back into rank 4, and the daemon itself isn't reporting any errors.
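In case it helps, the state of the failed rank can be inspected with standard commands like these; nothing beyond the failed state shows up for me:

```
# Full detail on the filesystem-degraded warnings.
ceph health detail

# Dump the FSMap to inspect rank 4 and the standby daemons.
ceph fs dump
```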
After some googling I found this documentation: https://docs.ceph.com/en/pacific/cephfs/upgrading/
This tells me to scale things down to `max_mds = 1`, but when I do so the MDS servers don't respond; no action on the MDS seems to do anything (the commands I tried are sketched below). Any ideas? I'm completely stuck mid-upgrade, and CephFS isn't responding to reconfiguration.
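For reference, this is roughly what I ran following that page, substituting my filesystem name `cephfs`; none of it had any visible effect:

```
# Disable standby-replay before reducing ranks (per the pacific upgrade doc).
ceph fs set cephfs allow_standby_replay false

# Reduce the number of active MDS ranks to 1.
ceph fs set cephfs max_mds 1

# Then wait for the cluster to stop the non-zero ranks, watching:
ceph status
ceph fs status
```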