Support #55486

cephfs degraded during upgrade from 16.2.5 -> 16.2.6

Added by Jesse Roland almost 2 years ago. Updated almost 2 years ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:

Description

Hello everyone. I've tried upgrading my Ceph cluster by a point release, following the instructions here: https://docs.ceph.com/en/latest/cephadm/upgrade/

Running

ceph orch upgrade
worked for most of the daemons, but it got stuck on the MDS servers. Here's how far it got after I eventually paused it:
→ ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph@sha256:5755c3a5c197ef186b8186212e023565f15b799f1ed411207f2c3fcd4a80ab45",
    "in_progress": true,
    "services_complete": [
        "osd",
        "mgr",
        "mon" 
    ],
    "progress": "20/40 daemons upgraded",
    "message": "Upgrade paused" 

To investigate the issue, I dug deeper and found this error repeating in the output of
ceph -W cephadm

2022-04-28T13:45:46.650511-0500 mgr.athos6.strdnf [INF] Upgrade: It is NOT safe to stop mds.cephfs.aramis3.uefzus at this time: one or more filesystems is currently degraded

Before upgrading, my cluster was reporting `HEALTH_OK`, but now I'm seeing the following:

→ ceph -s
  cluster:
    id:     85361255-4989-4e27-bdb3-e017b9081911
    health: HEALTH_WARN
            1 filesystem is degraded
            1 filesystem has a failed mds daemon
 

with MDS reporting
    mds: 4/5 daemons up (1 failed), 2 standby

and data as

  data:
    volumes: 0/1 healthy, 1 failed
    pools:   12 pools, 377 pgs
    objects: 3.63M objects, 7.6 TiB
    usage:   23 TiB used, 23 TiB / 45 TiB avail
    pgs:     376 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   0 B/s rd, 1.9 MiB/s wr, 3 op/s rd, 177 op/s wr

  progress:
    Upgrade to 16.2.6 (26m)
      [=============...............] (remaining: 29m)

Looking at

ceph fs status
I'm seeing this:
 → ceph fs status
cephfs - 1 clients
======
RANK  STATE            MDS              ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs.aramis3.uefzus  Reqs:    0 /s  1313k  1313k   185k     1   
 1    active   cephfs.athos5.nyvldi  Reqs:    0 /s  35.1k  34.6k  15.6k   391   
 2    active  cephfs.aramis6.nxuuix  Reqs:    0 /s   139k   139k  14.5k     1   
 3    active   cephfs.athos6.snzvao  Reqs:    0 /s  21.4k  21.4k  3106      7   
 4    failed                                                                    
      POOL         TYPE     USED  AVAIL  
cephfs_metadata  metadata  5333M  4677G  
  cephfs_data      data    3329G  4677G  
     STANDBY MDS       
 cephfs.athos4.vazlfc  
cephfs.aramis2.lhowjr  
                                    VERSION                                                                                         DAEMONS                                                      
ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)  cephfs.aramis3.uefzus, cephfs.athos5.nyvldi, cephfs.aramis6.nxuuix, cephfs.athos6.snzvao, cephfs.athos4.vazlfc  
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)                                              cephfs.aramis2.lhowjr                                               

It appears only one daemon was upgraded, and it has subsequently failed. I can't get it into the 4th rank, and the daemon itself isn't reporting any errors.

After some googling I found this documentation: https://docs.ceph.com/en/pacific/cephfs/upgrading/

This tells me to scale things down to

max_mds = 1
, but when I do so the MDS servers don't respond. No action on the MDS seems to do anything. Any ideas? I'm completely paralyzed mid-upgrade right now, and CephFS isn't responding to reconfiguration.
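
For reference, the scale-down steps from that document look roughly like this, with cephfs substituted as the file system name (a sketch of the documented procedure):

ceph fs set cephfs allow_standby_replay false   # the doc also recommends disabling standby-replay
ceph fs set cephfs max_mds 1                    # ask the extra ranks (1..3 here) to stop
ceph status                                     # wait for the ranks to finish stopping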

Actions #1

Updated by Neha Ojha almost 2 years ago

  • Project changed from Ceph to CephFS
Actions #2

Updated by Jesse Roland almost 2 years ago

I've managed to fix this, and am posting here to save anyone else from wasting as much time as I did.

After some substantial digging I stumbled across this: https://forum.proxmox.com/threads/ceph-16-2-6-cephfs-failed-after-upgrade-from-16-2-5.97742/

The secret sauce is in these commands:

ceph fs set cephfs max_mds 1
ceph fs set cephfs allow_standby_replay false
ceph fs compat <fs name> add_incompat 7 "mds uses inline data" 

The

compat
command needed to be run, but first I had to take down the MDS cluster. To do that, I ran
ceph fs set cephfs joinable false
and then went through and manually failed each MDS with
ceph mds fail mds.#
. Once that was done, I was able to set the incompat flag.
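
Put together, the teardown sequence looked roughly like this (a sketch; ranks 0-3 are the ones listed in the ceph fs status output above, and ceph mds fail also accepts the daemon name instead of the rank):

ceph fs set cephfs joinable false
ceph mds fail 0
ceph mds fail 1
ceph mds fail 2
ceph mds fail 3
ceph fs compat cephfs add_incompat 7 "mds uses inline data"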

Next up, I was stuck with all of my MDS servers not joining the cluster. To fix this, I had to manually redeploy each MDS daemon to 16.2.6, like this:

ceph orch daemon redeploy mds.cephfs.athos6.snzvao quay.io/ceph/ceph:v16.2.6

This needed to be done for each of the MDS daemons that I had (spelled out below). Each redeployed MDS was able to rejoin the cluster, and once every rank was filled, they started stopping themselves until only one active MDS remained, as I had specified. From there, I ran

ceph orch upgrade resume
and the upgrade finished without any problems.
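
Spelled out against the daemon names from the ceph fs status output above, the redeploy step works out to one command per MDS still on 16.2.5 (verify the current names with ceph orch ps first, since yours will differ):

ceph orch daemon redeploy mds.cephfs.aramis3.uefzus quay.io/ceph/ceph:v16.2.6
ceph orch daemon redeploy mds.cephfs.athos5.nyvldi quay.io/ceph/ceph:v16.2.6
ceph orch daemon redeploy mds.cephfs.aramis6.nxuuix quay.io/ceph/ceph:v16.2.6
ceph orch daemon redeploy mds.cephfs.athos6.snzvao quay.io/ceph/ceph:v16.2.6
ceph orch daemon redeploy mds.cephfs.athos4.vazlfc quay.io/ceph/ceph:v16.2.6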

Note to developers: can we please add the

max_mds = 1
requirement to the official upgrade documentation? A lot of this might have been avoided if I had known to set that configuration before running the upgrade command.

Actions #3

Updated by Venky Shankar almost 2 years ago

  • Assignee set to Venky Shankar
Actions #4

Updated by Venky Shankar almost 2 years ago

  • Status changed from New to In Progress

Hi Jesse,

Do you have the MDS logs from when the file system was reported as damaged? cephadm does set the relevant configs you mention (max_mds, allow_standby_replay), and one does not have to set them manually. It could be that you ran into some bug that caused the MDS to fail, so debug logs would help here.

Cheers,
Venky

Actions #5

Updated by Jesse Roland almost 2 years ago

Venky Shankar wrote:

Hi Jesse,

Do you have the MDS logs from when the file system was reported as damaged? cephadm does set the relevant configs you mention (max_mds, allow_standby_replay), and one does not have to set them manually. It could be that you ran into some bug that caused the MDS to fail, so debug logs would help here.

Cheers,
Venky

No, sorry. During the next minor upgrade, I'll try upgrading without setting the flags and see if it happens again. If it does, I'll try to capture some logs and post them here.
Thanks!
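
For reference, a common way to capture more verbose MDS logs ahead of the next attempt is to raise the MDS debug levels (standard Ceph options; the levels below are just a reasonable starting point, and where the logs end up depends on whether cephadm is logging to files or to journald):

ceph config set mds debug_mds 20
ceph config set mds debug_ms 1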
