Support #55486

cephfs degraded during upgrade from 16.2.5 -> 16.2.6

Added by Jesse Roland about 2 years ago. Updated almost 2 years ago.

Status: In Progress
Priority: Normal
Description

Hello everyone. I've tried upgrading my Ceph cluster by a point release, following the instructions here: https://docs.ceph.com/en/latest/cephadm/upgrade/

Running `ceph orch upgrade` worked for most of the daemons, but it has gotten stuck on the MDS servers. Here's how far it got after I eventually paused it:
→ ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph@sha256:5755c3a5c197ef186b8186212e023565f15b799f1ed411207f2c3fcd4a80ab45",
    "in_progress": true,
    "services_complete": [
        "osd",
        "mgr",
        "mon" 
    ],
    "progress": "20/40 daemons upgraded",
    "message": "Upgrade paused" 

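For reference, the upgrade has been driven with the standard cephadm upgrade commands, roughly as follows (I'm paraphrasing from my shell history, and the start invocation may have used --image with the digest above rather than --ceph-version):

# start the point-release upgrade
ceph orch upgrade start --ceph-version 16.2.6

# check how far it has gotten
ceph orch upgrade status

# pause it once it got stuck on the MDS daemons
ceph orch upgrade pause

# to be resumed later, once the filesystem is healthy again
ceph orch upgrade resume
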
To investigate, I dug deeper and found this error repeating in the output of `ceph -W cephadm`:

2022-04-28T13:45:46.650511-0500 mgr.athos6.strdnf [INF] Upgrade: It is NOT safe to stop mds.cephfs.aramis3.uefzus at this time: one or more filesystems is currently degraded

Before upgrading, my cluster was reading `HEALTH_OK`, but now I'm seeing the following:

→ ceph -s
  cluster:
    id:     85361255-4989-4e27-bdb3-e017b9081911
    health: HEALTH_WARN
            1 filesystem is degraded
            1 filesystem has a failed mds daemon
 

with the MDS line reporting:
    mds: 4/5 daemons up (1 failed), 2 standby

and the data section as:

  data:
    volumes: 0/1 healthy, 1 failed
    pools:   12 pools, 377 pgs
    objects: 3.63M objects, 7.6 TiB
    usage:   23 TiB used, 23 TiB / 45 TiB avail
    pgs:     376 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   0 B/s rd, 1.9 MiB/s wr, 3 op/s rd, 177 op/s wr

  progress:
    Upgrade to 16.2.6 (26m)
      [=============...............] (remaining: 29m)
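
To dig further into the warning I've also been poking at the stock status commands below; they aren't quoted above, so take this as context on where I'm looking rather than new evidence:

# expanded detail on the HEALTH_WARN entries
ceph health detail

# quick MDS map summary (active ranks, standbys, failed)
ceph mds stat

# full MDSMap / filesystem dump, including the failed rank
ceph fs dump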

Looking at `ceph fs status`, I'm seeing this:
 → ceph fs status
cephfs - 1 clients
======
RANK  STATE            MDS              ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  cephfs.aramis3.uefzus  Reqs:    0 /s  1313k  1313k   185k     1   
 1    active   cephfs.athos5.nyvldi  Reqs:    0 /s  35.1k  34.6k  15.6k   391   
 2    active  cephfs.aramis6.nxuuix  Reqs:    0 /s   139k   139k  14.5k     1   
 3    active   cephfs.athos6.snzvao  Reqs:    0 /s  21.4k  21.4k  3106      7   
 4    failed                                                                    
      POOL         TYPE     USED  AVAIL  
cephfs_metadata  metadata  5333M  4677G  
  cephfs_data      data    3329G  4677G  
     STANDBY MDS       
 cephfs.athos4.vazlfc  
cephfs.aramis2.lhowjr  
                                    VERSION                                                                                         DAEMONS                                                      
ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)  cephfs.aramis3.uefzus, cephfs.athos5.nyvldi, cephfs.aramis6.nxuuix, cephfs.athos6.snzvao, cephfs.athos4.vazlfc  
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)                                              cephfs.aramis2.lhowjr                                               

It appears only one daemon was upgraded, and it has subsequently failed. I can't get it into the 4th rank, and the daemon itself isn't reporting any errors.
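
To cross-check the version picture from the orchestrator side I've been using commands along these lines (again, standard Ceph CLI, not output quoted above):

# which MDS daemons cephadm manages, and what image/version each runs
ceph orch ps --daemon-type mds

# version breakdown for every daemon type in the cluster
ceph versions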

After some googling I found this documentation: https://docs.ceph.com/en/pacific/cephfs/upgrading/

This tells me to scale down to `max_mds = 1`, but when I do so the MDS servers don't respond. No action on the MDS seems to do anything. Any ideas? I'm completely paralyzed mid-upgrade, and CephFS isn't responding to reconfiguration.
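
In case the exact commands matter, the scale-down I attempted per that doc page was roughly the following (the allow_standby_replay step comes from the same page and should be a no-op if it's already off):

# disable standby-replay before reducing ranks (per the upgrade doc)
ceph fs set cephfs allow_standby_replay false

# drop to a single active MDS, as the doc recommends for upgrades
ceph fs set cephfs max_mds 1

# then watch for the extra ranks to stop (which never happens for me)
ceph status
ceph fs status cephfs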
