Bug #52134 (closed): botched cephadm upgrade due to mds failures

Added by Jeff Layton over 2 years ago. Updated almost 2 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I tried to upgrade my cephadm cluster from one (development) quincy-ish build to another. It got about halfway through and then stalled:

{
    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:181f5987a0cce563a800d85d7c0b4e3fafc77e43bbff619b02de2cc674b6cd8e",
    "in_progress": true,
    "services_complete": [
        "osd",
        "mon",
        "mgr",
        "crash" 
    ],
    "progress": "14/28 ceph daemons upgraded",
    "message": "Currently upgrading mds daemons" 
}
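
(For context, not something from the report itself: the JSON above is the output of ceph orch upgrade status for an upgrade driven by the cephadm orchestrator. A minimal sketch of the relevant commands, assuming the same image digest quoted above:)

# start the upgrade to a specific container image
ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph@sha256:181f5987a0cce563a800d85d7c0b4e3fafc77e43bbff619b02de2cc674b6cd8e
# poll progress; this prints the JSON shown above
ceph orch upgrade status
# pause, resume, or abandon a stuck upgrade
ceph orch upgrade pause
ceph orch upgrade resume
ceph orch upgrade stop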

[ceph: root@cephadm1 /]# ceph -s
  cluster:
    id:     14753b5a-f1f2-11eb-ac35-52540031ba78
    health: HEALTH_ERR
            2 filesystems are degraded
            2 filesystems have a failed mds daemon
            2 filesystems are offline

  services:
    mon: 3 daemons, quorum cephadm1,cephadm2,cephadm3 (age 7m)
    mgr: cephadm1.bhovmw(active, since 7m), standbys: cephadm2.blwjot
    mds: 0/6 daemons up (6 failed), 8 standby
    osd: 3 osds: 3 up (since 7m), 3 in (since 7m)

  data:
    volumes: 0/2 healthy, 2 failed
    pools:   6 pools, 240 pgs
    objects: 660 objects, 1.6 GiB
    usage:   7.4 GiB used, 2.7 TiB / 2.7 TiB avail
    pgs:     240 active+clean
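
(As an aside, not commands that were run here: when ceph -s shows filesystems degraded or offline like this, two read-only commands give a more detailed breakdown; a small sketch:)

# per-filesystem view of ranks, their states, and available standbys
ceph fs status
# expand the HEALTH_ERR summary into the individual warnings
ceph health detail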

Sage says the issue is that the mons didn't make the standby MDS daemons join the filesystems:

[ceph: root@cephadm1 /]# ceph fs dump
e875
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'test' (1)
fs_name    test
epoch    873
flags    12 joinable allow_snaps allow_multimds_snaps
created    2021-07-31T11:56:27.214094+0000
modified    2021-08-11T14:39:10.237226+0000
tableserver    0
root    0
session_timeout    60
session_autoclose    300
max_file_size    1099511627776
required_client_features    {}
last_failure    0
last_failure_osd_epoch    811
compat    compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds    3
in    0,1,2
up    {}
failed    0,1,2
damaged    
stopped    
data_pools    [3]
metadata_pool    2
inline_data    disabled
balancer    
standby_count_wanted    1

Filesystem 'scratch' (2)
fs_name    scratch
epoch    872
flags    12 joinable allow_snaps allow_multimds_snaps
created    2021-07-31T11:56:32.410538+0000
modified    2021-08-11T14:39:08.876037+0000
tableserver    0
root    0
session_timeout    60
session_autoclose    300
max_file_size    1099511627776
required_client_features    {}
last_failure    0
last_failure_osd_epoch    804
compat    compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds    3
in    0,1,2
up    {}
failed    0,1,2
damaged    
stopped    
data_pools    [5]
metadata_pool    4
inline_data    disabled
balancer    
standby_count_wanted    1

Standby daemons:

[mds.scratch.cephadm2.hwzlvz{-1:184105} state up:standby seq 1 join_fscid=2 addr [v2:192.168.1.82:6800/2744434173,v1:192.168.1.82:6801/2744434173] compat {c=[1],r=[1],i=[1]}]
[mds.test.cephadm3.jttkng{-1:184106} state up:standby seq 1 join_fscid=1 addr [v2:192.168.1.83:6800/201627357,v1:192.168.1.83:6801/201627357] compat {c=[1],r=[1],i=[1]}]
[mds.test.cephadm2.ynwuib{-1:184108} state up:standby seq 1 join_fscid=1 addr [v2:192.168.1.82:6802/232881530,v1:192.168.1.82:6803/232881530] compat {c=[1],r=[1],i=[1]}]
[mds.scratch.cephadm3.cvahkj{-1:184109} state up:standby seq 1 join_fscid=2 addr [v2:192.168.1.83:6802/229802323,v1:192.168.1.83:6803/229802323] compat {c=[1],r=[1],i=[1]}]
[mds.scratch.cephadm1.ciszkl{-1:184125} state up:standby seq 1 join_fscid=2 addr [v2:192.168.1.81:6802/3375472479,v1:192.168.1.81:6803/3375472479] compat {c=[1],r=[1],i=[7ff]}]
[mds.scratch.cephadm1.qiaeri{-1:184131} state up:standby seq 1 join_fscid=2 addr [v2:192.168.1.81:6804/2884825260,v1:192.168.1.81:6805/2884825260] compat {c=[1],r=[1],i=[7ff]}]
[mds.test.cephadm1.aplsnh{-1:184135} state up:standby seq 1 join_fscid=1 addr [v2:192.168.1.81:6800/3866545315,v1:192.168.1.81:6801/3866545315] compat {c=[1],r=[1],i=[1]}]
[mds.test.cephadm1.podojo{-1:184145} state up:standby seq 1 join_fscid=1 addr [v2:192.168.1.81:6814/2025162371,v1:192.168.1.81:6815/2025162371] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 875

I'm blowing away the cluster now as I need the hardware, but there does seem to be a real issue here. I'm also attaching the cephadm_status script output.
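
(An aside, not something that was tried before tearing the cluster down: with up {} empty, all ranks marked failed, the joinable flag set, and standbys carrying matching join_fscid values, the mons would normally promote the standbys on their own. A hedged, purely illustrative sketch of commands to confirm and nudge that state:)

# re-check the fsmap: flags (joinable), up ranks, failed ranks, standbys
ceph fs dump
# make sure both filesystems are allowed to accept standby MDS daemons
ceph fs set test joinable true
ceph fs set scratch joinable true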


Files

cephadm_status.out (45.2 KB) - Jeff Layton, 08/11/2021 03:31 PM
#1 - Updated by Jeff Layton over 2 years ago

[ceph: root@cephadm1 /]# ceph versions
{
    "mon": {
        "ceph version 17.0.0-6869-g2dc3da6f (2dc3da6f3d80504b5c2aa117a614d94de457a1e8) quincy (dev)": 3
    },
    "mgr": {
        "ceph version 17.0.0-6869-g2dc3da6f (2dc3da6f3d80504b5c2aa117a614d94de457a1e8) quincy (dev)": 2
    },
    "osd": {
        "ceph version 17.0.0-6869-g2dc3da6f (2dc3da6f3d80504b5c2aa117a614d94de457a1e8) quincy (dev)": 3
    },
    "mds": {
        "ceph version 17.0.0-6543-g7f522532 (7f5225325376bebf75aecc8551248ba913628577) quincy (dev)": 5,
        "ceph version 17.0.0-6869-g2dc3da6f (2dc3da6f3d80504b5c2aa117a614d94de457a1e8) quincy (dev)": 3
    },
    "overall": {
        "ceph version 17.0.0-6543-g7f522532 (7f5225325376bebf75aecc8551248ba913628577) quincy (dev)": 5,
        "ceph version 17.0.0-6869-g2dc3da6f (2dc3da6f3d80504b5c2aa117a614d94de457a1e8) quincy (dev)": 11
    }
}
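
(For reference, not part of the original note: ceph versions only aggregates counts per version. To see which individual daemons are still on the old build, the orchestrator's process listing can be filtered; a small sketch, assuming the cephadm shell as above:)

# list daemons with the version each one is running, filtered to the MDSes
ceph orch ps --daemon-type mds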
#2 - Updated by Patrick Donnelly over 2 years ago

  • Assignee set to Jeff Layton
#3 - Updated by Patrick Donnelly over 2 years ago

If you hit this again, increase debugging on the mons to debug_mon=20 and let it chew for 30s-1m so we can hopefully see why they're not promoting the standbys.
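
(A minimal sketch of how that could be done; these are standard config/tell invocations added for illustration, not part of the original comment:)

# persist the higher mon debug level in the config database
ceph config set mon debug_mon 20
# or bump it only on the running mons without persisting it
ceph tell 'mon.*' config set debug_mon 20
# after capturing 30s-1m of logs, drop it back to the default
ceph config rm mon debug_mon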

#4 - Updated by Jeff Layton almost 2 years ago

  • Status changed from New to Can't reproduce

Haven't seen this in some time.
