Bug #52134: botched cephadm upgrade due to mds failures
Status: Closed (Can't reproduce)
% Done: 0%
Regression: No
Severity: 3 - minor
Description
I tried to upgrade my cephadm cluster from one (development) quincy-ish build to another. It got about halfway through and then stalled:
{ "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:181f5987a0cce563a800d85d7c0b4e3fafc77e43bbff619b02de2cc674b6cd8e", "in_progress": true, "services_complete": [ "osd", "mon", "mgr", "crash" ], "progress": "14/28 ceph daemons upgraded", "message": "Currently upgrading mds daemons" } [ceph: root@cephadm1 /]# ceph -s cluster: id: 14753b5a-f1f2-11eb-ac35-52540031ba78 health: HEALTH_ERR 2 filesystems are degraded 2 filesystems have a failed mds daemon 2 filesystems are offline services: mon: 3 daemons, quorum cephadm1,cephadm2,cephadm3 (age 7m) mgr: cephadm1.bhovmw(active, since 7m), standbys: cephadm2.blwjot mds: 0/6 daemons up (6 failed), 8 standby osd: 3 osds: 3 up (since 7m), 3 in (since 7m) data: volumes: 0/2 healthy, 2 failed pools: 6 pools, 240 pgs objects: 660 objects, 1.6 GiB usage: 7.4 GiB used, 2.7 TiB / 2.7 TiB avail pgs: 240 active+clean
Sage says the issue is that the mons didn't make the standby MDS daemons join the filesystems:
[ceph: root@cephadm1 /]# ceph fs dump
e875
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'test' (1)
fs_name test
epoch   873
flags   12 joinable allow_snaps allow_multimds_snaps
created 2021-07-31T11:56:27.214094+0000
modified        2021-08-11T14:39:10.237226+0000
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  811
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 3
in      0,1,2
up      {}
failed  0,1,2
damaged
stopped
data_pools      [3]
metadata_pool   2
inline_data     disabled
balancer
standby_count_wanted    1

Filesystem 'scratch' (2)
fs_name scratch
epoch   872
flags   12 joinable allow_snaps allow_multimds_snaps
created 2021-07-31T11:56:32.410538+0000
modified        2021-08-11T14:39:08.876037+0000
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  804
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 3
in      0,1,2
up      {}
failed  0,1,2
damaged
stopped
data_pools      [5]
metadata_pool   4
inline_data     disabled
balancer
standby_count_wanted    1

Standby daemons:

[mds.scratch.cephadm2.hwzlvz{-1:184105} state up:standby seq 1 join_fscid=2 addr [v2:192.168.1.82:6800/2744434173,v1:192.168.1.82:6801/2744434173] compat {c=[1],r=[1],i=[1]}]
[mds.test.cephadm3.jttkng{-1:184106} state up:standby seq 1 join_fscid=1 addr [v2:192.168.1.83:6800/201627357,v1:192.168.1.83:6801/201627357] compat {c=[1],r=[1],i=[1]}]
[mds.test.cephadm2.ynwuib{-1:184108} state up:standby seq 1 join_fscid=1 addr [v2:192.168.1.82:6802/232881530,v1:192.168.1.82:6803/232881530] compat {c=[1],r=[1],i=[1]}]
[mds.scratch.cephadm3.cvahkj{-1:184109} state up:standby seq 1 join_fscid=2 addr [v2:192.168.1.83:6802/229802323,v1:192.168.1.83:6803/229802323] compat {c=[1],r=[1],i=[1]}]
[mds.scratch.cephadm1.ciszkl{-1:184125} state up:standby seq 1 join_fscid=2 addr [v2:192.168.1.81:6802/3375472479,v1:192.168.1.81:6803/3375472479] compat {c=[1],r=[1],i=[7ff]}]
[mds.scratch.cephadm1.qiaeri{-1:184131} state up:standby seq 1 join_fscid=2 addr [v2:192.168.1.81:6804/2884825260,v1:192.168.1.81:6805/2884825260] compat {c=[1],r=[1],i=[7ff]}]
[mds.test.cephadm1.aplsnh{-1:184135} state up:standby seq 1 join_fscid=1 addr [v2:192.168.1.81:6800/3866545315,v1:192.168.1.81:6801/3866545315] compat {c=[1],r=[1],i=[1]}]
[mds.test.cephadm1.podojo{-1:184145} state up:standby seq 1 join_fscid=1 addr [v2:192.168.1.81:6814/2025162371,v1:192.168.1.81:6815/2025162371] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 875
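Note what the dump shows: both filesystems have all three ranks failed (up {}, failed 0,1,2) but are still flagged joinable, and all eight standbys advertise a join_fscid for one of them, so the mons should have been promoting them. A minimal sketch of harmless read-only commands to confirm that state, using only the stock ceph CLI:

    # per-filesystem summary of ranks and standbys
    ceph fs status test
    ceph fs status scratch

    # the flags line should include "joinable"; if it didn't, the mons would
    # (correctly) refuse to promote standbys into the failed ranks
    ceph fs dump | grep -e flags -e join_fscid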
I'm blowing away the cluster now as I need the hardware, but there does seem to be an issue there. Also attaching the cephadm_status script output.
Updated by Jeff Layton over 2 years ago
[ceph: root@cephadm1 /]# ceph versions
{
    "mon": {
        "ceph version 17.0.0-6869-g2dc3da6f (2dc3da6f3d80504b5c2aa117a614d94de457a1e8) quincy (dev)": 3
    },
    "mgr": {
        "ceph version 17.0.0-6869-g2dc3da6f (2dc3da6f3d80504b5c2aa117a614d94de457a1e8) quincy (dev)": 2
    },
    "osd": {
        "ceph version 17.0.0-6869-g2dc3da6f (2dc3da6f3d80504b5c2aa117a614d94de457a1e8) quincy (dev)": 3
    },
    "mds": {
        "ceph version 17.0.0-6543-g7f522532 (7f5225325376bebf75aecc8551248ba913628577) quincy (dev)": 5,
        "ceph version 17.0.0-6869-g2dc3da6f (2dc3da6f3d80504b5c2aa117a614d94de457a1e8) quincy (dev)": 3
    },
    "overall": {
        "ceph version 17.0.0-6543-g7f522532 (7f5225325376bebf75aecc8551248ba913628577) quincy (dev)": 5,
        "ceph version 17.0.0-6869-g2dc3da6f (2dc3da6f3d80504b5c2aa117a614d94de457a1e8) quincy (dev)": 11
    }
}
Updated by Patrick Donnelly over 2 years ago
If you hit this again, increase debugging on the mons to debug_mon=20 and let it chew for 30s-1m so we can hopefully see why they're not promoting the standbys.
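A minimal sketch of doing that on a cephadm cluster, assuming the centralized config database is in use rather than per-host ceph.conf edits:

    # raise mon debug logging cluster-wide
    ceph config set mon debug_mon 20

    # ...let it chew for 30s-1m while the fsmap fails to progress, collect
    # the active mon's log, then drop the override again
    ceph config rm mon debug_mon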
Updated by Jeff Layton almost 2 years ago
- Status changed from New to Can't reproduce
Haven't seen this in some time.