Bug #51866
mds daemon damaged after outage (Closed)
Description
Seen on a containerised test cluster with 3 x MON, 4 x OSD, 2 x MDS.
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
We simulated a complete outage of all three MON instances by rebooting their host servers (one of which also hosts one of the MDS instances). The cluster was deployed with ceph-ansible, so all Ceph daemons are managed by systemd and restart automatically once the servers come back up.
Expected outcome
================
Once the servers reboot fully, ceph cluster returns to a healthy state.
Actual outcome
==============
The cluster fails to recover completely: the MDS rank is marked as damaged.
[qs-admin@newbrunswick1 ~]$ sudo docker exec 696db49641b7 ceph -s
  cluster:
    id:     7a4265b6-605a-4dbc-9eaa-ec5d9ff62c2a
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 3 daemons, quorum newbrunswick0,newbrunswick1,newbrunswick2 (age 5m)
    mgr: newbrunswick0(active, since 4m), standbys: newbrunswick1, newbrunswick2
    mds: cephfs:0/1 2 up:standby, 1 damaged
    osd: 4 osds: 4 up (since 4m), 4 in (since 6d)
    rgw: 8 daemons active (newbrunswick0.pubsub, newbrunswick0.rgw0, newbrunswick1.pubsub, newbrunswick1.rgw0, newbrunswick2.pubsub, newbrunswick2.rgw0, newbrunswick3.pubsub, newbrunswick3.rgw0)

  task status:

  data:
    pools:   14 pools, 165 pgs
    objects: 43.09k objects, 51 MiB
    usage:   4.3 GiB used, 396 GiB / 400 GiB avail
    pgs:     165 active+clean

  io:
    client: 168 KiB/s rd, 2.2 KiB/s wr, 130 op/s rd, 25 op/s wr

[qs-admin@newbrunswick1 ~]$
Both MDS daemons are in standby state.
This appears to be 100% reproducible. I've attached logs from the MDS that was active before the reboots; the reboot happened at Jul 27 09:30.
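For anyone who lands in the same state, a rough sketch of how we inspected and cleared the damaged flag follows. These are standard Ceph CLI commands; the rank name cephfs:0 matches the filesystem shown in the ceph -s output above, and whether marking the rank repaired is safe depends on the actual damage, so treat this as a diagnostic starting point rather than a fix:

# Show the filesystem map, including which rank is marked damaged
ceph fs status
ceph fs dump

# Check the MON/MDS logs for the event that triggered the damaged flag
# before touching anything.

# If the metadata is believed intact (e.g. the damage was a transient
# journal read failure during the outage), the rank can be marked
# repaired so a standby MDS will attempt to take it over:
ceph mds repaired cephfs:0

After `ceph mds repaired`, watch `ceph -s` to confirm one of the standbys transitions through replay to active; if the rank is immediately marked damaged again, the underlying metadata genuinely needs repair rather than a flag reset.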