Bug #51866 (closed): mds daemon damaged after outage

Added by David Piper almost 3 years ago. Updated over 2 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen on a containerised test cluster with 3 x MON, 4 x OSD, 2 x MDS.

ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)

We've simulated a complete outage of all three MON instances by rebooting the host servers (one of which also hosts an MDS instance). The cluster has been deployed using ceph-ansible, so all our ceph daemons are controlled by systemd and are restarted once the servers come back up.
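For reference, a minimal sketch of how the post-reboot state can be verified (the systemd unit pattern is an assumption for a ceph-ansible containerised deployment; the container ID matches the one used below):

systemctl list-units 'ceph-*' --no-pager      # confirm systemd has restarted the ceph units
sudo docker ps --filter name=ceph             # confirm the ceph containers are running again
sudo docker exec 696db49641b7 ceph -s         # check overall cluster health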

Expected outcome
================
Once the servers have rebooted fully, the ceph cluster returns to a healthy state.

Actual outcome
==============
The cluster has failed to recover completely. The MDS is marked as damaged:

[qs-admin@newbrunswick1 ~]$ sudo docker exec 696db49641b7 ceph -s
  cluster:
    id:     7a4265b6-605a-4dbc-9eaa-ec5d9ff62c2a
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 3 daemons, quorum newbrunswick0,newbrunswick1,newbrunswick2 (age 5m)
    mgr: newbrunswick0(active, since 4m), standbys: newbrunswick1, newbrunswick2
    mds: cephfs:0/1 2 up:standby, 1 damaged
    osd: 4 osds: 4 up (since 4m), 4 in (since 6d)
    rgw: 8 daemons active (newbrunswick0.pubsub, newbrunswick0.rgw0, newbrunswick1.pubsub, newbrunswick1.rgw0, newbrunswick2.pubsub, newbrunswick2.rgw0, newbrunswick3.pubsub, newbrunswick3.rgw0)

  task status:

  data:
    pools:   14 pools, 165 pgs
    objects: 43.09k objects, 51 MiB
    usage:   4.3 GiB used, 396 GiB / 400 GiB avail
    pgs:     165 active+clean

  io:
    client:   168 KiB/s rd, 2.2 KiB/s wr, 130 op/s rd, 25 op/s wr

[qs-admin@newbrunswick1 ~]$

Both MDSs are in standby state.
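For anyone trying to reproduce this, the damaged rank and the standby daemons can also be inspected with the standard CLI (a hedged sketch, using the same container ID as above):

sudo docker exec 696db49641b7 ceph health detail      # details behind the "1 mds daemon damaged" error
sudo docker exec 696db49641b7 ceph fs status cephfs   # per-rank state and the standby daemons
sudo docker exec 696db49641b7 ceph fs dump            # full MDSMap, including the damaged rank list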

This seems to be 100% reproducible. I've attached logs from the MDS that was active before the reboots; the reboot was at Jul 27 09:30.


Files

ceph-mds-newbrunswick2.container.log (170 KB) - Container logs from MDS that was active before the outage - David Piper, 07/27/2021 09:48 AM
ceph-mon1.txt (87.5 KB) - David Piper, 07/27/2021 11:23 AM
ceph-mon0.txt (141 KB) - David Piper, 07/27/2021 11:23 AM
ceph-mon2.txt (113 KB) - David Piper, 07/27/2021 11:23 AM
ceph-osd-1.txt (532 KB) - David Piper, 07/27/2021 11:23 AM
ceph-osd-0.txt (407 KB) - David Piper, 07/27/2021 11:23 AM
ceph-osd-2.txt (396 KB) - David Piper, 07/27/2021 11:23 AM
ceph-osd-3.txt (278 KB) - David Piper, 07/27/2021 11:23 AM
start_mds.sh (3.19 KB) - Script for delaying MDS start until clean pgs (see the sketch after this list) - David Piper, 08/25/2021 03:01 PM
rgw-death-log.txt (3.74 KB) - Alex Kershaw, 09/08/2021 01:58 PM
mds.0.txt (469 KB) - Alex Kershaw, 09/08/2021 02:04 PM
mon.0.txt (728 KB) - Alex Kershaw, 09/08/2021 02:04 PM
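For context on the start_mds.sh attachment, a minimal sketch of the "delay MDS start until clean pgs" approach might look like the following. This is hypothetical: the attached script's actual contents may differ, the container ID is a placeholder, the systemd unit name is an assumption for a ceph-ansible deployment, and jq is required.

#!/bin/bash
# Wait until every PG is active+clean before starting the local MDS.
CEPH="sudo docker exec <mon-container-id> ceph"    # placeholder container ID

while true; do
    total=$($CEPH -s -f json | jq '.pgmap.num_pgs')
    clean=$($CEPH -s -f json | jq '[.pgmap.pgs_by_state[] | select(.state_name == "active+clean") | .count] | add // 0')
    if [ "$total" -gt 0 ] && [ "$clean" -eq "$total" ]; then
        break
    fi
    echo "Waiting for PGs to become active+clean ($clean/$total)"
    sleep 10
done

sudo systemctl start "ceph-mds@$(hostname -s)"     # assumed unit name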