Bug #51866
mds daemon damaged after outage (Closed)
Description
Seen on a containerised test cluster with 3 x MON, 4 x OSD, 2 x MDS.
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
We simulated a complete outage of all three MON instances by rebooting their host servers (one of which also hosts one of the MDS instances). The cluster was deployed with ceph-ansible, so all Ceph daemons are managed by systemd and restart automatically once the servers come back up.
Expected outcome
================
Once the servers reboot fully, ceph cluster returns to a healthy state.
Actual outcome
==============
The cluster fails to recover completely: the MDS rank is marked as damaged.
[qs-admin@newbrunswick1 ~]$ sudo docker exec 696db49641b7 ceph -s
  cluster:
    id:     7a4265b6-605a-4dbc-9eaa-ec5d9ff62c2a
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 3 daemons, quorum newbrunswick0,newbrunswick1,newbrunswick2 (age 5m)
    mgr: newbrunswick0(active, since 4m), standbys: newbrunswick1, newbrunswick2
    mds: cephfs:0/1 2 up:standby, 1 damaged
    osd: 4 osds: 4 up (since 4m), 4 in (since 6d)
    rgw: 8 daemons active (newbrunswick0.pubsub, newbrunswick0.rgw0, newbrunswick1.pubsub, newbrunswick1.rgw0, newbrunswick2.pubsub, newbrunswick2.rgw0, newbrunswick3.pubsub, newbrunswick3.rgw0)

  task status:

  data:
    pools:   14 pools, 165 pgs
    objects: 43.09k objects, 51 MiB
    usage:   4.3 GiB used, 396 GiB / 400 GiB avail
    pgs:     165 active+clean

  io:
    client: 168 KiB/s rd, 2.2 KiB/s wr, 130 op/s rd, 25 op/s wr

[qs-admin@newbrunswick1 ~]$
Both MDS daemons are in standby state.
This appears to be 100% reproducible. I've attached logs from the MDS that was active before the reboots; the reboot happened at Jul 27 09:30.
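For anyone who lands in the same state, a rough sketch of how we inspected and cleared the damaged flag follows. These are standard Ceph CLI commands; the rank name cephfs:0 matches the filesystem shown in the ceph -s output above, and whether marking the rank repaired is safe depends on the actual damage, so treat this as a diagnostic starting point rather than a fix:

# Show the filesystem map, including which rank is marked damaged
ceph fs status
ceph fs dump

# Check the MON/MDS logs for the event that triggered the damaged flag
# before touching anything.

# If the metadata is believed intact (e.g. the damage was a transient
# journal read failure during the outage), the rank can be marked
# repaired so a standby MDS will attempt to take it over:
ceph mds repaired cephfs:0

After `ceph mds repaired`, watch `ceph -s` to confirm one of the standbys transitions through replay to active; if the rank is immediately marked damaged again, the underlying metadata genuinely needs repair rather than a flag reset.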