Bug #51866 (closed)

mds daemon damaged after outage

Added by David Piper over 2 years ago. Updated over 2 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen on a containerised test cluster with 3 x MON, 4 x OSD, 2 x MDS.

ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)

We've simulated a complete outage of all three MON instances by rebooting their host servers (one of which also hosts one of the MDS instances). The cluster was deployed using ceph-ansible, so all our Ceph daemons are controlled by systemd and are restarted once the servers are back up.

Expected outcome
================
Once the servers reboot fully, ceph cluster returns to a healthy state.

Actual outcome
==============
The cluster has failed to recover completely. The MDS is marked as damaged:

[qs-admin@newbrunswick1 ~]$ sudo docker exec 696db49641b7 ceph -s
  cluster:
    id:     7a4265b6-605a-4dbc-9eaa-ec5d9ff62c2a
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 3 daemons, quorum newbrunswick0,newbrunswick1,newbrunswick2 (age 5m)
    mgr: newbrunswick0(active, since 4m), standbys: newbrunswick1, newbrunswick2
    mds: cephfs:0/1 2 up:standby, 1 damaged
    osd: 4 osds: 4 up (since 4m), 4 in (since 6d)
    rgw: 8 daemons active (newbrunswick0.pubsub, newbrunswick0.rgw0, newbrunswick1.pubsub, newbrunswick1.rgw0, newbrunswick2.pubsub, newbrunswick2.rgw0, newbrunswick3.pubsub, newbrunswick3.rgw0)

  task status:

  data:
    pools:   14 pools, 165 pgs
    objects: 43.09k objects, 51 MiB
    usage:   4.3 GiB used, 396 GiB / 400 GiB avail
    pgs:     165 active+clean

  io:
    client:   168 KiB/s rd, 2.2 KiB/s wr, 130 op/s rd, 25 op/s wr

[qs-admin@newbrunswick1 ~]$

Both MDSs are in standby state.

This seems to be 100% reproducible. I've attached logs from the MDS that was active before the reboots; the reboot was at Jul 27 09:30.


Files

ceph-mds-newbrunswick2.container.log (170 KB) Container logs from MDS that was active before the outage. David Piper, 07/27/2021 09:48 AM
ceph-mon1.txt (87.5 KB) David Piper, 07/27/2021 11:23 AM
ceph-mon0.txt (141 KB) David Piper, 07/27/2021 11:23 AM
ceph-mon2.txt (113 KB) David Piper, 07/27/2021 11:23 AM
ceph-osd-1.txt (532 KB) David Piper, 07/27/2021 11:23 AM
ceph-osd-0.txt (407 KB) David Piper, 07/27/2021 11:23 AM
ceph-osd-2.txt (396 KB) David Piper, 07/27/2021 11:23 AM
ceph-osd-3.txt (278 KB) David Piper, 07/27/2021 11:23 AM
start_mds.sh (3.19 KB) Script for delaying MDS start until clean pgs. David Piper, 08/25/2021 03:01 PM
rgw-death-log.txt (3.74 KB) Alex Kershaw, 09/08/2021 01:58 PM
mds.0.txt (469 KB) Alex Kershaw, 09/08/2021 02:04 PM
mon.0.txt (728 KB) Alex Kershaw, 09/08/2021 02:04 PM
Actions #1

Updated by Dan van der Ster over 2 years ago


Jul 27 09:29:38 newbrunswick2 container_name/ceph-mds-newbrunswick2[861]: 2021-07-27T09:29:38.505+0000 7f76e1941700  1 mds.0.288 handle_mds_map i am now mds.0.288
Jul 27 09:29:38 newbrunswick2 container_name/ceph-mds-newbrunswick2[861]: 2021-07-27T09:29:38.505+0000 7f76e1941700  1 mds.0.288 handle_mds_map state change up:boot --> up:replay
Jul 27 09:29:38 newbrunswick2 container_name/ceph-mds-newbrunswick2[861]: 2021-07-27T09:29:38.505+0000 7f76e1941700  1 mds.0.288 replay_start
Jul 27 09:29:38 newbrunswick2 container_name/ceph-mds-newbrunswick2[861]: 2021-07-27T09:29:38.505+0000 7f76e1941700  1 mds.0.288  waiting for osdmap 87650 (which blacklists prior instance)
Jul 27 09:30:08 newbrunswick2 container_name/ceph-mds-newbrunswick2[861]: 2021-07-27T09:30:08.590+0000 7f76db134700 -1 mds.0.inotable: load_2 could not read table: -110
Jul 27 09:30:08 newbrunswick2 container_name/ceph-mds-newbrunswick2[861]: 2021-07-27T09:30:08.590+0000 7f76db134700 -1 log_channel(cluster) log [ERR] : error reading table object 'mds0_inotable' -110 ((110) Connection timed out)
Jul 27 09:30:10 newbrunswick2 container_name/ceph-mds-newbrunswick2[861]: 2021-07-27T09:30:10.858+0000 7f76db134700  1 mds.newbrunswick2 respawn!
...
Jul 27 09:30:11 newbrunswick2 container_name/ceph-mds-newbrunswick2[861]:   -13> 2021-07-27T09:30:08.590+0000 7f76db134700 -1 mds.0.inotable: load_2 could not read table: -110
Jul 27 09:30:11 newbrunswick2 container_name/ceph-mds-newbrunswick2[861]:   -12> 2021-07-27T09:30:08.590+0000 7f76db134700 -1 log_channel(cluster) log [ERR] : error reading table object 'mds0_inotable' -110 ((110) Connection timed out)
Jul 27 09:30:11 newbrunswick2 container_name/ceph-mds-newbrunswick2[861]:   -11> 2021-07-27T09:30:08.590+0000 7f76db134700  5 mds.beacon.newbrunswick2 set_want_state: up:replay -> down:damaged

Do you have the cluster, mon, or osd logs from this same time? Maybe the relevant PGs are inactive, so the read of mds0_inotable times out.

Since this is reproducible, you could try delaying the MDS restart until after all PGs are active and see if it "fixes" this.
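Something as simple as gating the MDS restart on the PG_AVAILABILITY health check clearing would be enough to test this; a rough sketch (the systemd unit name is an assumption for a containerised ceph-ansible deploy):

```
# crude gate: loop while the PG_AVAILABILITY health check is still raised,
# then kick the MDS unit
while ceph health detail | grep -q PG_AVAILABILITY; do
    sleep 5
done
systemctl start ceph-mds@newbrunswick2
```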

Actions #2

Updated by David Piper over 2 years ago

MON / OSD logs attached, covering 09:25:00 to 09:31:00.

I'll try the MDS delay and get back to you.

Actions #3

Updated by Dan van der Ster over 2 years ago

The PGs are not active until 09:30:14, so that's evidence for my theory.
This looks like a bootstrapping issue arising when MDSs are co-located with OSDs.

Jul 27 09:30:00 newbrunswick0 docker[1244]: debug 2021-07-27T09:29:59.999+0000 7fb5dd9ee700  0 log_channel(cluster) log [WRN] : [WRN] PG_AVAILABILITY: Reduced data availability: 39 pgs inactive, 84 pgs peering
Jul 27 09:30:14 newbrunswick0 docker[1244]: debug 2021-07-27T09:30:14.111+0000 7fb5dd9ee700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 3 pgs inactive, 49 pgs peering)
Jul 27 09:30:14 newbrunswick0 docker[1244]: debug 2021-07-27T09:30:14.111+0000 7fb5dd9ee700  0 log_channel(cluster) log [INF] : Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 30/129252 objects degraded (0.023%), 2 pgs degraded)

You can also try `ceph mds repaired 0` when the fs is in damaged state, after all PGs are active+clean.
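Roughly, once all PGs are back to active+clean (rank and filesystem name taken from the `ceph -s` output above):

```
# clear the damaged flag on rank 0
ceph mds repaired 0

# a standby should then take the rank through up:replay to up:active; verify with
ceph fs status cephfs
ceph -s
```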

Actions #4

Updated by David Piper over 2 years ago

Thanks Dan, that's a convincing diagnosis. We've diverted our attention to other projects, but I'll look into implementing some smarter restart logic to delay the MDS restart.

Should Ceph be smart enough to handle this natively? i.e. should the MDS start in a holding state until all PGs are active, or recover automatically once the PGs return to active?

We've been using `ceph mds repaired 0` to return to a healthy state, thanks.

Actions #5

Updated by Neha Ojha over 2 years ago

  • Project changed from Ceph to CephFS
Actions #6

Updated by David Piper over 2 years ago

We've delayed MDS restarts with a script that waits for `active+clean` PGs first. The script is attached; it borrows heavily from this script in the ceph-ansible project: https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-handler/templates/restart_osd_daemon.sh.j2
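A minimal sketch of the idea (the retry counts, the JSON parsing, and the final ceph-mds invocation below are illustrative placeholders rather than the attached script's exact contents):

```
#!/usr/bin/env bash
# Sketch only: hold off MDS start-up until every PG reports active+clean.
# Assumes the `ceph` CLI works from this host (e.g. wrapped in `docker exec`).
set -euo pipefail

RETRIES=40   # placeholder values
DELAY=15

all_pgs_active_clean() {
    # Compare the total PG count against the active+clean count from `ceph -s -f json`.
    ceph -s -f json | python3 -c '
import json, sys
pgmap = json.load(sys.stdin)["pgmap"]
clean = sum(s["count"] for s in pgmap.get("pgs_by_state", [])
            if s["state_name"] == "active+clean")
sys.exit(0 if clean == pgmap["num_pgs"] else 1)
'
}

for ((i = 0; i < RETRIES; i++)); do
    if all_pgs_active_clean; then
        break
    fi
    sleep "$DELAY"
done

# Hand over to the real daemon once we stop waiting (placeholder invocation).
exec /usr/bin/ceph-mds -f --id "$(hostname -s)"
```

Waiting for `active+clean` rather than just `active` is the conservative choice; whether plain `active` would be enough is the question below.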

Initial testing looks good but we're still putting this through its paces.

Do the PGs need to be `active+clean`, or are there other PG states during which it would be safe to start the MDS? e.g. combinations of undersized / degraded / peered etc.

Actions #7

Updated by Dan van der Ster over 2 years ago

David Piper wrote:

Do the PGs need to be `active+clean`, or are there other PG states during which it would be safe to start the MDS? e.g. combinations of undersized / degraded / peered etc.

All PGs being `active` is sufficient. If you start the MDS while metadata PGs are active+degraded/undersized, the IOs behind this issue should still succeed, but I suppose there is a chance that an IO takes longer than the timeout. If you want to be 100% sure of success, wait for `active+clean` before starting the MDS.

BTW, I was looking for, but haven't yet found, the rados client IO timeout that seems to be at the root of this issue. Once found, I suppose an alternative solution would be to set it to something large, e.g. 300s. But I'm unfamiliar with this area -- normally rados IOs are never supposed to time out. The MDS must be using some intentionally time-limited IO when bootstrapping. Someone more familiar could perhaps clarify how this works.

Actions #8

Updated by Alex Kershaw over 2 years ago

Hi, I'm working with David and am now looking at this issue. Unfortunately it seems the fix hasn't worked.

I've just had this reproduce when delaying MDS start-up until all PGs are active and clean. I've attached some logs from this. I can't see anything in them that's an obvious giveaway as to what is going on, so I'm hoping I'm missing something. Happy to provide any other logs that may be helpful.

I've also observed that on reboot of the VMs the RGW daemons typically restart after around a minute, which also seems to be caused by RADOS client timeouts. This happens for both the RGW and pubsub processes. Typically they restart once and are then happy, but on occasion I've seen them get stuck in a loop, constantly hitting timeouts and restarting every minute or so; restarting the OSDs seems to recover them in this case. I've attached a log excerpt showing this as well.

AIUI we don't know what the cause of the timeouts is, but we don't observe this on a Nautilus deployment. I'm not familiar enough with the internals of Ceph to know where to look to investigate, so any advice is appreciated.

Thanks

Actions #9

Updated by Alex Kershaw over 2 years ago

I should mention, I've only had this reproduce once after we added the fix, when previously it was 100% reproducible, so it certainly seems to be improved after adding the delay.

Actions #10

Updated by Alex Kershaw over 2 years ago

Hi again - another update on this one.

We deploy on both OpenStack- and VMware-hosted VMs, with the intention that the resulting VMs are functionally identical. However, I can't reproduce this on a VMware-hosted cluster, so I'm investigating what the difference between our deployment types might be.

I still think that the RADOS client timeouts are the culprit, with the MDS damage being a symptom of that problem - as I mentioned earlier the RGW restarts I experienced also log that they hit RADOS client timeouts.

I'm expecting I'll eventually uncover something amiss with our OpenStack VM bootstrapping that is causing this, but I think there may still be an underlying Ceph MDS bug here, in that a RADOS client timeout leaves the cluster in a state where the filesystem is offline and can't auto-recover.

Meanwhile any advice on how to debug the RADOS client timeouts is greatly appreciated. I'll report back if I find the cause. Thanks.

Actions #11

Updated by Alex Kershaw over 2 years ago

...and I've just found that we're globally setting:

```
rados_osd_op_timeout: 30
rados_mon_op_timeout: 30
```

As per https://tracker.ceph.com/issues/7386, setting these seems to be off-piste. I removed the config and can't reproduce the original issue.

So I guess this isn't a bug after all, just bad config on our side. I'm not sure why we've got away with this until now, or why it only happens on OpenStack deployments, but it's probably not interesting to anyone else at this point!
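For completeness, restoring the librados defaults is just a config change; a sketch (assuming the values were injected via ceph.conf / ceph-ansible's `ceph_conf_overrides`, in which case the client daemons need a restart to pick the change up):

```
# ceph.conf -- sketch, not our exact config: either delete the two lines
# entirely or set them back to the default of 0, which means "never time out".
[global]
rados_osd_op_timeout = 0
rados_mon_op_timeout = 0
```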

Actions #12

Updated by Dan van der Ster over 2 years ago

  • Status changed from New to Can't reproduce

rados_osd_op_timeout: 30

yeah that'll do it. thanks for clarifying the root cause!
