Bug #65094
mds: STATE_STARTING won't add the root inode for the root rank, and failures at STATE_STARTING are not handled correctly
Description
The root rank doesn't add the root inode to its subtree auth when it enters STATE_STARTING,
and it doesn't handle the case correctly when the MDS fails or is stopped while in STATE_STARTING.
This will cause rank damage or a ceph_assert failure when an MDS failover/switchover happens later.
The following are related logs:
a.
-1 log_channel(cluster) log [ERR] : No subtrees found for root MDS rank!
b.
-15> 2024-03-24T18:06:19.461+0800 7f1542cbf700 0 mds.0.journal EMetaBlob.replay missing dir ino 0x10000000000
-14> 2024-03-24T18:06:19.461+0800 7f1542cbf700 -1 log_channel(cluster) log [ERR] : failure replaying journal (EMetaBlob)
c.
-6> 2024-03-24T19:39:59.593+0800 7f5903f02700 -1 log_channel(cluster) log [ERR] : replayed ESubtreeMap at 4209845 subtree root 0x1 not in cache
Updated by ethan wu about 1 month ago
pull request: https://github.com/ceph/ceph/pull/56429
Updated by Venky Shankar about 1 month ago
- Status changed from New to Fix Under Review
- Backport set to quincy,reef,squid
- Pull request ID set to 56429
Updated by ethan wu about 1 month ago
In my CephFS environment, I got an MDS replay failure log:
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EMetaBlob.replay request client.4317:5 trim_to 5
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.log _replay 4217639~3155 / 4225716 2024-03-25T20:47:35.334606+0800: EUpdate unlink_local [metablob 0x10000000000, 4 dirs]
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EUpdate::replay
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EMetaBlob.replay 4 dirlumps by unknown.0
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EMetaBlob.replay don't have renamed ino 0x10000000003
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EMetaBlob.replay found null dentry in dir 0x10000000001
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EMetaBlob.replay dir 0x10000000000
2024-03-25T20:47:49.630+0800 7fa83bbe8700 0 mds.0.journal EMetaBlob.replay missing dir ino 0x10000000000
2024-03-25T20:47:49.630+0800 7fa83bbe8700 -1 log_channel(cluster) log [ERR] : failure replaying journal (EMetaBlob)
2024-03-25T20:47:49.630+0800 7fa83bbe8700 5 mds.beacon.b set_want_state: up:replay -> down:damaged
After investigating, I found that it is related to the MDS STARTING state:
STATE_STARTING doesn't add ino 0x1 to the root rank's subtrees, so all inodes under 0x1 get trimmed by
try_trim_nonauth_subtree.
Steps to reproduce:
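The effect can be illustrated with a toy Python model (this is not Ceph code; the parent layout and helper names are hypothetical, only the inode numbers come from the logs above): an inode survives a non-auth trim only if some ancestor is a registered auth subtree root, so a starting root rank that never registers 0x1 trims its whole hierarchy, and a later EMetaBlob replay cannot find dir 0x10000000000.

```python
# Toy model (NOT Ceph code): why omitting the root inode from the
# subtree-auth set lets a try_trim_nonauth_subtree-style pass drop
# the whole cache. trim_nonauth() is a hypothetical stand-in.

def trim_nonauth(cache, auth_roots, parent):
    """Keep only inodes that sit under some registered auth subtree root."""
    def is_auth(ino):
        while ino is not None:
            if ino in auth_roots:
                return True
            ino = parent.get(ino)  # walk up toward the root
        return False
    return {ino for ino in cache if is_auth(ino)}

ROOT = 0x1
DIR = 0x10000000000              # a directory under the root (from the logs)
parent = {DIR: ROOT, ROOT: None}
cache = {ROOT, DIR}

# Buggy STATE_STARTING: root rank never records 0x1 as an auth subtree,
# so everything is trimmed and a later replay misses dir 0x10000000000.
assert trim_nonauth(cache, auth_roots=set(), parent=parent) == set()

# With 0x1 registered, the hierarchy survives the trim.
assert trim_nonauth(cache, auth_roots={ROOT}, parent=parent) == {ROOT, DIR}
```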
1. Use vstart.sh to create a CephFS (but turn off mds_debug_subtrees).
2. mount cephfs
3. mkdir -p ${cephfs_root}/dir1/dir11/foo; mkdir -p ${cephfs_root}/dir1/dir11/bar
4. umount cephfs
5. ./bin/ceph fs set a down true # wait for all mds stop
6. ./bin/ceph fs set a down false
7. mount cephfs
8. rmdir ${cephfs_root}/dir1/dir11/foo; rmdir ${cephfs_root}/dir1/dir11/bar
9. umount cephfs
10. kill rank 0 mds and trigger failover
11. ./bin/ceph fs dump # rank 0 is marked damaged
While fixing the issue, I also found bugs in the error handling of STATE_STARTING:
1. The take-over MDS won't enter STATE_STARTING again when the MDS fails before STATE_STARTING finishes.
2. Even when the MDS finishes STATE_STARTING and requests STATE_ACTIVE, the MDS log created during STATE_STARTING doesn't get flushed.
The take-over MDS will then fail during replay on the assertion that the subtree map must not be empty:
-1 log_channel(cluster) log [ERR] : No subtrees found for root MDS rank!
because the subtree map log entry was never flushed.
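The missing-flush hazard in point 2 can be sketched with another toy Python model (again not Ceph code; ToyJournal and its methods are hypothetical illustrations): a journal entry is only visible to a take-over MDS once it is durable, so an ESubtreeMap written during STATE_STARTING but never flushed is invisible to the replaying MDS.

```python
# Toy model (NOT Ceph code) of the missing-flush hazard: an entry
# written during STATE_STARTING is only replayable after a flush.

class ToyJournal:
    def __init__(self):
        self.pending = []   # written in memory, not yet durable
        self.durable = []   # what a replaying take-over MDS will see

    def write(self, entry):
        self.pending.append(entry)

    def flush(self):
        self.durable.extend(self.pending)
        self.pending.clear()

j = ToyJournal()
j.write("ESubtreeMap")                 # created during STATE_STARTING ...
# ... MDS requests STATE_ACTIVE without flushing, then fails over:
assert "ESubtreeMap" not in j.durable  # replay finds no subtree map

j.flush()                              # flushing before leaving STARTING
assert j.durable == ["ESubtreeMap"]    # now replay can see the entry
```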
Updated by Patrick Donnelly about 1 month ago
- Category set to Correctness/Safety
- Assignee set to ethan wu
- Target version set to v20.0.0
- Source set to Community (dev)