Bug #65094
mds: STATE_STARTING won't add the root inode for the root rank, and failures at STATE_STARTING are not handled correctly
Description
The root rank doesn't add the root inode to its subtree auth when it enters STATE_STARTING,
and it doesn't handle the case correctly when the MDS fails or is stopped while in STATE_STARTING.
This will cause rank damage or a ceph_assert failure when an MDS failover/switchover happens later.
The following are related logs:
a.
-1 log_channel(cluster) log [ERR] : No subtrees found for root MDS rank!
b.
-15> 2024-03-24T18:06:19.461+0800 7f1542cbf700 0 mds.0.journal EMetaBlob.replay missing dir ino 0x10000000000
-14> 2024-03-24T18:06:19.461+0800 7f1542cbf700 -1 log_channel(cluster) log [ERR] : failure replaying journal (EMetaBlob)
c.
-6> 2024-03-24T19:39:59.593+0800 7f5903f02700 -1 log_channel(cluster) log [ERR] : replayed ESubtreeMap at 4209845 subtree root 0x1 not in cache
Updated by ethan wu about 1 month ago
pull request: https://github.com/ceph/ceph/pull/56429
Updated by Venky Shankar about 1 month ago
- Status changed from New to Fix Under Review
- Backport set to quincy,reef,squid
- Pull request ID set to 56429
Updated by ethan wu about 1 month ago
In my CephFS environment, I got an MDS replay failure log:
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EMetaBlob.replay request client.4317:5 trim_to 5
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.log _replay 4217639~3155 / 4225716 2024-03-25T20:47:35.334606+0800: EUpdate unlink_local [metablob 0x10000000000, 4 dirs]
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EUpdate::replay
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EMetaBlob.replay 4 dirlumps by unknown.0
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EMetaBlob.replay don't have renamed ino 0x10000000003
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EMetaBlob.replay found null dentry in dir 0x10000000001
2024-03-25T20:47:49.630+0800 7fa83bbe8700 10 mds.0.journal EMetaBlob.replay dir 0x10000000000
2024-03-25T20:47:49.630+0800 7fa83bbe8700 0 mds.0.journal EMetaBlob.replay missing dir ino 0x10000000000
2024-03-25T20:47:49.630+0800 7fa83bbe8700 -1 log_channel(cluster) log [ERR] : failure replaying journal (EMetaBlob)
2024-03-25T20:47:49.630+0800 7fa83bbe8700 5 mds.beacon.b set_want_state: up:replay -> down:damaged
After investigating, I found that it is related to the MDS STARTING state:
STATE_STARTING doesn't add ino 0x1 to the root rank's subtrees, so all inodes under 0x1 get trimmed by
try_trim_nonauth_subtree.
Steps to reproduce:
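The effect can be illustrated with a toy Python model (this is not Ceph code; the parent layout and helper names are hypothetical, only the inode numbers come from the logs above): an inode survives a non-auth trim only if some ancestor is a registered auth subtree root, so a starting root rank that never registers 0x1 trims its whole hierarchy, and a later EMetaBlob replay cannot find dir 0x10000000000.

```python
# Toy model (NOT Ceph code): why omitting the root inode from the
# subtree-auth set lets a try_trim_nonauth_subtree-style pass drop
# the whole cache. trim_nonauth() is a hypothetical stand-in.

def trim_nonauth(cache, auth_roots, parent):
    """Keep only inodes that sit under some registered auth subtree root."""
    def is_auth(ino):
        while ino is not None:
            if ino in auth_roots:
                return True
            ino = parent.get(ino)  # walk up toward the root
        return False
    return {ino for ino in cache if is_auth(ino)}

ROOT = 0x1
DIR = 0x10000000000              # a directory under the root (from the logs)
parent = {DIR: ROOT, ROOT: None}
cache = {ROOT, DIR}

# Buggy STATE_STARTING: root rank never records 0x1 as an auth subtree,
# so everything is trimmed and a later replay misses dir 0x10000000000.
assert trim_nonauth(cache, auth_roots=set(), parent=parent) == set()

# With 0x1 registered, the hierarchy survives the trim.
assert trim_nonauth(cache, auth_roots={ROOT}, parent=parent) == {ROOT, DIR}
```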
1. Use vstart.sh to create a CephFS (but turn off mds_debug_subtrees).
2. mount cephfs
3. mkdir -p ${cephfs_root}/dir1/dir11/foo; mkdir -p ${cephfs_root}/dir1/dir11/bar
4. umount cephfs
5. ./bin/ceph fs set a down true # wait for all mds stop
6. ./bin/ceph fs set a down false
7. mount cephfs
8. rmdir ${cephfs_root}/dir1/dir11/foo; rmdir ${cephfs_root}/dir1/dir11/bar
9. umount cephfs
10. kill rank 0 mds and trigger failover
11. ./bin/ceph fs dump # rank 0 is marked damaged
While fixing the issue, I also found bugs in the error handling of STATE_STARTING:
1. The take-over MDS won't enter STATE_STARTING again when the MDS fails before STATE_STARTING finishes.
2. Even when the MDS finishes STATE_STARTING and requests STATE_ACTIVE, the MDS log created during STATE_STARTING doesn't get flushed.
The take-over MDS will then fail during replay on the assertion that the subtree map must not be empty:
-1 log_channel(cluster) log [ERR] : No subtrees found for root MDS rank!
because the subtree map log entry was never flushed.
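The missing-flush hazard in point 2 can be sketched with another toy Python model (again not Ceph code; ToyJournal and its methods are hypothetical illustrations): a journal entry is only visible to a take-over MDS once it is durable, so an ESubtreeMap written during STATE_STARTING but never flushed is invisible to the replaying MDS.

```python
# Toy model (NOT Ceph code) of the missing-flush hazard: an entry
# written during STATE_STARTING is only replayable after a flush.

class ToyJournal:
    def __init__(self):
        self.pending = []   # written in memory, not yet durable
        self.durable = []   # what a replaying take-over MDS will see

    def write(self, entry):
        self.pending.append(entry)

    def flush(self):
        self.durable.extend(self.pending)
        self.pending.clear()

j = ToyJournal()
j.write("ESubtreeMap")                 # created during STATE_STARTING ...
# ... MDS requests STATE_ACTIVE without flushing, then fails over:
assert "ESubtreeMap" not in j.durable  # replay finds no subtree map

j.flush()                              # flushing before leaving STARTING
assert j.durable == ["ESubtreeMap"]    # now replay can see the entry
```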
Updated by Patrick Donnelly about 1 month ago
- Category set to Correctness/Safety
- Assignee set to ethan wu
- Target version set to v20.0.0
- Source set to Community (dev)