Bug #40288

mds: lost mds journal when hot-standby mds switch occurs

Added by Ivan Guan about 1 month ago. Updated about 1 month ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 06/12/2019
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:

Description

ceph version: jewel 10.2.2
mds mode: hot-standby

There is a risk that the mds loses some events because it wakes up waiters whose journal entries have not yet been flushed to disk. This caused at least three errors in my environment:
1. “dir not empty” is reported when deleting a directory
2. “no such file or directory” is reported when creating a file
3. ls lists no files or directories, but “file exist” is reported when creating a directory that existed before
An example is shown in the attached mds_journal.png.

The case above may be caused by the following time series:
t1: the journaler starts flushing event 99
t2: the journaler starts flushing event 100
t3: event 101 and event 102 are appended to write_buf, and next_safe_pos moves to write_pos_101
t4: the flush-finish callback for event 100 runs and sets safe_pos = next_safe_pos_102 because pending_safe is empty
t5: the journaler wakes up the waiters of event 99, event 100 and event 101 and responds to the client (note: event 99 and event 101 have not been flushed to disk)
t6: a hot-standby mds switch occurs, and event 101 and event 99 are lost
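
The time series hinges on how safe_pos advances when flushes complete out of order. The following is a minimal toy sketch in Python, not Ceph's Journaler code: the class ToyJournaler, its fields, the fixed event length of 10, and the rule "safe_pos may only advance to the start of the oldest in-flight flush" are all assumptions made for illustration. Under that assumed rule, finishing the flush of event 100 before event 99 wakes nobody, because the flush of event 99 is still tracked as in flight.

```python
class ToyJournaler:
    """Toy model of journal flush bookkeeping (hypothetical, not Ceph code)."""

    def __init__(self):
        self.write_pos = 0   # end of all data appended so far
        self.flush_pos = 0   # end of all data handed to the disk
        self.safe_pos = 0    # data before this offset is treated as durable
        self.in_flight = {}  # flush start offset -> flush end offset
        self.waiters = []    # (end offset of event, event id)
        self.woken = []      # ids of events whose waiters were woken

    def append(self, event_id, length):
        # Buffer an event and register a waiter for its end offset.
        self.write_pos += length
        self.waiters.append((self.write_pos, event_id))

    def start_flush(self):
        # Hand everything buffered so far to the disk as one flush.
        start = self.flush_pos
        self.in_flight[start] = self.write_pos
        self.flush_pos = self.write_pos
        return start

    def finish_flush(self, start):
        # Completion callback; flushes may complete in any order.
        del self.in_flight[start]
        if self.in_flight:
            # Assumed rule: never move past the oldest unfinished flush.
            self.safe_pos = min(self.in_flight)
        else:
            self.safe_pos = self.flush_pos
        still_waiting = []
        for end, event_id in self.waiters:
            if end <= self.safe_pos:
                self.woken.append(event_id)
            else:
                still_waiting.append((end, event_id))
        self.waiters = still_waiting


def replay():
    """Replay t1..t4 above, then let the flush of event 99 finish."""
    j = ToyJournaler()
    j.append(99, 10)
    s99 = j.start_flush()        # t1
    j.append(100, 10)
    s100 = j.start_flush()       # t2
    j.append(101, 10)            # t3: buffered only, not flushed
    j.append(102, 10)
    j.finish_flush(s100)         # t4: event 100 finishes first
    woken_after_100 = list(j.woken)
    j.finish_flush(s99)          # event 99 finishes later
    return woken_after_100, list(j.woken)


print(replay())  # -> ([], [99, 100])
```

In this sketch the early wakeup described at t5 only happens if the set of in-flight flushes is empty while the flush of event 99 is still outstanding, so that is the condition to check in the real code.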

If the lost event is “unlink_local”, the client mistakenly receives a response saying the file was unlinked successfully. So when the client finishes unlinking the last file, it runs rmdir on the parent dir, but the mds reports “dir not empty” because the file still exists on the mds side.
If the lost event is “mkdir” and the client creates a file under that dir, the mds reports “no such file or directory” because the directory was never created successfully in the mds. In addition to the problems above, it can also cause errors in the dir fnode statistics, which lead to further problems.

mds_journal.png (46.1 KB) Ivan Guan, 06/12/2019 02:44 AM

History

#1 Updated by Ivan Guan about 1 month ago

Sorry, there doesn't seem to be any problem; it was my misunderstanding. Please close this issue, thank you!

#2 Updated by Greg Farnum about 1 month ago

  • Project changed from Ceph to fs
  • Status changed from New to Closed
