Bug #45024: mds: wrong link count under certain circumstance - CephFS - Ceph

Bug #45024

closed

mds: wrong link count under certain circumstance

Added by Xinying Song about 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee: Xinying Song
Target version: Ceph - v16.0.0
Backport: octopus,nautilus
Severity: 3 - minor
Component(FS): MDS
Pull request ID: 34507

Description

I am simulating a failure with two active MDS daemons: while a hard link is being created across the two MDS ranks, both MDS daemons and the client crash. Once all MDS daemons and the client are up again, the hard link request should have either succeeded or been rolled back. In my simulation, neither happens. The crash is supposed to occur in a scenario like this: the slave MDS has sent slave_prep_ack to the master and trimmed the corresponding log event (this log should not have been trimmed, but it was), and then it crashes. At the same time, the slave_prep_ack message to the master MDS is somehow lost or delayed, and the master crashes as well. To get the lost-message effect, I modified the source code so that Server::handle_slave_link_prep_ack() returns immediately at its beginning.

With the MDS build that drops the prep ack, the bug can be reproduced with the following steps:
1. Set the configs so that the journal is flushed as soon as possible: mds_log_events_per_segment = 1; mds_log_max_events = 1
2. Run `ceph-fuse /mnt` on a separate machine.
3. Pin the directories to different MDS ranks, then create the file and the hard link. Run:

mkdir /mnt/mds0
mkdir /mnt/mds1
setfattr -n ceph.dir.pin -v 0 /mnt/mds0
setfattr -n ceph.dir.pin -v 1 /mnt/mds1
touch /mnt/mds0/0
ln /mnt/mds0/0 /mnt/mds1/0-link   # this command will hang

4. Wait until rank 0 has flushed all of its log. This can be judged by grepping the MDS log for 'try_to_expire success'. In my environment, waiting 1 minute is enough.
5. Reboot the client machine.
6. Restart both MDS daemons and wait until both are active.
7. Mount on the client and check the output of `ls -l /mnt/mds0` and `ls -l /mnt/mds1`. /mnt/mds0/0 has a link count of 2, but there is nothing under the /mnt/mds1 directory.
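
The wait in step 4 could be scripted rather than timed by hand. The following is a minimal sketch, not part of the original report: the function name `wait_for_marker`, the log path in the usage line, and the timeout are all assumptions; only the marker string 'try_to_expire success' comes from the steps above. Seeing the marker once does not prove that every segment has expired, so the one-minute wait from the report remains the safer guide.

```shell
#!/bin/sh
# Hypothetical helper: poll a log file until a marker string appears.
# Arguments: <log file> <marker> [timeout in seconds, default 120]
# Returns 0 if the marker was found, 1 if the timeout elapsed first.
wait_for_marker() {
    log_file=$1
    marker=$2
    timeout_s=${3:-120}
    elapsed=0
    while [ "$elapsed" -lt "$timeout_s" ]; do
        # -q: quiet, only the exit status matters; -- guards odd markers
        if grep -q -- "$marker" "$log_file" 2>/dev/null; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

Example use, with an assumed log location: `wait_for_marker /var/log/ceph/ceph-mds.a.log 'try_to_expire success' 120`.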

I tried both Luminous and the master branch; both give the same result as above.
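
The outcome of step 7 can also be checked mechanically by comparing the link count of the inode with the number of names that actually point to it. This is a rough sketch under assumptions: `check_links` is a hypothetical helper, and it relies on GNU `stat -c` and on `find -inum`, which are available on typical Linux clients but are not guaranteed everywhere.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's link count with the number of
# directory entries for the same inode found under the given directories.
# Arguments: <file> <dir> [<dir> ...]
# Returns 0 if the counts match, 1 on mismatch, 2 if stat fails.
check_links() {
    file=$1
    shift
    nlink=$(stat -c %h "$file") || return 2   # hard link count
    inode=$(stat -c %i "$file") || return 2   # inode number
    # Count names that refer to this inode; -xdev stays on one filesystem
    names=$(find "$@" -xdev -inum "$inode" 2>/dev/null | wc -l)
    names=$((names + 0))                      # strip padding from wc output
    if [ "$nlink" -eq "$names" ]; then
        echo "consistent: nlink=$nlink names=$names"
        return 0
    fi
    echo "MISMATCH: nlink=$nlink names=$names"
    return 1
}
```

Run as `check_links /mnt/mds0/0 /mnt/mds0 /mnt/mds1`. In the state described in step 7 it would be expected to report a mismatch, since the link count is 2 while only one name exists.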

Related issues 3 (0 open — 3 closed)

Updated by Xinying Song about 4 years ago

I opened a PR at https://github.com/ceph/ceph/pull/34507 to help discuss this problem. The PR is only a sample; it does not cover all slave-op situations, since I am not sure whether I am working in the right direction.

Updated by Patrick Donnelly almost 4 years ago

Status changed from New to Fix Under Review
Assignee set to Xinying Song
Target version set to v16.0.0
Backport set to octopus,nautilus
Pull request ID set to 34507
Component(FS) MDS added

Updated by Patrick Donnelly almost 4 years ago

Status changed from Fix Under Review to Pending Backport

Updated by Nathan Cutler almost 4 years ago

Copied to Backport #45708: octopus: mds: wrong link count under certain circumstance added

Updated by Nathan Cutler almost 4 years ago

Copied to Backport #45709: nautilus: mds: wrong link count under certain circumstance added

Updated by Nathan Cutler almost 4 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Updated by Patrick Donnelly almost 4 years ago

Related to Bug #46533: mds: null pointer dereference in MDCache::finish_rollback added
