Bug #62208: mds: use MDSRank::abort to ceph_abort so necessary sync is done - CephFS - Ceph

Actions

Copy link

Bug #62208

open

mds: use MDSRank::abort to ceph_abort so necessary sync is done

Added by Patrick Donnelly 9 months ago. Updated 7 months ago.

Status:

Fix Under Review

Priority:

High

Assignee:

Patrick Donnelly

Category:

Administration/Usability

Target version:

Ceph - v19.0.0

% Done:

Source:

Development

Tags:

Backport:

reef,quincy,pacific

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

MDS

Labels (FS):

Pull request ID:

53014

Crash signature (v1):

Crash signature (v2):

Description

The MDS calls ceph_abort("msg") in various places. If there is any pending cluster log messages to be sent to the mons, those messages may be lost when the MDS hard stops due to abort. To guarantee those messages are flushed before abort, use the new MDSRank::abort method instead of ceph_abort.

See discussion: https://github.com/ceph/ceph/pull/52638#discussion_r1276289186

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Venky Shankar 9 months ago

Backport changed from reef to reef,quincy,pacific

backport note - we might need to pull in additional commits for p/q.

Actions

Copy link

Updated by Venky Shankar 9 months ago

Category set to Administration/Usability
Assignee set to Leonid Usov

Leonid, please take this one.

Actions

Copy link

Updated by Leonid Usov 9 months ago

Status changed from New to Need More Info
Assignee changed from Leonid Usov to Patrick Donnelly

Hi Venky!

I've started looking at this one, but the ticket doesn't provide sufficient information so I don't know how to proceed with it.
It mentions a PR that seemingly adds the functionality requested here, and it's referencing a different ticket in the system: https://tracker.ceph.com/issues/62164. There is no info on how to reproduce this problem and no mentions of the other ticket which might be a duplicate

I'm assigning this issue to Patric with Need More Info

Actions

Copy link

Updated by Patrick Donnelly 9 months ago

Description updated (diff)
Status changed from Need More Info to New
Assignee changed from Patrick Donnelly to Leonid Usov

Actions

Copy link

Updated by Leonid Usov 9 months ago

Blocked by Bug #62164: qa: "cluster [ERR] MDS abort because newly corrupt dentry to be committed: [dentry #0x1/a [fffffffffffffff6,head] auth (dversion lock) v=13..." added

Actions

Copy link

Updated by Leonid Usov 9 months ago

Status changed from New to In Progress

OK thanks Patrick! So this will wait until you submit the linked issue that introduces the new method.

What are other strategies one could use? Is it a common practice to base PRs on other unmerged PRs they depend on?

Actions

Copy link

Updated by Patrick Donnelly 9 months ago

Leonid Usov wrote:

OK thanks Patrick! So this will wait until you submit the linked issue that introduces the new method.

What are other strategies one could use? Is it a common practice to base PRs on other unmerged PRs they depend on?

It's uncommon but totally fine.

Actions

Copy link