Project

General

Profile

Bug #62208

mds: use MDSRank::abort to ceph_abort so necessary sync is done

Added by Patrick Donnelly 7 months ago. Updated 5 months ago.

Status:
Fix Under Review
Priority:
High
Category:
Administration/Usability
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The MDS calls ceph_abort("msg") in various places. If there is any pending cluster log messages to be sent to the mons, those messages may be lost when the MDS hard stops due to abort. To guarantee those messages are flushed before abort, use the new MDSRank::abort method instead of ceph_abort.

See discussion: https://github.com/ceph/ceph/pull/52638#discussion_r1276289186


Related issues

Blocked by CephFS - Bug #62164: qa: "cluster [ERR] MDS abort because newly corrupt dentry to be committed: [dentry #0x1/a [fffffffffffffff6,head] auth (dversion lock) v=13..." Resolved

History

#1 Updated by Venky Shankar 7 months ago

  • Backport changed from reef to reef,quincy,pacific

backport note - we might need to pull in additional commits for p/q.

#2 Updated by Venky Shankar 7 months ago

  • Category set to Administration/Usability
  • Assignee set to Leonid Usov

Leonid, please take this one.

#3 Updated by Leonid Usov 7 months ago

  • Status changed from New to Need More Info
  • Assignee changed from Leonid Usov to Patrick Donnelly

Hi Venky!

I've started looking at this one, but the ticket doesn't provide sufficient information so I don't know how to proceed with it.
It mentions a PR that seemingly adds the functionality requested here, and it's referencing a different ticket in the system: https://tracker.ceph.com/issues/62164. There is no info on how to reproduce this problem and no mentions of the other ticket which might be a duplicate

I'm assigning this issue to Patric with Need More Info

#4 Updated by Patrick Donnelly 7 months ago

  • Description updated (diff)
  • Status changed from Need More Info to New
  • Assignee changed from Patrick Donnelly to Leonid Usov

#5 Updated by Leonid Usov 7 months ago

  • Blocked by Bug #62164: qa: "cluster [ERR] MDS abort because newly corrupt dentry to be committed: [dentry #0x1/a [fffffffffffffff6,head] auth (dversion lock) v=13..." added

#6 Updated by Leonid Usov 7 months ago

  • Status changed from New to In Progress

OK thanks Patrick! So this will wait until you submit the linked issue that introduces the new method.

What are other strategies one could use? Is it a common practice to base PRs on other unmerged PRs they depend on?

#7 Updated by Patrick Donnelly 7 months ago

Leonid Usov wrote:

OK thanks Patrick! So this will wait until you submit the linked issue that introduces the new method.

What are other strategies one could use? Is it a common practice to base PRs on other unmerged PRs they depend on?

It's uncommon but totally fine.

#8 Updated by Leonid Usov 7 months ago

  • Status changed from In Progress to New

updating the status to New until the API becomes available

#9 Updated by Leonid Usov 6 months ago

  • Status changed from New to In Progress
  • Pull request ID set to 53014

#10 Updated by Leonid Usov 6 months ago

  • Status changed from In Progress to Fix Under Review

#11 Updated by Leonid Usov 5 months ago

  • Assignee changed from Leonid Usov to Patrick Donnelly

Reassigning to Patrick following our discussion on this topic. Patrick wanted to add some generic safe async signal handler framework

Also available in: Atom PDF