Bug #24073

PurgeQueue::_consume() could return true when there were no purge queue item actually executed.

Added by Xuehan Xu almost 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%

Source:
Community (dev)
Tags:
Backport:
luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In our online clusters, we encountered bug #19593. Although we cherry-picked the fixing commits, the purge queue's journal was already damaged. When trying to repair the journal, we found that the journal's head had not been updated for a long time, which was caused by the PurgeQueue::_consume() method always returning true. So we think it may be necessary for this method to return false when no purge queue item was actually executed, even after bug #19593 is fixed. After all, other bugs could also damage the journal.


Related issues

Related to CephFS - Bug #19593: purge queue and standby replay mds Resolved 04/12/2017
Copied to CephFS - Backport #24107: luminous: PurgeQueue::_consume() could return true when there were no purge queue item actually executed. Resolved

History

#2 Updated by Xuehan Xu almost 6 years ago

Xuehan Xu wrote:

In our online clusters, we encountered bug #19593. Although we cherry-picked the fixing commits, the purge queue's journal was already damaged. When trying to repair the journal, we found that the journal's head had not been updated for a long time, which was caused by the PurgeQueue::_consume() method always returning true. So we think it may be necessary for this method to return false when no purge queue item was actually executed, even after bug #19593 is fixed. After all, other bugs could also damage the journal.

Because we hit bug #19593, the journal was damaged and unreadable, so the purge queue could not issue reads and the journaler's flush could not be triggered by the purge queue's read path. Meanwhile, PurgeQueue::_consume() kept returning true because the can_consume() method returned true, so the journaler's flush was never executed and the journal's head was never updated.
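The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Ceph's code: `ToyPurgeQueue`, `consume_reported`, `consume_proposed`, and `on_item_complete` are hypothetical names, and the rule that the head is flushed only when the consume step reports no work is inferred from the description in this report.

```cpp
#include <cstddef>
#include <deque>

// Toy model only: these types are illustrative and are not Ceph's classes.
struct ToyPurgeQueue {
    std::deque<int> readable_items;  // entries the journal can still deliver
    bool journal_readable = true;    // false once the journal is damaged
    std::size_t executed = 0;        // items actually executed
    std::size_t head_flushes = 0;    // times the journal head was written

    // Stand-in for can_consume(): assume the ops limit is never reached.
    bool can_consume() const { return true; }
};

// Behaviour described in the report: the result only reflects can_consume().
bool consume_reported(ToyPurgeQueue& q) {
    bool could_consume = false;
    while (q.can_consume()) {
        could_consume = true;  // set before knowing whether an item exists
        if (!q.journal_readable || q.readable_items.empty()) {
            return could_consume;
        }
        q.readable_items.pop_front();
        ++q.executed;
    }
    return could_consume;
}

// Behaviour requested in the report: true only if an item was executed.
bool consume_proposed(ToyPurgeQueue& q) {
    bool did_work = false;
    while (q.can_consume()) {
        if (!q.journal_readable || q.readable_items.empty()) {
            return did_work;
        }
        q.readable_items.pop_front();
        ++q.executed;
        did_work = true;  // set only after real work happened
    }
    return did_work;
}

// Assumed caller: the head is flushed only when consume reports no work.
void on_item_complete(ToyPurgeQueue& q, bool (*consume)(ToyPurgeQueue&)) {
    if (!consume(q)) {
        ++q.head_flushes;
    }
}
```

With `journal_readable` set to false, calling `on_item_complete` with `consume_reported` never increments `head_flushes`, while the same call with `consume_proposed` flushes the head once.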

#3 Updated by Patrick Donnelly almost 6 years ago

  • Category set to Correctness/Safety
  • Priority changed from Normal to High
  • Target version set to v13.2.0
  • Source set to Community (dev)
  • Backport set to luminous
  • Severity changed from 3 - minor to 2 - major
  • Component(FS) MDS added

#4 Updated by Patrick Donnelly almost 6 years ago

  • Related to Bug #19593: purge queue and standby replay mds added

#5 Updated by dongdong tao almost 6 years ago

Hi Xuehan,
I'm just curious: how did you repair your purge queue journal?

#6 Updated by Xuehan Xu almost 6 years ago

dongdong tao wrote:

Hi Xuehan,
I'm just curious: how did you repair your purge queue journal?

Actually, we have not repaired it yet. But we plan to rewrite the journal head manually, setting its expire_pos to the start of the next journal entry beyond the damaged one and its write_pos to the end of the journal. There are lots of journal entries beyond the damaged one, and we just can't afford to abandon them.
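As a rough illustration of the planned positions, the sketch below computes them for a toy journal. It assumes the entry boundaries are already known, which may not hold for a genuinely damaged entry; `ToyEntry`, `ToyHeader`, and `plan_header` are hypothetical names and do not correspond to Ceph's journal header structures.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one journal entry in a toy journal.
struct ToyEntry {
    std::uint64_t start;   // byte offset where the entry begins
    std::uint64_t length;  // size of the entry in bytes
    bool damaged;          // true if the entry cannot be decoded
};

// Hypothetical subset of a journal header: only the two positions discussed.
struct ToyHeader {
    std::uint64_t expire_pos;
    std::uint64_t write_pos;
};

// If a damaged entry exists, move expire_pos just past the first one and
// write_pos to the end of the last entry; otherwise keep the header as is.
ToyHeader plan_header(const std::vector<ToyEntry>& entries, ToyHeader current) {
    for (const ToyEntry& e : entries) {
        if (e.damaged) {
            ToyHeader planned = current;
            planned.expire_pos = e.start + e.length;  // next entry's start
            const ToyEntry& last = entries.back();
            planned.write_pos = last.start + last.length;  // end of journal
            return planned;
        }
    }
    return current;
}
```

The entries after the damaged one stay inside the range between `expire_pos` and `write_pos`, which matches the goal of not abandoning them.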

#7 Updated by Xuehan Xu almost 6 years ago

dongdong tao wrote:

Hi Xuehan,
I'm just curious: how did you repair your purge queue journal?

By the way, we also noticed your patch https://github.com/ceph/ceph/pull/19471, which enables cephfs-journal-tool to operate on purge queue journals. We plan to backport it to our code and use it to rewrite the purge queue journal's head.

#8 Updated by dongdong tao almost 6 years ago

Yeah, that's what I wanted to recommend to you. It should work as you expect.

#9 Updated by Xuehan Xu almost 6 years ago

dongdong tao wrote:

Yeah, that's what I wanted to recommend to you. It should work as you expect.

Thank you:-) That's very nice of you.

#10 Updated by Patrick Donnelly almost 6 years ago

  • Status changed from New to Pending Backport
  • Assignee set to Xuehan Xu

#11 Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #24107: luminous: PurgeQueue::_consume() could return true when there were no purge queue item actually executed. added

#12 Updated by Nathan Cutler almost 6 years ago

  • Status changed from Pending Backport to Resolved
