Bug #57764
Thread md_log_replay is hanged for ever.
Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Tags:
standby-replay mds backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
In a production environment, we hit a problem: the md_log_replay thread of one standby-replay MDS hangs.
1. The reason:
line1: while (!journaler->is_readable() &&
line2:        journaler->get_read_pos() < journaler->get_write_pos() &&
line3:        !journaler->get_error()) {
line4:   C_SaferCond readable_waiter;
line5:   journaler->wait_for_readable(&readable_waiter);
line6:   r = readable_waiter.wait();
line7: }
This code is from void MDLog::_replay_thread().
(1) Suppose the code enters the while loop and execution switches from this thread ("md_log_replay") to the MR_Finisher thread between line3 and line5. (At this point, journaler->get_read_pos() < journaler->get_write_pos().)
(2) The MR_Finisher thread then calls Journaler::C_Read::finish() -> ls->_finish_read() -> _assimilate_prefetch().
a) In _assimilate_prefetch(), journaler->get_write_pos() may be set equal to journaler->get_read_pos().
b) Because the variable on_readable is still 0 (wait_for_readable() on line5 has not run yet), f->complete() is not called:
if (on_readable) {
  C_OnFinisher *f = on_readable;
  on_readable = 0;
  f->complete(0);
}
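(3) When execution switches back from the MR_Finisher thread to the md_log_replay thread, that thread registers its waiter on line5 and then blocks on line6 forever, because the completion it is waiting for has already been skipped.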
2. For the stack trace of the hang, the steps to reproduce the problem, and the analysis of why write_pos can be set to read_pos, please refer to: https://github.com/ceph/ceph/pull/48281
3. The function MDLog::_reformat_journal() in recovery_thread may have the same problem. Please help check it, thanks!
4. For now, I think it may not be easy to ensure that f->complete() is called before readable_waiter.wait(). Is there a way to make f->complete() be called earlier?
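To make the interleaving in point 1 and the question in point 4 concrete, here is a minimal single-threaded sketch of the lost wakeup. It is only an illustration under stated assumptions: ToyJournaler, recheck_on_register and waiter_completed() are hypothetical names, not the real Journaler API, and the "re-check under the lock when registering the waiter" remedy is one generic way to close such a race, not necessarily what the pull request above does.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical toy model; NOT the real Journaler API.
struct ToyJournaler {
  std::mutex lock;
  uint64_t read_pos = 0;
  uint64_t write_pos = 0;
  bool readable = false;
  bool recheck_on_register = false;      // assumption: toggles the sketched remedy
  std::function<void(int)> on_readable;  // empty means no waiter is registered

  // Finisher side: mirrors the "if (on_readable)" snippet quoted above.
  void assimilate_prefetch(uint64_t new_write_pos) {
    std::function<void(int)> f;
    {
      std::lock_guard<std::mutex> l(lock);
      write_pos = new_write_pos;
      f = std::move(on_readable);
      on_readable = nullptr;
    }
    if (f) f(0);  // nobody is woken if no waiter was registered yet
  }

  // Replay side: register a waiter (line5 in the quoted loop).
  void wait_for_readable(std::function<void(int)> cb) {
    {
      std::lock_guard<std::mutex> l(lock);
      bool must_wait = !readable && read_pos < write_pos;
      if (!recheck_on_register || must_wait) {
        on_readable = std::move(cb);
        return;
      }
    }
    cb(0);  // nothing left to wait for: complete immediately
  }
};

// Replays steps (1)-(3) deterministically in one thread.
// Returns true if the waiter was completed, i.e. line6 would not block.
bool waiter_completed(bool recheck_on_register) {
  ToyJournaler j;
  j.recheck_on_register = recheck_on_register;
  j.read_pos = 100;
  j.write_pos = 200;
  bool completed = false;

  // md_log_replay: the loop condition (line1-line3) evaluates to true.
  bool enter_loop = !j.readable && j.read_pos < j.write_pos;

  // MR_Finisher runs before line5: write_pos becomes equal to read_pos.
  j.assimilate_prefetch(j.read_pos);

  // md_log_replay resumes at line5 and registers its waiter.
  if (enter_loop)
    j.wait_for_readable([&completed](int) { completed = true; });

  return completed;
}
```

In this toy model, waiter_completed(false) returns false (the waiter is registered after the only wakeup was skipped, so line6 would block), while waiter_completed(true) returns true because the registration itself notices that read_pos has caught up with write_pos and completes the waiter at once.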
Related issues
History
#1 Updated by Venky Shankar over 1 year ago
- Category set to Correctness/Safety
- Status changed from New to Fix Under Review
- Target version set to v18.0.0
- Backport set to pacific,quincy
- Pull request ID set to 48281
#2 Updated by Venky Shankar over 1 year ago
Thanks for the bug report. Seems like you found a subtle race. I haven't gone through the fix yet, but I'll get to it soon. Thanks!
#3 Updated by Venky Shankar over 1 year ago
- Assignee set to zhikuo du
#4 Updated by Venky Shankar over 1 year ago
- Status changed from Fix Under Review to Pending Backport
#5 Updated by Backport Bot over 1 year ago
- Copied to Backport #58345: quincy: Thread md_log_replay is hanged for ever. added
#6 Updated by Backport Bot over 1 year ago
- Copied to Backport #58346: pacific: Thread md_log_replay is hanged for ever. added
#7 Updated by Backport Bot over 1 year ago
- Tags changed from standby-replay mds to standby-replay mds backport_processed
#8 Updated by Konstantin Shalygin 8 months ago
- Status changed from Pending Backport to Resolved