Bug #57048
osdc/Journaler: better handle ENOENT during replay as up:standby-replay
Description
   -15> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  1 mds.21387370.journaler.mdlog(ro) recover start
   -14> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  1 mds.21387370.journaler.mdlog(ro) read_head
   -13> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  4 mds.0.log Waiting for journal 0x200 to recover...
   -12> 2022-07-29T13:23:34.742+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) _finish_read_head loghead(trim 7788696698880, expire 7788721262080, write 7789114335530, stream_format 1). probing for end of log (from 7789114335530)...
   -11> 2022-07-29T13:23:34.742+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) probing for end of the log
   -10> 2022-07-29T13:23:34.742+0000 7f1eed56d700 10 monclient: get_auth_request con 0x55c5deab7000 auth_method 0
    -9> 2022-07-29T13:23:34.755+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) _finish_probe_end write_pos = 7789125276317 (header had 7789114335530). recovered.
    -8> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  4 mds.0.log Journal 0x200 recovered.
    -7> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  4 mds.0.log Recovered journal 0x200 in format 1
    -6> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  2 mds.0.0 Booting: 1: loading/discovering base inodes
    -5> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  0 mds.0.cache creating system inode with ino:0x100
    -4> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  0 mds.0.cache creating system inode with ino:0x1
    -3> 2022-07-29T13:23:34.757+0000 7f1ee455b700  2 mds.0.0 Booting: 2: replaying mds log
    -2> 2022-07-29T13:23:34.798+0000 7f1ee455b700  0 mds.21387370.journaler.mdlog(ro) _finish_read got error -2
    -1> 2022-07-29T13:23:34.800+0000 7f1ee2d58700 -1 /builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f1ee2d58700 time 2022-07-29T13:23:34.799865+0000
/builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: 1383: FAILED ceph_assert(journaler->is_readable() || mds->is_daemon_stopping())

 ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f1ef2fc2b60]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a]
 3: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec]
 4: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101]
 5: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a]
 6: clone()

     0> 2022-07-29T13:23:34.801+0000 7f1ee2d58700 -1 *** Caught signal (Aborted) ** in thread 7f1ee2d58700 thread_name:md_log_replay

 ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7f1ef1d6dc20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1ef2fc2bb1]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a]
 6: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec]
 7: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101]
 8: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a]
 9: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Journaler should not cause the MDS to assert in this situation. We should handle this more gracefully.
Updated by Venky Shankar over 1 year ago
Patrick,
Do you mean a standby-replay MDS should tolerate missing journal objects? How can it end up in such a situation? (I recall you mentioning this in a BZ somewhere.)
Updated by Patrick Donnelly over 1 year ago
Venky Shankar wrote:
Patrick,
Do you mean a standby-replay MDS should tolerate missing journal objects?
Well, it's supposed to already:
Note, ENOENT (-2) got converted to EAGAIN. However, I don't see why the assert on 1381 gets triggered. Maybe the assert is a different bug?
How can it end up in such a situation? (I recall you mentioning this in a BZ somewhere.)
It can happen if the standby-replay daemon gets behind the MDS it's following. In that case, the only thing to do is reboot the standby-replay daemon because it cannot know what the missing journal segments contained. It should do that gracefully in a way which does not freak out the cluster admin. But a cluster INFO/WRN would be useful to record the event in case there is a pattern.
Updated by Greg Farnum over 1 year ago
Patrick Donnelly wrote:
Venky Shankar wrote:
Patrick,
Do you mean a standby-replay MDS should tolerate missing journal objects?
Well, it's supposed to already:
Note, ENOENT (-2) got converted to EAGAIN. However, I don't see why the assert on 1381 gets triggered. Maybe the assert is a different bug?
is_readable() is guaranteed to be false if we've hit the block at 1322 because of:
if (error != 0) {
  return false;
}
I've seen plenty of instances of standby-replay MDSes respawning when they fall behind the journal, so there are some working code paths, but this doesn't seem like one of them to me.
Updated by Patrick Donnelly over 1 year ago
Greg Farnum wrote:
Patrick Donnelly wrote:
Venky Shankar wrote:
Patrick,
Do you mean a standby-replay MDS should tolerate missing journal objects?
Well, it's supposed to already:
Note, ENOENT (-2) got converted to EAGAIN. However, I don't see why the assert on 1381 gets triggered. Maybe the assert is a different bug?
is_readable() is guaranteed to be false if we've hit the block at 1322 because of:
if (error != 0) {
  return false;
}
If block 1322 is run then 1381 should be unreachable because:
https://github.com/ceph/ceph/blob/5ab5ff46d6d799c46d8a1c01cc08068a22e4c9b5/src/mds/MDLog.cc#L1374
Updated by Venky Shankar 3 months ago
- Category set to Correctness/Safety
- Assignee set to Venky Shankar
- Target version set to v19.0.0
Updated by Venky Shankar 3 months ago
The calls to journaler->is_readable() and journaler->get_error() in MDLog::_replay_thread() will drop Journaler::lock between invocations, so, theoretically, it's possible that the initial check:

  // loop
  int r = 0;
  while (1) {
    // wait for read?
    while (!journaler->is_readable() &&
           journaler->get_read_pos() < journaler->get_write_pos() &&
           !journaler->get_error()) {
      C_SaferCond readable_waiter;
      journaler->wait_for_readable(&readable_waiter);
      r = readable_waiter.wait();
    }
    if (journaler->get_error()) {
      r = journaler->get_error();
      dout(0) << "_replay journaler got error " << r << ", aborting" << dendl;
journaler->is_readable()
returned true, thereby breaking out of the (inner) while loop and bypassing the journaler->get_error()
check, and by the time this hits the next set of checks:
    if (!journaler->is_readable() &&
        journaler->get_read_pos() == journaler->get_write_pos())
      break;

    ceph_assert(journaler->is_readable() || mds->is_daemon_stopping());
It's possible that the journal is unreadable due to some error that happened during prefetch. In short, it seems that these checks are racy. Can someone fault my logic?
Updated by Venky Shankar 3 months ago
- Backport changed from quincy,pacific to quincy,reef
Updated by Patrick Donnelly 2 months ago
Venky Shankar wrote:
The calls to journaler->is_readable() and journaler->get_error() in MDLog::_replay_thread() will drop Journaler::lock between invocations, so, theoretically, it's possible that the initial check: [...]
journaler->is_readable()
returned true, thereby breaking out of the (inner) while loop and bypassing the journaler->get_error() check, and by the time this hits the next set of checks: [...]
It's possible that the journal is unreadable due to some error that happened during prefetch. In short, it seems that these checks are racy. Can someone fault my logic?
Yes, that makes sense. The "stopping" check originates from 39cf07118583166287ef0faa1811ae8efc9bef85. I think it along with the `is_readable` check can be thrown out. At this point, try to read an entry (which does a readable check) and do normal error handling.
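A rough pseudocode sketch of what that restructuring might look like (illustrative only, not the actual patch; the error-handling branch is a placeholder):

  while (true) {
    // wait for data or an error, as before
    while (!journaler->is_readable() &&
           journaler->get_read_pos() < journaler->get_write_pos() &&
           !journaler->get_error()) {
      C_SaferCond readable_waiter;
      journaler->wait_for_readable(&readable_waiter);
      readable_waiter.wait();
    }

    // no ceph_assert on readability here: just attempt the read; the read
    // path re-checks state under Journaler::lock, so an error that raced
    // in (e.g. ENOENT from prefetch) surfaces as a failed read, not an abort
    bufferlist bl;
    if (!journaler->try_read_entry(bl)) {
      if (journaler->get_error()) {
        // normal error handling: log it, respawn if standby-replay
        break;
      }
      if (journaler->get_read_pos() == journaler->get_write_pos())
        break;  // clean end of journal
      continue;  // raced with a writer; go back to waiting
    }
    // ... decode and replay the LogEvent in bl ...
  }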
Updated by Venky Shankar 2 months ago
Patrick Donnelly wrote:
Venky Shankar wrote:
The calls to journaler->is_readable() and journaler->get_error() in MDLog::_replay_thread() will drop Journaler::lock between invocations, so, theoretically, it's possible that the initial check: [...]
journaler->is_readable()
returned true, thereby breaking out of the (inner) while loop and bypassing the journaler->get_error() check, and by the time this hits the next set of checks: [...]
It's possible that the journal is unreadable due to some error that happened during prefetch. In short, it seems that these checks are racy. Can someone fault my logic?
Yes, that makes sense. The "stopping" check originates from 39cf07118583166287ef0faa1811ae8efc9bef85. I think it along with the `is_readable` check can be thrown out. At this point, try to read an entry (which does a readable check) and do normal error handling.
Yeah, those assert checks are racy. I wonder why we don't run into these failures with our fs suite (maybe our standby-replay jobs aren't enough, or aren't really metadata-workload intensive).
Updated by Venky Shankar 2 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 55639
Updated by Venky Shankar about 2 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from quincy,reef to quincy,reef,squid
Updated by Backport Bot about 2 months ago
- Copied to Backport #64758: squid: osdc/Journaler: better handle ENOENT during replay as up:standby-replay added
Updated by Backport Bot about 2 months ago
- Copied to Backport #64759: reef: osdc/Journaler: better handle ENOENT during replay as up:standby-replay added
Updated by Backport Bot about 2 months ago
- Copied to Backport #64760: quincy: osdc/Journaler: better handle ENOENT during replay as up:standby-replay added