Bug #57048

osdc/Journaler: better handle ENOENT during replay as up:standby-replay

Added by Patrick Donnelly over 1 year ago. Updated about 2 months ago.

Status: Pending Backport
Priority: High
Assignee: Venky Shankar
Category: Correctness/Safety
Target version: v19.0.0
% Done: 0%
Source: Support
Tags: backport_processed
Backport: quincy,reef,squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID: 55639
Crash signature (v1):
Crash signature (v2):

Description

   -15> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  1 mds.21387370.journaler.mdlog(ro) recover start
   -14> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  1 mds.21387370.journaler.mdlog(ro) read_head
   -13> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  4 mds.0.log Waiting for journal 0x200 to recover...
   -12> 2022-07-29T13:23:34.742+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) _finish_read_head loghead(trim 7788696698880, expire 7788721262080, write 7789114335530, stream_format 1).  probing for end of log (from 7789114335530)...
   -11> 2022-07-29T13:23:34.742+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) probing for end of the log
   -10> 2022-07-29T13:23:34.742+0000 7f1eed56d700 10 monclient: get_auth_request con 0x55c5deab7000 auth_method 0
    -9> 2022-07-29T13:23:34.755+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) _finish_probe_end write_pos = 7789125276317 (header had 7789114335530). recovered.
    -8> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  4 mds.0.log Journal 0x200 recovered.
    -7> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  4 mds.0.log Recovered journal 0x200 in format 1
    -6> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  2 mds.0.0 Booting: 1: loading/discovering base inodes
    -5> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  0 mds.0.cache creating system inode with ino:0x100
    -4> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  0 mds.0.cache creating system inode with ino:0x1
    -3> 2022-07-29T13:23:34.757+0000 7f1ee455b700  2 mds.0.0 Booting: 2: replaying mds log
    -2> 2022-07-29T13:23:34.798+0000 7f1ee455b700  0 mds.21387370.journaler.mdlog(ro) _finish_read got error -2
    -1> 2022-07-29T13:23:34.800+0000 7f1ee2d58700 -1 /builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f1ee2d58700 time 2022-07-29T13:23:34.799865+0000
/builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: 1383: FAILED ceph_assert(journaler->is_readable() || mds->is_daemon_stopping())

 ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f1ef2fc2b60]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a]
 3: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec]
 4: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101]
 5: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a]
 6: clone()

     0> 2022-07-29T13:23:34.801+0000 7f1ee2d58700 -1 *** Caught signal (Aborted) **
 in thread 7f1ee2d58700 thread_name:md_log_replay

 ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7f1ef1d6dc20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1ef2fc2bb1]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a]
 6: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec]
 7: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101]
 8: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a]
 9: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Journaler should not cause the MDS to assert in this situation. We should handle this more gracefully.


Related issues 3 (2 open, 1 closed)

Copied to CephFS - Backport #64758: squid: osdc/Journaler: better handle ENOENT during replay as up:standby-replay (In Progress, Venky Shankar)
Copied to CephFS - Backport #64759: reef: osdc/Journaler: better handle ENOENT during replay as up:standby-replay (Resolved, Venky Shankar)
Copied to CephFS - Backport #64760: quincy: osdc/Journaler: better handle ENOENT during replay as up:standby-replay (In Progress, Venky Shankar)
Actions #1

Updated by Venky Shankar over 1 year ago

Patrick,

Do you mean a standby-replay MDS should tolerate missing journal objects? How can it end up in such a situation? (I recall you mentioning this in a BZ somewhere.)

Actions #2

Updated by Patrick Donnelly over 1 year ago

Venky Shankar wrote:

Patrick,

Do you mean a standby-replay MDS should tolerate missing journal objects?

Well, it's supposed to already:

https://github.com/ceph/ceph/blob/5ab5ff46d6d799c46d8a1c01cc08068a22e4c9b5/src/mds/MDLog.cc#L1320-L1325

Note, ENOENT (-2) got converted to EAGAIN. However, I don't see why the assert on 1381 gets triggered. Maybe the assert is a different bug?

How can it end up in such a situation? (I recall you mentioning this in a BZ somewhere.)

It can happen if the standby-replay daemon gets behind the MDS it's following. In that case, the only thing to do is reboot the standby-replay daemon because it cannot know what the missing journal segments contained. It should do that gracefully in a way which does not freak out the cluster admin. But a cluster INFO/WRN would be useful to record the event in case there is a pattern.
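
For illustration only, a minimal sketch of that graceful path, using names from MDLog::_replay_thread (`journaler`, `mds`); the condition, the cluster log message text, and the exact placement are assumptions, not the eventual fix:

    // Hypothetical sketch: if the journal objects we need were already
    // trimmed by the active MDS, record the event in the cluster log and
    // respawn instead of tripping the assert further down.
    int err = journaler->get_error();
    if (err == -ENOENT && mds->is_standby_replay()) {
      mds->clog->warn() << "standby-replay MDS fell behind the journal of the"
                        << " active MDS; respawning to restart replay";
      mds->respawn();  // does not return
    }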

Actions #3

Updated by Greg Farnum over 1 year ago

Patrick Donnelly wrote:

Venky Shankar wrote:

Patrick,

Do you mean a standby-replay MDS should tolerate missing journal objects?

Well, it's supposed to already:

https://github.com/ceph/ceph/blob/5ab5ff46d6d799c46d8a1c01cc08068a22e4c9b5/src/mds/MDLog.cc#L1320-L1325

Note, ENOENT (-2) got converted to EAGAIN. However, I don't see why the assert on 1381 gets triggered. Maybe the assert is a different bug?

is_readable() is guaranteed to be false if we've hit the block at 1322 because of:

    if (error != 0) {
      return false;
    }

I've seen plenty of instances of standby-replay MDSes respawning when they fall behind the journal, so there are some working code paths, but this doesn't seem like one of them to me.

Actions #4

Updated by Patrick Donnelly over 1 year ago

Greg Farnum wrote:

Patrick Donnelly wrote:

Venky Shankar wrote:

Patrick,

Do you mean a standby-replay MDS should tolerate missing journal objects?

Well, it's supposed to already:

https://github.com/ceph/ceph/blob/5ab5ff46d6d799c46d8a1c01cc08068a22e4c9b5/src/mds/MDLog.cc#L1320-L1325

Note, ENOENT (-2) got converted to EAGAIN. However, I don't see why the assert on 1381 gets triggered. Maybe the assert is a different bug?

is_readable() is guaranteed to be false if we've hit the block at 1322 because of:

    if (error != 0) {
      return false;
    }

If block 1322 is run then 1381 should be unreachable because:

https://github.com/ceph/ceph/blob/5ab5ff46d6d799c46d8a1c01cc08068a22e4c9b5/src/mds/MDLog.cc#L1374

Actions #5

Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v18.0.0)
Actions #6

Updated by Venky Shankar 3 months ago

  • Category set to Correctness/Safety
  • Assignee set to Venky Shankar
  • Target version set to v19.0.0
Actions #7

Updated by Venky Shankar 3 months ago

The calls to journaler->is_readable() and journaler->get_error() in MDLog::_replay_thread() will drop Journaler::lock between invocations, so, theoretically, it's possible that in the initial check:

  // loop
  int r = 0;
  while (1) {
    // wait for read?
    while (!journaler->is_readable() &&
           journaler->get_read_pos() < journaler->get_write_pos() &&
           !journaler->get_error()) {
      C_SaferCond readable_waiter;
      journaler->wait_for_readable(&readable_waiter);
      r = readable_waiter.wait();
    }
    if (journaler->get_error()) {
      r = journaler->get_error();
      dout(0) << "_replay journaler got error " << r << ", aborting" << dendl;

journaler->is_readable() returned true, thereby breaking out of the (inner) while loop and bypassing the journaler->get_error() check, and by the time execution hits the next set of checks:

    if (!journaler->is_readable() &&
        journaler->get_read_pos() == journaler->get_write_pos())
      break;

    ceph_assert(journaler->is_readable() || mds->is_daemon_stopping());

the journal has become unreadable due to some error that happened during prefetch. In short, it seems that these checks are racy. Can someone fault my logic?
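
To make the suspected interleaving concrete (the thread labels below are illustrative; the right-hand side is whichever finisher thread runs Journaler::_finish_read()):

    replay thread (MDLog::_replay_thread)     read-completion thread (Journaler)
    -------------------------------------     ----------------------------------
    inner wait loop sees is_readable() == true
      and exits
    get_error() == 0, error branch skipped
                                              _finish_read() gets -ENOENT from the
                                              OSD and records the error
    is_readable() is now false, but
      read_pos < write_pos, so the break is
      not taken
    ceph_assert(is_readable() || stopping)
      fails -> abort (the crash in the
      description)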

Actions #8

Updated by Venky Shankar 3 months ago

  • Backport changed from quincy,pacific to quincy,reef
Actions #9

Updated by Patrick Donnelly 2 months ago

Venky Shankar wrote:

The calls to journaler->is_readable() and journaler->get_error() in MDLog::_replay_thread() will drop Journaler::lock between invocations, so, theoretically, it's possible that in the initial check:

[...]

journaler->is_readable() returned true, thereby breaking out of the (inner) while loop and bypassing the journaler->get_error() check, and by the time execution hits the next set of checks:

[...]

the journal has become unreadable due to some error that happened during prefetch. In short, it seems that these checks are racy. Can someone fault my logic?

Yes, that makes sense. The "stopping" check originates from 39cf07118583166287ef0faa1811ae8efc9bef85. I think it, along with the `is_readable` check, can be thrown out. At this point, try to read an entry (which does a readable check) and do normal error handling.
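
A rough sketch of that suggestion (an illustration of the idea only, not necessarily what PR 55639 does; try_read_entry() and get_error() are existing Journaler calls, the surrounding control flow is assumed):

    // Sketch: drop the racy assert and let the read attempt itself decide.
    bufferlist bl;
    if (!journaler->try_read_entry(bl)) {
      // Nothing was read: either replay has caught up, or an error (e.g.
      // ENOENT from a prefetch of a trimmed object) landed after the checks
      // above. Loop back so the existing get_error() branch deals with it.
      if (journaler->get_error())
        continue;
      break;  // read_pos == write_pos: replay is complete
    }
    // ... decode and apply the log event from bl as before ...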

Actions #10

Updated by Venky Shankar 2 months ago

Patrick Donnelly wrote:

Venky Shankar wrote:

The calls to journaler->is_readable() and journaler->get_error() in MDLog::_replay_thread() will drop Journaler::lock between invocations, so, theoretically, it's possible that in the initial check:

[...]

journaler->is_readable() returned true, thereby breaking out of the (inner) while loop and bypassing the journaler->get_error() check, and by the time execution hits the next set of checks:

[...]

the journal has become unreadable due to some error that happened during prefetch. In short, it seems that these checks are racy. Can someone fault my logic?

Yes, that makes sense. The "stopping" check originates from 39cf07118583166287ef0faa1811ae8efc9bef85. I think it, along with the `is_readable` check, can be thrown out. At this point, try to read an entry (which does a readable check) and do normal error handling.

Yeah, those assert checks are racy. I wonder why we do not run into these failures with our fs suite (maybe our s-r jobs aren't enough, or aren't really metadata-workload intensive).

Actions #11

Updated by Venky Shankar 2 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 55639
Actions #12

Updated by Venky Shankar about 2 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from quincy,reef to quincy,reef,squid
Actions #13

Updated by Backport Bot about 2 months ago

  • Copied to Backport #64758: squid: osdc/Journaler: better handle ENOENT during replay as up:standby-replay added
Actions #14

Updated by Backport Bot about 2 months ago

  • Copied to Backport #64759: reef: osdc/Journaler: better handle ENOENT during replay as up:standby-replay added
Actions #15

Updated by Backport Bot about 2 months ago

  • Copied to Backport #64760: quincy: osdc/Journaler: better handle ENOENT during replay as up:standby-replay added
Actions #16

Updated by Backport Bot about 2 months ago

  • Tags set to backport_processed