Bug #57048
osdc/Journaler: better handle ENOENT during replay as up:standby-replay
Description
   -15> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  1 mds.21387370.journaler.mdlog(ro) recover start
   -14> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  1 mds.21387370.journaler.mdlog(ro) read_head
   -13> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  4 mds.0.log Waiting for journal 0x200 to recover...
   -12> 2022-07-29T13:23:34.742+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) _finish_read_head loghead(trim 7788696698880, expire 7788721262080, write 7789114335530, stream_format 1). probing for end of log (from 7789114335530)...
   -11> 2022-07-29T13:23:34.742+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) probing for end of the log
   -10> 2022-07-29T13:23:34.742+0000 7f1eed56d700 10 monclient: get_auth_request con 0x55c5deab7000 auth_method 0
    -9> 2022-07-29T13:23:34.755+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) _finish_probe_end write_pos = 7789125276317 (header had 7789114335530). recovered.
    -8> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  4 mds.0.log Journal 0x200 recovered.
    -7> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  4 mds.0.log Recovered journal 0x200 in format 1
    -6> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  2 mds.0.0 Booting: 1: loading/discovering base inodes
    -5> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  0 mds.0.cache creating system inode with ino:0x100
    -4> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  0 mds.0.cache creating system inode with ino:0x1
    -3> 2022-07-29T13:23:34.757+0000 7f1ee455b700  2 mds.0.0 Booting: 2: replaying mds log
    -2> 2022-07-29T13:23:34.798+0000 7f1ee455b700  0 mds.21387370.journaler.mdlog(ro) _finish_read got error -2
    -1> 2022-07-29T13:23:34.800+0000 7f1ee2d58700 -1 /builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f1ee2d58700 time 2022-07-29T13:23:34.799865+0000
/builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: 1383: FAILED ceph_assert(journaler->is_readable() || mds->is_daemon_stopping())

 ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f1ef2fc2b60]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a]
 3: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec]
 4: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101]
 5: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a]
 6: clone()

     0> 2022-07-29T13:23:34.801+0000 7f1ee2d58700 -1 *** Caught signal (Aborted) ** in thread 7f1ee2d58700 thread_name:md_log_replay

 ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7f1ef1d6dc20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1ef2fc2bb1]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a]
 6: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec]
 7: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101]
 8: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a]
 9: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Journaler should not cause the MDS to assert in this situation. We should handle this more gracefully.
Updated by Venky Shankar over 1 year ago
Patrick,
Do you mean a standby-replay MDS should tolerate missing journal objects? How can it end up in such a situation? (I recall you mentioning this in a BZ somewhere.)
Updated by Patrick Donnelly over 1 year ago
Venky Shankar wrote:
Patrick,
Do you mean a standby-replay MDS should tolerate missing journal objects?
Well, it's supposed to already:
Note, ENOENT (-2) got converted to EAGAIN. However, I don't see why the assert on 1381 gets triggered. Maybe the assert is a different bug?
How can it end up in such a situation? (I recall you mentioning this in a BZ somewhere.)
It can happen if the standby-replay daemon gets behind the MDS it's following. In that case, the only thing to do is reboot the standby-replay daemon because it cannot know what the missing journal segments contained. It should do that gracefully in a way which does not freak out the cluster admin. But a cluster INFO/WRN would be useful to record the event in case there is a pattern.
Updated by Greg Farnum over 1 year ago
Patrick Donnelly wrote:
Venky Shankar wrote:
Patrick,
Do you mean a standby-replay MDS should tolerate missing journal objects?
Well, it's supposed to already:
Note, ENOENT (-2) got converted to EAGAIN. However, I don't see why the assert on 1381 gets triggered. Maybe the assert is a different bug?
is_readable() is guaranteed to be false if we've hit the block at 1322 because of:
if (error != 0) {
  return false;
}
I've seen plenty of instances of standby-replay MDSes respawning when they fall behind the journal, so there are some working code paths, but this doesn't seem like one of them to me.
Updated by Patrick Donnelly over 1 year ago
Greg Farnum wrote:
Patrick Donnelly wrote:
Venky Shankar wrote:
Patrick,
Do you mean a standby-replay MDS should tolerate missing journal objects?
Well, it's supposed to already:
Note, ENOENT (-2) got converted to EAGAIN. However, I don't see why the assert on 1381 gets triggered. Maybe the assert is a different bug?
is_readable() is guaranteed to be false if we've hit the block at 1322 because of:
if (error != 0) {
  return false;
}
If block 1322 is run then 1381 should be unreachable because:
https://github.com/ceph/ceph/blob/5ab5ff46d6d799c46d8a1c01cc08068a22e4c9b5/src/mds/MDLog.cc#L1374
Updated by Venky Shankar 3 months ago
- Category set to Correctness/Safety
- Assignee set to Venky Shankar
- Target version set to v19.0.0
Updated by Venky Shankar 3 months ago
The calls to journaler->is_readable() and journaler->get_error() in MDLog::_replay_thread() will drop Journaler::lock between invocations, so, theoretically, it's possible that the initial check:

  // loop
  int r = 0;
  while (1) {
    // wait for read?
    while (!journaler->is_readable() &&
           journaler->get_read_pos() < journaler->get_write_pos() &&
           !journaler->get_error()) {
      C_SaferCond readable_waiter;
      journaler->wait_for_readable(&readable_waiter);
      r = readable_waiter.wait();
    }
    if (journaler->get_error()) {
      r = journaler->get_error();
      dout(0) << "_replay journaler got error " << r << ", aborting" << dendl;
journaler->is_readable()
returned true, thereby breaking out of the (inner) while loop and bypassing the journaler->get_error()
check, and by the time this hits the next set of checks:
    if (!journaler->is_readable() &&
        journaler->get_read_pos() == journaler->get_write_pos())
      break;

    ceph_assert(journaler->is_readable() || mds->is_daemon_stopping());
It's possible that the journal is unreadable due to some error that happened during prefetch. In short, it seems that these checks are racy. Can someone fault my logic?
Updated by Venky Shankar 3 months ago
- Backport changed from quincy,pacific to quincy,reef
Updated by Patrick Donnelly 2 months ago
Venky Shankar wrote:
The calls to journaler->is_readable() and journaler->get_error() in MDLog::_replay_thread() will drop Journaler::lock between invocations, so, theoretically, it's possible that the initial check: [...]
journaler->is_readable()
returned true, thereby breaking out of the (inner) while loop and bypassing the journaler->get_error() check, and by the time this hits the next set of checks: [...]
It's possible that the journal is unreadable due to some error that happened during prefetch. In short, it seems that these checks are racy. Can someone fault my logic?
Yes, that makes sense. The "stopping" check originates from 39cf07118583166287ef0faa1811ae8efc9bef85. I think it along with the `is_readable` check can be thrown out. At this point, try to read an entry (which does a readable check) and do normal error handling.
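A rough pseudocode sketch of what that restructuring might look like (illustrative only, not the actual patch; the error-handling branch is a placeholder):

  while (true) {
    // wait for data or an error, as before
    while (!journaler->is_readable() &&
           journaler->get_read_pos() < journaler->get_write_pos() &&
           !journaler->get_error()) {
      C_SaferCond readable_waiter;
      journaler->wait_for_readable(&readable_waiter);
      readable_waiter.wait();
    }

    // no ceph_assert on readability here: just attempt the read; the read
    // path re-checks state under Journaler::lock, so an error that raced
    // in (e.g. ENOENT from prefetch) surfaces as a failed read, not an abort
    bufferlist bl;
    if (!journaler->try_read_entry(bl)) {
      if (journaler->get_error()) {
        // normal error handling: log it, respawn if standby-replay
        break;
      }
      if (journaler->get_read_pos() == journaler->get_write_pos())
        break;  // clean end of journal
      continue;  // raced with a writer; go back to waiting
    }
    // ... decode and replay the LogEvent in bl ...
  }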
Updated by Venky Shankar 2 months ago
Patrick Donnelly wrote:
Venky Shankar wrote:
The calls to journaler->is_readable() and journaler->get_error() in MDLog::_replay_thread() will drop Journaler::lock between invocations, so, theoretically, it's possible that the initial check: [...]
journaler->is_readable()
returned true, thereby breaking out of the (inner) while loop and bypassing the journaler->get_error() check, and by the time this hits the next set of checks: [...]
It's possible that the journal is unreadable due to some error that happened during prefetch. In short, it seems that these checks are racy. Can someone fault my logic?
Yes, that makes sense. The "stopping" check originates from 39cf07118583166287ef0faa1811ae8efc9bef85. I think it along with the `is_readable` check can be thrown out. At this point, try to read an entry (which does a readable check) and do normal error handling.
Yeah, those assert checks are racy. I wonder why we don't run into these failures with our fs suite (maybe our standby-replay jobs aren't enough, or aren't really metadata-workload intensive).
Updated by Venky Shankar 2 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 55639
Updated by Venky Shankar about 2 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from quincy,reef to quincy,reef,squid
Updated by Backport Bot about 2 months ago
- Copied to Backport #64758: squid: osdc/Journaler: better handle ENOENT during replay as up:standby-replay added
Updated by Backport Bot about 2 months ago
- Copied to Backport #64759: reef: osdc/Journaler: better handle ENOENT during replay as up:standby-replay added
Updated by Backport Bot about 2 months ago
- Copied to Backport #64760: quincy: osdc/Journaler: better handle ENOENT during replay as up:standby-replay added