Actions
Bug #57048
openosdc/Journaler: better handle ENOENT during replay as up:standby-replay
Status:
Pending Backport
Priority:
High
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Support
Tags:
backport_processed
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
-15> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700 1 mds.21387370.journaler.mdlog(ro) recover start -14> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700 1 mds.21387370.journaler.mdlog(ro) read_head -13> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700 4 mds.0.log Waiting for journal 0x200 to recover... -12> 2022-07-29T13:23:34.742+0000 7f1ee455b700 1 mds.21387370.journaler.mdlog(ro) _finish_read_head loghead(trim 7788696698880, expire 7788721262080, write 7789114335530, stream_format 1). probing for end of log (from 7789114335530)... -11> 2022-07-29T13:23:34.742+0000 7f1ee455b700 1 mds.21387370.journaler.mdlog(ro) probing for end of the log -10> 2022-07-29T13:23:34.742+0000 7f1eed56d700 10 monclient: get_auth_request con 0x55c5deab7000 auth_method 0 -9> 2022-07-29T13:23:34.755+0000 7f1ee455b700 1 mds.21387370.journaler.mdlog(ro) _finish_probe_end write_pos = 7789125276317 (header had 7789114335530). recovered. -8> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700 4 mds.0.log Journal 0x200 recovered. -7> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700 4 mds.0.log Recovered journal 0x200 in format 1 -6> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700 2 mds.0.0 Booting: 1: loading/discovering base inodes -5> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700 0 mds.0.cache creating system inode with ino:0x100 -4> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700 0 mds.0.cache creating system inode with ino:0x1 -3> 2022-07-29T13:23:34.757+0000 7f1ee455b700 2 mds.0.0 Booting: 2: replaying mds log -2> 2022-07-29T13:23:34.798+0000 7f1ee455b700 0 mds.21387370.journaler.mdlog(ro) _finish_read got error -2 -1> 2022-07-29T13:23:34.800+0000 7f1ee2d58700 -1 /builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f1ee2d58700 time 2022-07-29T13:23:34.799865+0000 /builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: 1383: FAILED ceph_assert(journaler->is_readable() || mds->is_daemon_stopping()) ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f1ef2fc2b60] 2: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a] 3: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec] 4: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101] 5: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a] 6: clone() 0> 2022-07-29T13:23:34.801+0000 7f1ee2d58700 -1 *** Caught signal (Aborted) ** in thread 7f1ee2d58700 thread_name:md_log_replay ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable) 1: /lib64/libpthread.so.0(+0x12c20) [0x7f1ef1d6dc20] 2: gsignal() 3: abort() 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1ef2fc2bb1] 5: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a] 6: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec] 7: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101] 8: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a] 9: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Journaler should not cause the MDS to assert in this situation. We should handle this more gracefully.
Actions