Bug #48711

mds: standby-replay mds abort when replay metablob

Added by haitao chen over 3 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
% Done: 0%
Regression: No
Severity: 2 - major
Component(FS): MDS

Description

Ceph version: 14.2.15
OS: CentOS 7.6.1810
We created a filesystem with three active MDS, three standby-replay MDS, and three standby MDS daemons.
We then created a directory and exported it via Samba or NFS-Ganesha for long-running IO testing. After running for some time, one of the standby-replay MDS daemons crashed, with the log listed below:

   -16> 2020-12-22 09:17:40.419 7fa2fcb1e700  5 mds.beacon.node181-0 Sending beacon up:standby-replay seq 82158
   -15> 2020-12-22 09:17:40.420 7fa2fcb1e700 10 monclient: _send_mon_message to mon.node181 at v1:10.0.50.181:6789/0
   -14> 2020-12-22 09:17:40.420 7fa302329700  5 mds.beacon.node181-0 received beacon reply up:standby-replay seq 82158 rtt 0.000999997
   -13> 2020-12-22 09:17:40.618 7fa2fd31f700  5 mds.1.0 Restarting replay as standby-replay
   -12> 2020-12-22 09:17:40.619 7fa2f9317700  1 mds.147261.journaler.mdlog(ro) probing for end of the log
   -11> 2020-12-22 09:17:40.619 7fa2f9317700  1 mds.147261.journaler.mdlog(ro) _finish_reprobe new_end = 31909564363 (header had 31909185210).
   -10> 2020-12-22 09:17:40.619 7fa2f9317700  2 mds.1.0 Booting: 2: replaying mds log

    -9> 2020-12-22 09:17:40.646 7fa2f7b14700  0 mds.1.journal EMetaBlob.replay missing dir ino  0x1000009ba45
    -8> 2020-12-22 09:17:40.646 7fa2f7b14700 -1 log_channel(cluster) log [ERR] : failure replaying journal (EMetaBlob)
    -7> 2020-12-22 09:17:40.646 7fa2f7b14700  5 mds.beacon.node181-0 set_want_state: up:standby-replay -> down:damaged

    -6> 2020-12-22 09:17:40.646 7fa2f7b14700 10 log_client  log_queue is 1 last_log 1 sent 0 num 1 unsent 1 sending 1
    -5> 2020-12-22 09:17:40.646 7fa2f7b14700 10 log_client  will send 2020-12-22 09:17:40.647702 mds.node181-0 (mds.147261) 1 : cluster [ERR] failure replaying journal (EMetaBlob)
    -4> 2020-12-22 09:17:40.647 7fa2f7b14700 10 monclient: _send_mon_message to mon.node181 at v1:10.0.50.181:6789/0
    -3> 2020-12-22 09:17:40.647 7fa2f7b14700  5 mds.beacon.node181-0 Sending beacon down:damaged seq 82159
    -2> 2020-12-22 09:17:40.648 7fa2f7b14700 10 monclient: _send_mon_message to mon.node181 at v1:10.0.50.181:6789/0
    -1> 2020-12-22 09:17:40.648 7fa302329700  5 mds.beacon.node181-0 received beacon reply down:damaged seq 82159 rtt 0.000999996
     0> 2020-12-22 09:17:40.648 7fa2f7b14700  1 mds.node181-0 respawn!
--- logging levels ---
   0/ 5 none

This occurred twice during the long IO testing.
The attached file is the complete log.

standby_replay_crash.zip - complete log file (84.2 KB) haitao chen, 12/24/2020 02:55 AM
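
For context, the error at line -9 of the log ("EMetaBlob.replay missing dir ino") can be summarized with a minimal C++ sketch of the control flow (illustrative names only, not the actual Ceph source): during replay, a journaled EMetaBlob references a directory inode that is not in the standby-replay daemon's cache, so replay cannot continue and the daemon reports journal damage rather than proceeding with inconsistent state.

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    // Toy stand-ins for the MDS metadata cache (illustrative only).
    using inodeno_t = uint64_t;
    struct CInode { inodeno_t ino; };
    std::unordered_map<inodeno_t, CInode> cache;

    // Simplified shape of the missing-dir handling in EMetaBlob replay:
    // if a journaled dirlump references an inode the replaying MDS does
    // not have in cache, it cannot reconstruct the directory, so it
    // logs the error and aborts replay instead of guessing.
    bool replay_metablob_dir(inodeno_t dir_ino) {
        if (cache.find(dir_ino) == cache.end()) {
            std::cerr << "EMetaBlob.replay missing dir ino 0x"
                      << std::hex << dir_ino << std::dec << "\n";
            // In the real daemon this is followed by the cluster log
            // error and the down:damaged / respawn sequence in the log.
            return false;
        }
        // ... apply the journaled dentry/inode updates ...
        return true;
    }

    int main() {
        cache[0x10000000001] = CInode{0x10000000001};
        replay_metablob_dir(0x10000000001);  // in cache: replay proceeds
        replay_metablob_dir(0x1000009ba45);  // missing: prints the error
    }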

History

#2 Updated by Nathan Cutler about 3 years ago

  • Project changed from Ceph to CephFS
  • Subject changed from [ceph-mds]standby-replay mds abort when replay metablob to standby-replay mds abort when replay metablob
  • Component(FS) MDS added

#3 Updated by Patrick Donnelly about 3 years ago

  • Subject changed from standby-replay mds abort when replay metablob to mds: standby-replay mds abort when replay metablob
  • Status changed from New to Triaged
  • Assignee set to Jos Collin

#4 Updated by Jos Collin over 2 years ago

  • Status changed from Triaged to Need More Info

Hi haitao,

I don't see a segmentation fault here, and the attached logs don't have more information about the crash. Checking the code against the attached logs, this looks like a normal respawn() scenario when there is a failure replaying the journal. Could you please attach the complete MDS logs from the time the crash occurred, along with any other relevant information?
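
To illustrate that scenario, here is a minimal C++ sketch of the damaged/respawn flow visible at the end of the log (simplified stand-in names, not Ceph's actual API): on an unrecoverable replay error the daemon requests down:damaged and then respawns itself, which is a deliberate shutdown path rather than a segmentation fault.

    #include <iostream>
    #include <string>

    // Simplified stand-in for the MDS daemon (illustrative only).
    struct MDSDaemon {
        std::string want_state = "up:standby-replay";

        void set_want_state(const std::string& s) {
            std::cout << "set_want_state: " << want_state
                      << " -> " << s << "\n";
            want_state = s;  // a beacon with the new state goes to the mons
        }

        void respawn() {
            // The real daemon re-execs itself so it comes back as a
            // fresh standby; here we only log the event.
            std::cout << "respawn!\n";
        }

        // Entered when journal replay hits an unrecoverable error,
        // e.g. the missing-dir EMetaBlob from the description.
        void damaged() {
            set_want_state("down:damaged");
            respawn();
        }
    };

    int main() {
        MDSDaemon mds;
        mds.damaged();  // mirrors the last lines of the attached log
    }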

I saw a similar issue [1] on the ceph-users mailing list last year, but that appears to be a different cluster (different version).
[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/N63T7KAZ4CIGT5OEH4STV67KJPJA4JTN/

#5 Updated by Jos Collin over 2 years ago

  • Status changed from Need More Info to Closed

No updates from haitao yet; closing this.
