Bug #44023 (open): MDS continuously crashing on v14.2.7

Added by Michael Sudnick about 4 years ago. Updated about 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have max_mds set to 2, though I have tried fiddling with the values since hitting the crash. ceph status shows the MDS daemons cycling through reconnect and rejoin and then crashing, in a loop. I've attached a log which I think captures the relevant crash from one of my MDS daemons. The crashes happen with or without the keyring entries in ceph.conf; I added those while debugging to get past an earlier issue that suggested an auth problem.
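
For reference, a two-active-MDS layout like this is normally configured with the ceph fs set command rather than in ceph.conf, and per-daemon keyring entries take roughly the form below. The daemon names and paths here are illustrative assumptions, not copied from the attached ceph.conf:

# set the number of active MDS ranks for the filesystem
ceph fs set cephfs max_mds 2

# illustrative ceph.conf keyring entries of the kind mentioned above
[mds.ceph0]
keyring = /var/lib/ceph/mds/ceph-ceph0/keyring
[mds.ceph1]
keyring = /var/lib/ceph/mds/ceph-ceph1/keyring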


Files

ceph.conf (813 Bytes) - Michael Sudnick, 02/06/2020 10:19 PM
ceph-mds.ceph0.log (54.9 KB) - Michael Sudnick, 02/06/2020 10:24 PM
ceph-mon.ceph4.log.post (59.1 KB) - Michael Sudnick, 02/08/2020 05:15 PM
Actions #1

Updated by Michael Sudnick about 4 years ago

I have tried resetting the MDS map, to no avail. I have also tried failing the filesystem and then setting it joinable again, without success.
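
For reference, the steps described above correspond roughly to the following Nautilus-era commands; the exact invocations are an assumption, not quoted from the reporter, and the filesystem name is taken from the output in the next comment:

# take the filesystem down, then allow MDS daemons to join it again
ceph fs fail cephfs
ceph fs set cephfs joinable true

# reset the MDS map (destructive; requires the confirmation flag)
ceph fs reset cephfs --yes-i-really-mean-it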

Actions #2

Updated by Michael Sudnick about 4 years ago

It looks like the MDSes are not being assigned a rank when they come up; ceph fs get cephfs shows:
Filesystem 'cephfs' (5)
fs_name cephfs
epoch 253032
flags 12
created 2020-02-06 19:12:11.351844
modified 2020-02-07 11:02:51.328320
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
min_compat_client -1 (unspecified)
last_failure 0
last_failure_osd_epoch 459380
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=302606143}
failed
damaged
stopped 1
data_pools [68]
metadata_pool 67
inline_data disabled
balancer
standby_count_wanted 1
302606143: [v2:10.0.151.0:6832/2622467401,v1:10.0.151.0:6833/2622467401] 'ceph1' mds.0.253029 up:rejoin seq 5 laggy since 2020-02-07 11:02:51.328271
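
For anyone following along, whether a rank actually gets assigned while the daemons cycle can be checked with the standard status commands (filesystem name as in the output above):

ceph mds stat
ceph fs status cephfs

# dump the full FSMap, as pasted above, including the epoch as it changes across restarts
ceph fs dump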

Actions #3

Updated by Michael Sudnick about 4 years ago

Rolling back to 14.2.6 did not fix the issue.

Actions #5

Updated by Michael Sudnick about 4 years ago

I managed to recover by adding wipe_sessions to ceph.conf; sorry for the false alarm. This can be closed.
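
For completeness, the recovery setting presumably looked something like the ceph.conf entry below; the full option name is mds_wipe_sessions, and the exact spelling and section shown here are an assumption rather than a quote from the reporter's file. It is a debugging/recovery option, so it should be removed again once the MDS is healthy:

[mds]
# assumed form of the wipe_sessions entry mentioned above
mds wipe sessions = true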

Actions #6

Updated by Patrick Donnelly about 4 years ago

  • Project changed from Ceph to CephFS