Bug #64597

open

MDS Crashing Repeatedly in UP:Replay (Failed Assert)

Added by Gavin Baker 2 months ago. Updated 2 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Came in after the weekend and found that all of our active/standby MDS daemons had crashed. The MDS seems to get past journal recovery:

Feb 27 05:48:55 ceph-mon1 ceph-mds[275320]: mds.0.log Journal 0x200 recovered.
Feb 27 05:48:55 ceph-mon1 ceph-mds[275320]: mds.0.log Recovered journal 0x200 in format 1

It then crashes somewhere around "Sending beacon up:replay seq 4" with an assert failure:

Feb 27 05:49:27 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage:  total 484892, rss 208404, heap 207132, baseline 182556, 0 / 298 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:28 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage:  total 484892, rss 208404, heap 207132, baseline 182556, 0 / 501 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:29 ceph-mon1 ceph-mds[275995]: mds.beacon.our-cephfs.ceph-mon1.hpysoq Sending beacon up:replay seq 3
Feb 27 05:49:29 ceph-mon1 ceph-mds[275995]: mds.beacon.our-cephfs.ceph-mon1.hpysoq received beacon reply up:replay seq 3 rtt 0.00100001
Feb 27 05:49:29 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage:  total 487964, rss 211632, heap 207132, baseline 182556, 0 / 2879 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:30 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage:  total 493084, rss 216384, heap 207132, baseline 182556, 0 / 3924 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:31 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage:  total 499228, rss 223248, heap 207132, baseline 182556, 0 / 3662 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:32 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage:  total 499228, rss 223248, heap 207132, baseline 182556, 0 / 2207 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:33 ceph-mon1 ceph-mds[275995]: mds.beacon.our-cephfs.ceph-mon1.hpysoq Sending beacon up:replay seq 4
Feb 27 05:49:33 ceph-mon1 ceph-mds[275995]: mds.beacon.our-cephfs.ceph-mon1.hpysoq received beacon reply up:replay seq 4 rtt 0.00100001
Feb 27 05:49:33 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage:  total 499228, rss 223248, heap 207132, baseline 182556, 0 / 3361 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:34 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage:  total 499228, rss 223248, heap 207132, baseline 182556, 0 / 1494 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:35 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage:  total 499228, rss 223248, heap 207132, baseline 182556, 0 / 1035 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:35 ceph-mon1 ceph-5e958f94-22fb-11eb-934f-0c42a12ce4f3-mds-our-cephfs-ceph-mon1-hpysoq[275972]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)
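
If more detail is needed, MDS debug logging can be raised before the next replay attempt so the journal event that trips the assert shows up in the log, then dropped back afterwards. The values below are just examples of the standard debug knobs:

# ceph config set mds debug_mds 20
# ceph config set mds debug_journaler 10
... restart the MDS and capture the log ...
# ceph config set mds debug_mds 1/5
# ceph config set mds debug_journaler 1/5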

------
CephFS Status:

# ceph fs status
our-cephfs - 140 clients
==========
RANK      STATE                  MDS              ACTIVITY   DNS    INOS   DIRS   CAPS  
 0    replay(laggy)  our-cephfs.ceph-mon1.hpysoq            24.6k  3638    530      0   
           POOL              TYPE     USED  AVAIL  
  cephfs.our-cephfs.meta   metadata  1500G   851T  
  cephfs.our-cephfs.data     data       0    851T  
cephfs.our-ec-cephfs.data    data     579T  1824T  
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)

------
CephFS Dump:

# ceph fs dump
e645003
enable_multiple, ever_enabled_multiple: 0,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'our-cephfs' (1)
fs_name    our-cephfs
epoch    645003
flags    12 joinable allow_snaps allow_multimds_snaps
created    2020-11-12T22:36:24.252526+0000
modified    2024-02-27T13:51:06.040460+0000
tableserver    0
root    0
session_timeout    60
session_autoclose    300
max_file_size    50000000000000
required_client_features    {}
last_failure    0
last_failure_osd_epoch    284724
compat    compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds    1
in    0
up    {0=2626744}
failed    
damaged    
stopped    
data_pools    [4,5]
metadata_pool    3
inline_data    disabled
balancer    
standby_count_wanted    1
[mds.our-cephfs.ceph-mon1.hpysoq{0:2626744} state up:replay seq 1 laggy since 2024-02-27T13:51:06.040439+0000 join_fscid=1 addr [v2:10.0.5.1:6800/3107480305,v1:10.0.5.1:6801/3107480305] compat {c=[1],r=[1],i=[7ff]}]

dumped fsmap epoch 645003


Files

assert-dump.log (32.8 KB) Gavin Baker, 02/27/2024 08:51 PM
#1

Updated by Gavin Baker 2 months ago

The full assert section of the MDS logs shows these interesting lines:

Feb 27 12:35:14 ceph-mon1 ceph-5e958f94-22fb-11eb-934f-0c42a12ce4f3-mds-our-cephfs-ceph-mon1-hpysoq[422106]:  -9999> 2024-02-27T20:35:14.548+0000 7fc0da01c700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7fc0da01c700 time 2024-02-27T20:35:14.548959+0000
Feb 27 12:35:14 ceph-mon1 ceph-5e958f94-22fb-11eb-934f-0c42a12ce4f3-mds-our-cephfs-ceph-mon1-hpysoq[422106]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)

"-9999> 2024-02-27T20:35:14.548+0000" seems to cause the failed assert? I've attached the log dump of the asserts that happen after the "received beacon reply up:replay seq 4".

#2

Updated by Gavin Baker 2 months ago

It looks like the journal integrity check is fine:


# cephfs-journal-tool --rank=our-cephfs:all journal inspect
Overall journal integrity: OK
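
Before attempting any journal manipulation it should also be possible to take a backup of the rank 0 journal and dump an event summary with cephfs-journal-tool, along these lines (the output path is just an example):

# cephfs-journal-tool --rank=our-cephfs:0 journal export /root/our-cephfs.rank0.journal.bin
# cephfs-journal-tool --rank=our-cephfs:0 event get summary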

