Bug #64597
MDS Crashing Repeatedly in UP:Replay (Failed Assert)
Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Description
Came in after the weekend and found that all of our active/standby MDS daemons had crashed. They seem to get past journal recovery:
Feb 27 05:48:55 ceph-mon1 ceph-mds[275320]: mds.0.log Journal 0x200 recovered.
Feb 27 05:48:55 ceph-mon1 ceph-mds[275320]: mds.0.log Recovered journal 0x200 in format 1
Then each one crashes with an assert failure somewhere around "Sending beacon up:replay seq 4":
Feb 27 05:49:27 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage: total 484892, rss 208404, heap 207132, baseline 182556, 0 / 298 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:28 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage: total 484892, rss 208404, heap 207132, baseline 182556, 0 / 501 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:29 ceph-mon1 ceph-mds[275995]: mds.beacon.our-cephfs.ceph-mon1.hpysoq Sending beacon up:replay seq 3
Feb 27 05:49:29 ceph-mon1 ceph-mds[275995]: mds.beacon.our-cephfs.ceph-mon1.hpysoq received beacon reply up:replay seq 3 rtt 0.00100001
Feb 27 05:49:29 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage: total 487964, rss 211632, heap 207132, baseline 182556, 0 / 2879 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:30 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage: total 493084, rss 216384, heap 207132, baseline 182556, 0 / 3924 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:31 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage: total 499228, rss 223248, heap 207132, baseline 182556, 0 / 3662 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:32 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage: total 499228, rss 223248, heap 207132, baseline 182556, 0 / 2207 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:33 ceph-mon1 ceph-mds[275995]: mds.beacon.our-cephfs.ceph-mon1.hpysoq Sending beacon up:replay seq 4
Feb 27 05:49:33 ceph-mon1 ceph-mds[275995]: mds.beacon.our-cephfs.ceph-mon1.hpysoq received beacon reply up:replay seq 4 rtt 0.00100001
Feb 27 05:49:33 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage: total 499228, rss 223248, heap 207132, baseline 182556, 0 / 3361 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:34 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage: total 499228, rss 223248, heap 207132, baseline 182556, 0 / 1494 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:35 ceph-mon1 ceph-mds[275995]: mds.0.cache Memory usage: total 499228, rss 223248, heap 207132, baseline 182556, 0 / 1035 inodes have caps, 0 caps, 0 caps per inode
Feb 27 05:49:35 ceph-mon1 ceph-5e958f94-22fb-11eb-934f-0c42a12ce4f3-mds-our-cephfs-ceph-mon1-hpysoq[275972]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)
------
CephFS Status:
# ceph fs status
our-cephfs - 140 clients
==========
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 replay(laggy) our-cephfs.ceph-mon1.hpysoq 24.6k 3638 530 0
POOL TYPE USED AVAIL
cephfs.our-cephfs.meta metadata 1500G 851T
cephfs.our-cephfs.data data 0 851T
cephfs.our-ec-cephfs.data data 579T 1824T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
------
CephFS Dump:
# ceph fs dump
e645003
enable_multiple, ever_enabled_multiple: 0,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1
Filesystem 'our-cephfs' (1)
fs_name our-cephfs
epoch 645003
flags 12 joinable allow_snaps allow_multimds_snaps
created 2020-11-12T22:36:24.252526+0000
modified 2024-02-27T13:51:06.040460+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 50000000000000
required_client_features {}
last_failure 0
last_failure_osd_epoch 284724
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=2626744}
failed
damaged
stopped
data_pools [4,5]
metadata_pool 3
inline_data disabled
balancer
standby_count_wanted 1
[mds.our-cephfs.ceph-mon1.hpysoq{0:2626744} state up:replay seq 1 laggy since 2024-02-27T13:51:06.040439+0000 join_fscid=1 addr [v2:10.0.5.1:6800/3107480305,v1:10.0.5.1:6801/3107480305] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 645003
Files
Updated by Gavin Baker 2 months ago
- File assert-dump.log added
The full assert section of the MDS logs shows this interesting line:
Feb 27 12:35:14 ceph-mon1 ceph-5e958f94-22fb-11eb-934f-0c42a12ce4f3-mds-our-cephfs-ceph-mon1-hpysoq[422106]: -9999> 2024-02-27T20:35:14.548+0000 7fc0da01c700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7fc0da01c700 time 2024-02-27T20:35:14.548959+0000
Feb 27 12:35:14 ceph-mon1 ceph-5e958f94-22fb-11eb-934f-0c42a12ce4f3-mds-our-cephfs-ceph-mon1-hpysoq[422106]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)
Is the entry at "-9999> 2024-02-27T20:35:14.548+0000" the source of the failed assert? I've attached a log dump of the asserts that occur after "received beacon reply up:replay seq 4".
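For context on what that assertion guards: in interval_set.h, erase() expects the erased range to be fully covered by one existing interval, and `p->first <= start` is the lower-bound half of that check. The following is a minimal Python sketch of that invariant — illustrative only, not Ceph code; the class and names here are hypothetical simplifications:

```python
# Hedged sketch (not Ceph code): a toy interval set illustrating the
# invariant behind `ceph_assert(p->first <= start)` in interval_set.h.
# Erasing a range that no stored interval covers trips the assertion,
# analogous to the MDS crash above.

class ToyIntervalSet:
    def __init__(self):
        self.m = {}  # start -> length, non-overlapping intervals

    def insert(self, start, length):
        # Simplified: assumes the caller never inserts overlapping ranges.
        self.m[start] = length

    def erase(self, start, length):
        # Find the interval that could contain `start`.
        candidates = [s for s in self.m if s <= start]
        # Mirrors `ceph_assert(p->first <= start)`: some interval must
        # begin at or before the start of the erased range.
        assert candidates, "erased range not covered by any interval"
        s = max(candidates)
        l = self.m[s]
        # The erased range must also end inside that interval.
        assert s + l >= start + length, "erased range extends past interval"
        # Split the containing interval around the erased range.
        del self.m[s]
        if s < start:
            self.m[s] = start - s
        if s + l > start + length:
            self.m[start + length] = (s + l) - (start + length)

iv = ToyIntervalSet()
iv.insert(100, 50)   # covers [100, 150)
iv.erase(110, 10)    # fine: [110, 120) is fully covered
try:
    iv.erase(10, 5)  # not covered -> assertion, like the MDS assert
except AssertionError as e:
    print("assert:", e)
```

If something (e.g. a corrupt session table or purge queue entry replayed from the journal) asks the MDS to erase an inode range the set does not actually hold, this is the assert that fires.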
Updated by Gavin Baker 2 months ago
It looks like the journal integrity check is fine:
# cephfs-journal-tool --rank=our-cephfs:all journal inspect
Overall journal integrity: OK