Bug #63806


ffsb.sh workunit failure (MDS: std::out_of_range, damaged)

Added by Venky Shankar 5 months ago. Updated 4 months ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy,reef
Regression:
No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2023-12-06_15:12:46-fs-wip-vshankar-testing-20231206.125818-testing-default-smithi/7480362

The test runs a periodic `flush journal` command, and the ceph CLI core dumps:

2023-12-07T00:26:58.507 INFO:teuthology.task.background_exec.ubuntu@smithi105.front.sepia.ceph.com.smithi105.stderr:terminate called after throwing an instance of 'std::out_of_range'
2023-12-07T00:26:58.507 INFO:teuthology.task.background_exec.ubuntu@smithi105.front.sepia.ceph.com.smithi105.stderr:  what():  map::at
2023-12-07T00:26:59.074 INFO:teuthology.task.background_exec.ubuntu@smithi105.front.sepia.ceph.com.smithi105.stderr:bash: line 1: 168811 Aborted                 (core dumped) ceph tell mds.cephfs:0 flush journal

The `std::out_of_range` hints that there are no active MDSs to which the `flush journal` (tell) command can be sent.
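The `what(): map::at` message points at an unguarded `std::map::at()` lookup. A minimal sketch of the suspected pattern, assuming the CLI resolves `mds.cephfs:0` by indexing a rank-to-daemon map (the map name and helper below are hypothetical, not Ceph's actual code):

```cpp
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical sketch: with no active MDS holding rank 0, calling
// active_by_rank.at(0) throws std::out_of_range ("map::at"); uncaught,
// that aborts the CLI exactly as in the log above. A checked lookup
// returns nullopt instead of throwing.
std::optional<std::string> resolve_rank_checked(
    const std::map<int, std::string>& active_by_rank, int rank) {
  auto it = active_by_rank.find(rank);
  if (it == active_by_rank.end())
    return std::nullopt;  // no active MDS holds this rank
  return it->second;      // daemon name/GID for the rank
}
```

With the checked variant the CLI could report "no active MDS for rank 0" rather than dumping core.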

FWIW, the standby-replay daemon is also getting marked as damaged. Its journal replay hits a short read, which is treated as an error:

2023-12-07T00:26:44.881+0000 7f40f5a1f700  1 -- [v2:172.21.15.181:6838/824276348,v1:172.21.15.181:6839/824276348] --> [v2:172.21.15.181:6816/602100546,v1:172.21.15.181:6817/602100546] -- osd_op(unknown.0.0:1178 2.b 2:d5c7a900:::200.00000003:head [read 44891~4149413 [fadvise_dontneed]] snapc 0=[] ondisk+read+known_if_redirected+full_force+supports_pool_eio e85) v8 -- 0x555d6d62e380 con 0x555d6d7d0400
2023-12-07T00:26:44.960+0000 7f4100234700  1 -- [v2:172.21.15.181:6838/824276348,v1:172.21.15.181:6839/824276348] <== osd.8 v2:172.21.15.181:6816/602100546 187 ==== osd_op_reply(1178 200.00000003 [read 44891~8887 [fadvise_dontneed] out=8887b] v0'0 uv35 ondisk = 0) v8 ==== 156+0+8887 (crc 0 0 0) 0x555d6c9aafc0 con 0x555d6d7d0400
2023-12-07T00:26:44.960+0000 7f40f7222700  0 mds.24469.journaler.mdlog(ro) _finish_read got less than expected (4149413)
2023-12-07T00:26:44.960+0000 7f40f5a1f700  0 mds.0.log _replay journaler got error -22, aborting

And then later the standby-replay MDS gets marked as damaged:

2023-12-07T00:26:45.410+0000 7f40f5a1f700 10 mds.0.log  maybe trim LogSegment(3824/0xc06c0c events=8)
2023-12-07T00:26:45.410+0000 7f40f5a1f700 10 mds.0.log  won't remove, not expired!
2023-12-07T00:26:45.410+0000 7f40f5a1f700 20 mds.0.log  calling mdcache->trim!
2023-12-07T00:26:45.410+0000 7f40f5a1f700  7 mds.0.cache trim bytes_used=1MB limit=4GB reservation=0.05% count=0
2023-12-07T00:26:45.410+0000 7f40f5a1f700  7 mds.0.cache trim_lru trimming 0 items from LRU size=915 mid=560 pintail=0 pinned=95
2023-12-07T00:26:45.410+0000 7f40f5a1f700 20 mds.0.cache bottom_lru: 0 items, 0 top, 0 bot, 0 pintail, 0 pinned
2023-12-07T00:26:45.410+0000 7f40f5a1f700 20 mds.0.cache lru: 915 items, 560 top, 355 bot, 0 pintail, 95 pinned
2023-12-07T00:26:45.410+0000 7f40f5a1f700  7 mds.0.cache trim_lru trimmed 0 items
2023-12-07T00:26:45.410+0000 7f40f5a1f700 10 mds.0.log _replay_thread kicking waiters
2023-12-07T00:26:45.410+0000 7f40f5a1f700 10 MDSContext::complete: 15C_MDS_BootStart
2023-12-07T00:26:45.410+0000 7f40f5a1f700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 0: (22) Invalid argument
2023-12-07T00:26:45.410+0000 7f40f5a1f700  5 mds.beacon.i set_want_state: up:standby-replay -> down:damaged
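The chain in the logs above can be sketched as follows. This is a hedged reconstruction, not the actual `Journaler` code: the read of 4149413 bytes at offset 44891 returns only 8887 bytes, the short read is surfaced as `-EINVAL` (-22), replay aborts, and the boot-start error path drives the rank to `down:damaged`:

```cpp
#include <cerrno>
#include <cstdint>

// Hypothetical sketch of "_finish_read got less than expected": a read
// that returns fewer bytes than the journaler expected is mapped to
// -EINVAL (-22), which is the error mds.0.log reports before aborting
// replay and marking the standby-replay rank damaged.
int finish_read(uint64_t wanted, uint64_t got) {
  if (got < wanted)
    return -EINVAL;  // short read: "got less than expected"
  return 0;          // full read; replay continues
}
```

The open question is why the standby-replay journaler expected far more bytes (4149413) than the object actually held, i.e. whether its view of the journal's write position was stale.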

This needs a root cause analysis.


Related issues (1 open, 0 closed)

Related to CephFS - Bug #59119: mds: segmentation fault during replay of snaptable updates (Status: New, Assignee: Venky Shankar)
