Bug #59119
mds: segmentation fault during replay of snaptable updates
Status:
New
Priority:
High
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Q/A
Tags:
Backport:
quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
For a standby-replay daemon:
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.log _replay 4199266~121 / 4200313 2023-03-17T00:50:00.527030+0000: ETableServer snaptable prepare reqid 2 mds.0 tid 2 version 2 mutation=57 bytes
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.journal ETableServer.replay snaptable prepare event 2 - 1 == table 1
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.snap prepare v2 create snap(2 0x1 'scheduled-2023-03-17-00_50_00_UTC' 2023-03-17T00:50:00.526971+0000)
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.log _replay 4199407~886 / 4200313 2023-03-17T00:50:00.529078+0000: EUpdate mksnap [metablob table_tids=[^A,2]]
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.journal EMetaBlob.replay 0 dirlumps by unknown.0
2023-03-17T00:50:01.288+0000 7f7d86761700 20 mds.0.cache.ino(0x1) decode_snap_blob snaprealm(0x1 seq 2 lc 2 cr 2 cps 1 snaps={2=snap(2 0x1 'scheduled-2023-03-17-00_50_00_UTC' 2023-03-17T00:50:00.526288+0000)} last_modified 2023-03-17T00:50:00.526288+0000 change_attr 1 0x560d0156a480)
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.journal EMetaBlob.replay updated root [inode 0x1 [...2,head] / auth v4 snaprealm=0x560d0156a480 f(v0 m2023-03-17T00:49:55.917230+0000 1=0+1) n(v0 rc2023-03-17T00:50:00.526288+0000 rs1 2=0+2)/n(v0 rc2023-03-17T00:28:01.024379+0000 1=0+1) (iversion lock) | dirfrag=1 dirty=1 0x560d01567600]
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.journal EMetaBlob.replay noting snaptable transaction 2
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.tableclient(snaptable) got_journaled_agree 2
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.snapclient notify_commit tid 2
...
2023-03-17T00:50:03.292+0000 7f7d86761700 10 mds.0.log _replay 4200313~64 / 4203576 2023-03-17T00:50:00.531089+0000: ETableServer snaptable commit tid 2 version 3
2023-03-17T00:50:03.292+0000 7f7d86761700 10 mds.0.journal ETableServer.replay snaptable commit event 3 - 1 == table 2
2023-03-17T00:50:03.292+0000 7f7d86761700 7 mds.0.snap commit 2 create snap(2 0x1 'scheduled-2023-03-17-00_50_00_UTC' 2023-03-17T00:50:00.526971+0000)
...
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 MDSContext::complete: N7MDSRank32C_MDS_StandbyReplayRestartFinishE
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log standby_trim_segments
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log expire_pos=4202684
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log segment seq=1 4194304~8380
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log removing segment
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.dir(0x10000000000) mark_clean [dir 0x10000000000 /client.0/ [2,head] auth v=1 cv=0/0 state=1610612737|complete f() n() hs=0+0,ss=0+0 | dirty=1 0x560d00601600] version 1
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.dir(0x1) mark_clean [dir 0x1 / [2,head] auth v=5 cv=0/0 dir_auth=0 state=1610612736 f(v0 m2023-03-17T00:49:55.917230+0000 1=0+1) n(v0 rc2023-03-17T00:49:55.922230+0000 1=0+1) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 dirty=1 0x560d00600d00] version 5
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.ino(0x1) mark_clean [inode 0x1 [...3,head] / auth v5 snaprealm=0x560d0156a480 f(v0 m2023-03-17T00:49:55.917230+0000 1=0+1) n(v0 rc2023-03-17T00:50:00.526288+0000 rs1 2=0+2)/n(v0 rc2023-03-17T00:28:01.024379+0000 1=0+1) old_inodes=1 (iversion lock) | dirfrag=1 dirty=1 0x560d01567600]
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.ino(0x10000000000) mark_clean [inode 0x10000000000 [...2,head] /client.0/ auth v4 DIRTYPARENT f() n(v0 rc2023-03-17T00:49:55.922230+0000 1=0+1) (iversion lock) | dirfrag=0 dirtyparent=1 dirty=1 0x560d01568c00]
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.den(0x1 client.0) mark_clean [dentry #0x1/client.0 [2,head] auth (dversion lock) v=4 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x560d012e8500]
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.ino(0x10000000000) clear_dirty_parent
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log segment seq=12 4202684~892
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log won't remove, not expired!
2023-03-17T00:50:04.293+0000 7f7d87f64700 20 mds.0.log calling mdcache->trim!
2023-03-17T00:50:04.293+0000 7f7d87f64700 7 mds.0.cache trim bytes_used=14kB limit=4GB reservation=0.05% count=0
2023-03-17T00:50:04.293+0000 7f7d87f64700 7 mds.0.cache trim_lru trimming 0 items from LRU size=1 mid=0 pintail=0 pinned=0
2023-03-17T00:50:04.293+0000 7f7d87f64700 7 mds.0.cache trim_lru trimmed 0 items
2023-03-17T00:50:04.293+0000 7f7d87f64700 2 mds.0.0 Booting: 2: replaying mds log
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log replay start, from 4203576 to 4203636
2023-03-17T00:50:04.293+0000 7f7d86761700 10 mds.0.log _replay_thread start
2023-03-17T00:50:04.293+0000 7f7d86761700 1 -- [v2:172.21.15.146:6838/2651546611,v1:172.21.15.146:6839/2651546611] --> [v2:172.21.15.146:6824/672977052,v1:172.21.15.146:6825/672977052] -- osd_op(unknown.0.0:2692 2.14 2:2e2fa760:::200.00000001:head [read 9272~60 [fadvise_dontneed]] snapc 0=[] ondisk+read+known_if_redirected+full_force+supports_pool_eio e91) v8 -- 0x560d015cb800 con 0x560d01536c00
2023-03-17T00:50:04.294+0000 7f7d91777700 1 -- [v2:172.21.15.146:6838/2651546611,v1:172.21.15.146:6839/2651546611] <== osd.10 v2:172.21.15.146:6824/672977052 2512 ==== osd_op_reply(2692 200.00000001 [read 9272~60 [fadvise_dontneed] out=60b] v0'0 uv22 ondisk = 0) v8 ==== 156+0+60 (crc 0 0 0) 0x560d0156a000 con 0x560d01536c00
2023-03-17T00:50:04.294+0000 7f7d86761700 10 mds.0.log _replay 4203576~40 / 4203636 2023-03-17T00:50:02.359670+0000: ETableClient snaptable ack tid 2
2023-03-17T00:50:04.294+0000 7f7d86761700 10 mds.0.journal ETableClient.replay snaptable op ack tid 2
2023-03-17T00:50:04.294+0000 7f7d86761700 10 mds.0.tableclient(snaptable) got_journaled_ack 2
From: /teuthology/pdonnell-2023-03-16_23:18:17-fs:workload-wip-pdonnell-testing-20230316.214637-distro-default-smithi/7211196/remote/smithi146/log/1a80245e-c457-11ed-9afb-001a4aab830c/ceph-mds.b.log.gz
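To make the failure sequence in the log concrete: during replay, each journaled table agreement pins its tid to the LogSegment that carried it (the "got_journaled_agree 2" line), and the later ack replay resolves that tid against the same segment. Below is a minimal, self-contained C++ sketch of that bookkeeping; it is not the actual Ceph code. LogSegment::pending_commit_tids, got_journaled_agree(), and got_journaled_ack() echo upstream names, but everything else (the pending map, the main() timeline) is simplified for illustration.

    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <set>

    using version_t = uint64_t;

    struct LogSegment {
      // table id -> tids journaled as "agreed" but not yet acked
      std::map<int, std::set<version_t>> pending_commit_tids;
    };

    struct MDSTableClient {
      int table;
      // tid -> segment that recorded the agree; the segment must
      // outlive the pending tid for the ack to be safe to replay
      std::map<version_t, LogSegment*> pending;

      // replay of "EMetaBlob.replay noting snaptable transaction <tid>"
      void got_journaled_agree(version_t tid, LogSegment* ls) {
        ls->pending_commit_tids[table].insert(tid);
        pending[tid] = ls;
      }

      // replay of "ETableClient snaptable ack tid <tid>"
      void got_journaled_ack(version_t tid) {
        auto it = pending.find(tid);
        assert(it != pending.end());
        // if the segment was already trimmed, this dereferences
        // freed memory -> the reported segfault
        it->second->pending_commit_tids[table].erase(tid);
        pending.erase(it);
      }
    };

    int main() {
      MDSTableClient snapclient{/*table=*/0, {}};
      auto* seg = new LogSegment;               // segment seq=1 in the log
      snapclient.got_journaled_agree(2, seg);   // EUpdate mksnap replay
      delete seg;                               // standby_trim_segments: "removing segment"
      snapclient.got_journaled_ack(2);          // ack replay -> use-after-free
    }

Under this model, the "removing segment" line in the second excerpt corresponds to the delete, and the final "got_journaled_ack 2" touches the freed segment.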
The LogSegment storing the pending commit tids was expired/removed. This happened because rank 0 received a "flush journal" request (part of the new snapshot stress testing in fs:workload). The standby-replay daemon incorrectly assumes it can continue replay even after every segment it holds in memory has been expired; when replay then processed the ETableClient ack for tid 2, got_journaled_ack referenced the already-trimmed segment, causing the segmentation fault.
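Building on the sketch above, one way to make the invariant explicit (purely illustrative, an assumption rather than the upstream fix) would be for the standby trim path to refuse to expire any segment that a table client still has tids pinned to:

    // Hypothetical guard for the standby trim path (illustrative only):
    // a segment with pending table tids may still receive an ack/commit
    // during replay, so it is not safe to remove yet.
    bool can_trim(const LogSegment& seg) {
      for (const auto& [tbl, tids] : seg.pending_commit_tids)
        if (!tids.empty())
          return false;
      return true;
    }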