
Bug #59119

mds: segmentation fault during replay of snaptable updates

Added by Patrick Donnelly 11 months ago. Updated 3 months ago.

Status:
New
Priority:
High
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%

Source:
Q/A
Tags:
Backport:
quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

For a standby-replay daemon:

2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.log _replay 4199266~121 / 4200313 2023-03-17T00:50:00.527030+0000: ETableServer snaptable prepare reqid 2 mds.0 tid 2 version 2 mutation=57 bytes
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.journal  ETableServer.replay snaptable prepare event 2 - 1 == table 1
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.snap prepare v2 create snap(2 0x1 'scheduled-2023-03-17-00_50_00_UTC' 2023-03-17T00:50:00.526971+0000)
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.log _replay 4199407~886 / 4200313 2023-03-17T00:50:00.529078+0000: EUpdate mksnap [metablob table_tids=[^A,2]]
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.journal EMetaBlob.replay 0 dirlumps by unknown.0
2023-03-17T00:50:01.288+0000 7f7d86761700 20 mds.0.cache.ino(0x1) decode_snap_blob snaprealm(0x1 seq 2 lc 2 cr 2 cps 1 snaps={2=snap(2 0x1 'scheduled-2023-03-17-00_50_00_UTC' 2023-03-17T00:50:00.526288+0000)} last_modified 2023-03-17T00:50:00.526288+0000 change_attr 1 0x560d0156a480)
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.journal EMetaBlob.replay  updated root [inode 0x1 [...2,head] / auth v4 snaprealm=0x560d0156a480 f(v0 m2023-03-17T00:49:55.917230+0000 1=0+1) n(v0 rc2023-03-17T00:50:00.526288+0000 rs1 2=0+2)/n(v0 rc2023-03-17T00:28:01.024379+0000 1=0+1) (iversion lock) | dirfrag=1 dirty=1 0x560d01567600]
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.journal EMetaBlob.replay noting snaptable transaction 2
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.tableclient(snaptable) got_journaled_agree 2
2023-03-17T00:50:01.288+0000 7f7d86761700 10 mds.0.snapclient notify_commit tid 2
...
2023-03-17T00:50:03.292+0000 7f7d86761700 10 mds.0.log _replay 4200313~64 / 4203576 2023-03-17T00:50:00.531089+0000: ETableServer snaptable commit tid 2 version 3
2023-03-17T00:50:03.292+0000 7f7d86761700 10 mds.0.journal  ETableServer.replay snaptable commit event 3 - 1 == table 2
2023-03-17T00:50:03.292+0000 7f7d86761700  7 mds.0.snap commit 2 create snap(2 0x1 'scheduled-2023-03-17-00_50_00_UTC' 2023-03-17T00:50:00.526971+0000)
...
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 MDSContext::complete: N7MDSRank32C_MDS_StandbyReplayRestartFinishE
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log standby_trim_segments
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log  expire_pos=4202684
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log  segment seq=1 4194304~8380
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log  removing segment
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.dir(0x10000000000) mark_clean [dir 0x10000000000 /client.0/ [2,head] auth v=1 cv=0/0 state=1610612737|complete f() n() hs=0+0,ss=0+0 | dirty=1 0x560d00601600] version 1
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.dir(0x1) mark_clean [dir 0x1 / [2,head] auth v=5 cv=0/0 dir_auth=0 state=1610612736 f(v0 m2023-03-17T00:49:55.917230+0000 1=0+1) n(v0 rc2023-03-17T00:49:55.922230+0000 1=0+1) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 dirty=1 0x560d00600d00] version 5
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.ino(0x1) mark_clean [inode 0x1 [...3,head] / auth v5 snaprealm=0x560d0156a480 f(v0 m2023-03-17T00:49:55.917230+0000 1=0+1) n(v0 rc2023-03-17T00:50:00.526288+0000 rs1 2=0+2)/n(v0 rc2023-03-17T00:28:01.024379+0000 1=0+1) old_inodes=1 (iversion lock) | dirfrag=1 dirty=1 0x560d01567600]
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.ino(0x10000000000) mark_clean [inode 0x10000000000 [...2,head] /client.0/ auth v4 DIRTYPARENT f() n(v0 rc2023-03-17T00:49:55.922230+0000 1=0+1) (iversion lock) | dirfrag=0 dirtyparent=1 dirty=1 0x560d01568c00]
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.den(0x1 client.0) mark_clean [dentry #0x1/client.0 [2,head] auth (dversion lock) v=4 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x560d012e8500]
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.cache.ino(0x10000000000) clear_dirty_parent
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log  segment seq=12 4202684~892
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log  won't remove, not expired!
2023-03-17T00:50:04.293+0000 7f7d87f64700 20 mds.0.log  calling mdcache->trim!
2023-03-17T00:50:04.293+0000 7f7d87f64700  7 mds.0.cache trim bytes_used=14kB limit=4GB reservation=0.05% count=0
2023-03-17T00:50:04.293+0000 7f7d87f64700  7 mds.0.cache trim_lru trimming 0 items from LRU size=1 mid=0 pintail=0 pinned=0
2023-03-17T00:50:04.293+0000 7f7d87f64700  7 mds.0.cache trim_lru trimmed 0 items
2023-03-17T00:50:04.293+0000 7f7d87f64700  2 mds.0.0 Booting: 2: replaying mds log
2023-03-17T00:50:04.293+0000 7f7d87f64700 10 mds.0.log replay start, from 4203576 to 4203636
2023-03-17T00:50:04.293+0000 7f7d86761700 10 mds.0.log _replay_thread start
2023-03-17T00:50:04.293+0000 7f7d86761700  1 -- [v2:172.21.15.146:6838/2651546611,v1:172.21.15.146:6839/2651546611] --> [v2:172.21.15.146:6824/672977052,v1:172.21.15.146:6825/672977052] -- osd_op(unknown.0.0:2692 2.14 2:2e2fa760:::200.00000001:head [read 9272~60 [fadvise_dontneed]] snapc 0=[] ondisk+read+known_if_redirected+full_force+supports_pool_eio e91) v8 -- 0x560d015cb800 con 0x560d01536c00
2023-03-17T00:50:04.294+0000 7f7d91777700  1 -- [v2:172.21.15.146:6838/2651546611,v1:172.21.15.146:6839/2651546611] <== osd.10 v2:172.21.15.146:6824/672977052 2512 ==== osd_op_reply(2692 200.00000001 [read 9272~60 [fadvise_dontneed] out=60b] v0'0 uv22 ondisk = 0) v8 ==== 156+0+60 (crc 0 0 0) 0x560d0156a000 con 0x560d01536c00
2023-03-17T00:50:04.294+0000 7f7d86761700 10 mds.0.log _replay 4203576~40 / 4203636 2023-03-17T00:50:02.359670+0000: ETableClient snaptable ack tid 2
2023-03-17T00:50:04.294+0000 7f7d86761700 10 mds.0.journal  ETableClient.replay snaptable op ack tid 2
2023-03-17T00:50:04.294+0000 7f7d86761700 10 mds.0.tableclient(snaptable) got_journaled_ack 2

From: /teuthology/pdonnell-2023-03-16_23:18:17-fs:workload-wip-pdonnell-testing-20230316.214637-distro-default-smithi/7211196/remote/smithi146/log/1a80245e-c457-11ed-9afb-001a4aab830c/ceph-mds.b.log.gz

The LogSegment storing the pending commit tids was expired and removed. This happened because rank 0 received a "flush journal" request (part of the new snapshot stress testing in fs:workload). The standby-replay daemon wrongly assumes it can continue replay after all of its in-memory segments have been expired.


Related issues

Related to CephFS - Bug #54741: crash: MDSTableClient::got_journaled_ack(unsigned long) New
Related to CephFS - Bug #63806: ffsb.sh workunit failure (MDS: std::out_of_range, damaged) Closed

History

#1 Updated by Patrick Donnelly 11 months ago

  • Pull request ID set to 50615

#2 Updated by Patrick Donnelly 4 months ago

  • Status changed from In Progress to New
  • Assignee deleted (Patrick Donnelly)
  • Target version deleted (v19.0.0)
  • Backport deleted (reef,quincy,pacific)

#3 Updated by Patrick Donnelly 4 months ago

  • Pull request ID deleted (50615)

#4 Updated by Venky Shankar 4 months ago

Patrick, https://github.com/ceph/ceph/pull/50615 doesn't really fix the underlying issue and is just a fail-safe approach to restart the s-r daemon, correct? We do want to backport the change, I guess?

#5 Updated by Patrick Donnelly 4 months ago

Venky Shankar wrote:

Patrick, https://github.com/ceph/ceph/pull/50615 doesn't really fix the underlying issue and is just a fail-safe approach to restart the s-r daemon, correct? We do want to backport the change, I guess?

It doesn't change the behavior at all. It just adds extra debugging.

Yes, the current approach is to just restart the s-r daemon. It's mostly just an annoyance we see in QA because most folks are not sending "flush journal" commands to their MDS routinely.

#6 Updated by Patrick Donnelly 4 months ago

Patrick Donnelly wrote:

Venky Shankar wrote:

We do want to backport the change I guess??

No need. I think it may be helpful for debugging the next time we see it in main QA.

#7 Updated by Patrick Donnelly 3 months ago

  • Related to Bug #54741: crash: MDSTableClient::got_journaled_ack(unsigned long) added

#8 Updated by Venky Shankar 3 months ago

https://tracker.ceph.com/issues/54741 was seen in a cluster where it's unlikely that "flush journal" was run. It's possible that the standby-replay daemon's replay logic isn't handling certain edge cases, which could have led to this.

#9 Updated by Venky Shankar 3 months ago

  • Category set to Correctness/Safety
  • Assignee set to Venky Shankar
  • Target version set to v19.0.0
  • Backport set to quincy,reef

#10 Updated by Patrick Donnelly 2 months ago

  • Related to Bug #63806: ffsb.sh workunit failure (MDS: std::out_of_range, damaged) added
