What I know so far:
mds.b is the active MDS with rank 0:
260477:2023-12-07T00:19:18.208+0000 7fa052b6a700 7 mon.a@0(leader).log v300 update_from_paxos applying incremental log 300 2023-12-07T00:19:17.276369+0000 mon.a (mon.0) 995 : cluster [INF] daemon mds.b is now active in filesystem cephfs as rank 0
This is the fsmap (showing only the MDS data):
max_mds 3
in 0,1,2
up {0=24512,1=24479,2=24451}
failed
damaged
stopped
data_pools [3]
metadata_pool 2
inline_data disabled
balancer
bal_rank_mask -1
standby_count_wanted 0
[mds.b{0:24512} state up:active seq 3 addr [v2:172.21.15.136:6838/2030257219,v1:172.21.15.136:6839/2030257219] compat {c=[1],r=[1],i=[fff]}]
[mds.i{0:24469} state up:standby-replay seq 1 addr [v2:172.21.15.181:6838/824276348,v1:172.21.15.181:6839/824276348] compat {c=[1],r=[1],i=[fff]}]
[mds.h{1:24479} state up:active seq 6 addr [v2:172.21.15.136:6836/3363137736,v1:172.21.15.136:6837/3363137736] compat {c=[1],r=[1],i=[fff]}]
[mds.e{1:24455} state up:standby-replay seq 1 addr [v2:172.21.15.136:6834/3499967199,v1:172.21.15.136:6835/3499967199] compat {c=[1],r=[1],i=[fff]}]
[mds.c{2:24451} state up:active seq 7 addr [v2:172.21.15.181:6836/3909565630,v1:172.21.15.181:6837/3909565630] compat {c=[1],r=[1],i=[fff]}]
Standby daemons:
[mds.j{-1:14574} state up:standby seq 1 addr [v2:172.21.15.105:6834/462434151,v1:172.21.15.105:6835/462434151] compat {c=[1],r=[1],i=[fff]}]
[mds.d{-1:14586} state up:standby seq 1 addr [v2:172.21.15.105:6836/2965410210,v1:172.21.15.105:6837/2965410210] compat {c=[1],r=[1],i=[fff]}]
[mds.g{-1:14649} state up:standby seq 1 addr [v2:172.21.15.105:6838/913925081,v1:172.21.15.105:6839/913925081] compat {c=[1],r=[1],i=[fff]}]
[mds.a{-1:14661} state up:standby seq 1 addr [v2:172.21.15.105:6840/2502623746,v1:172.21.15.105:6841/2502623746] compat {c=[1],r=[1],i=[fff]}]
[mds.l{-1:24391} state up:standby seq 1 addr [v2:172.21.15.181:6832/482210475,v1:172.21.15.181:6833/482210475] compat {c=[1],r=[1],i=[fff]}]
[mds.k{-1:24431} state up:standby seq 1 addr [v2:172.21.15.136:6832/572280608,v1:172.21.15.136:6833/572280608] compat {c=[1],r=[1],i=[fff]}]
[mds.f{-1:24439} state up:standby seq 1 addr [v2:172.21.15.181:6834/3575479535,v1:172.21.15.181:6835/3575479535] compat {c=[1],r=[1],i=[fff]}]
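For reference, the fsmap above is the structure the mons hand out; it can be pulled live with something like the following (a minimal sketch, assuming only this cluster's fs name "cephfs"):

ceph fs dump            # full fsmap: epoch, ranks, states, standbys
ceph fs status cephfs   # condensed per-rank view via the mgr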
Then mds.b's state becomes null:
2023-12-07T00:26:45.444+0000 7f76427c6700 10 mds.b my gid is 24512
2023-12-07T00:26:45.444+0000 7f76427c6700 10 mds.b map says I am mds.-1.-1 state null
2023-12-07T00:26:45.444+0000 7f76427c6700 10 mds.b msgr says I am [v2:172.21.15.136:6838/2030257219,v1:172.21.15.136:6839/2030257219]
2023-12-07T00:26:45.444+0000 7f76427c6700 1 mds.b Map removed me [mds.b{0:24512} state up:active seq 3 export targets 1,2 addr [v2:172.21.15.136:6838/2030257219,v1:172.21.15.136:6839/2030257219] compat {c=[1],r=[1],i=[fff]}] from cluster; respawning! See cluster/monitor logs for details.
2023-12-07T00:26:45.444+0000 7f76427c6700 1 mds.b respawn!
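To figure out why the mon dropped gid 24512 from the map, the leader mon's log around that timestamp is the place to look. A hedged sketch (the fsid-based unit name matches this cephadm deployment; the grep pattern is just illustrative):

journalctl -u 'ceph-274c4118-9495-11ee-95a2-87774f69a715@mon.a' \
  --since '2023-12-07 00:26:40' --until '2023-12-07 00:26:50' \
  | grep -E '24512|mds\.b|damaged'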
Immediately, the mon reports an MDS daemon as damaged:
2023-12-07T00:26:45.722 INFO:journalctl@ceph.mon.a.smithi105.stdout:Dec 07 00:26:45 smithi105 ceph-274c4118-9495-11ee-95a2-87774f69a715-mon-a[98911]: 2023-12-07T00:26:45.416+0000 7fa05536f700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
2023-12-07T00:26:45.722 INFO:journalctl@ceph.mon.a.smithi105.stdout:Dec 07 00:26:45 smithi105 ceph-mon[98935]: pgmap v484: 97 pgs: 97 active+clean; 13 GiB data, 134 GiB used, 939 GiB / 1.0 TiB avail; 19 KiB/s rd, 113 MiB/s wr, 19.36k op/s
2023-12-07T00:26:45.722 INFO:journalctl@ceph.mon.a.smithi105.stdout:Dec 07 00:26:45 smithi105 ceph-mon[98935]: Error loading MDS rank 0: (22) Invalid argument
2023-12-07T00:26:45.722 INFO:journalctl@ceph.mon.a.smithi105.stdout:Dec 07 00:26:45 smithi105 ceph-mon[98935]: Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2023-12-07T00:26:45.722 INFO:journalctl@ceph.mon.a.smithi105.stdout:Dec 07 00:26:45 smithi105 ceph-mon[98935]: Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
2023-12-07T00:26:45.722 INFO:journalctl@ceph.mon.a.smithi105.stdout:Dec 07 00:26:45 smithi105 ceph-mon[98935]: osdmap e86: 12 total, 12 up, 12 in
2023-12-07T00:26:45.722 INFO:journalctl@ceph.mon.a.smithi105.stdout:Dec 07 00:26:45 smithi105 ceph-mon[98935]: mds.? [v2:172.21.15.181:6838/824276348,v1:172.21.15.181:6839/824276348] down:damaged
2023-12-07T00:26:45.723 INFO:journalctl@ceph.mon.a.smithi105.stdout:Dec 07 00:26:45 smithi105 ceph-mon[98935]: fsmap cephfs:2/3 {1=h=up:active,2=c=up:active} 2 up:standby-replay 6 up:standby, 1 damaged
2023-12-07T00:26:45.729 INFO:journalctl@ceph.mds.i.smithi181.stdout:Dec 07 00:26:45 smithi181 ceph-274c4118-9495-11ee-95a2-87774f69a715-mds-i[141832]: 2023-12-07T00:26:45.410+0000 7f40f5a1f700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 0: (22) Invalid argument
2023-12-07T00:26:45.729 INFO:journalctl@ceph.mds.i.smithi181.stdout:Dec 07 00:26:45 smithi181 ceph-274c4118-9495-11ee-95a2-87774f69a715-mds-i[141832]: -14> 2023-12-07T00:26:45.410+0000 7f40f5a1f700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 0: (22) Invalid argument
mds.i (the standby-replay daemon for rank 0) is marked as damaged:
2023-12-07T00:26:45.410+0000 7f40f5a1f700 -1 log_channel(cluster) log [ERR] : Error loading MDS rank 0: (22) Invalid argument
2023-12-07T00:26:45.410+0000 7f40f5a1f700 5 mds.beacon.i set_want_state: up:standby-replay -> down:damaged
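So the standby-replay daemon hit an error replaying rank 0's journal and asked the mon to mark the rank damaged. With the rank down, the journal can be inspected offline; a sketch using cephfs-journal-tool against this cluster's fs name:

cephfs-journal-tool --rank=cephfs:0 header get       # journal header: write/expire/trimmed positions
cephfs-journal-tool --rank=cephfs:0 journal inspect  # object-level integrity check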
And the MDS status in the fsmap is now:
up {1=24479,2=24451}
failed
damaged 0
stopped
data_pools [3]
metadata_pool 2
inline_data disabled
balancer
bal_rank_mask -1
standby_count_wanted 0
[mds.h{1:24479} state up:active seq 6 addr [v2:172.21.15.136:6836/3363137736,v1:172.21.15.136:6837/3363137736] compat {c=[1],r=[1],i=[fff]}]
[mds.e{1:24455} state up:standby-replay seq 1 addr [v2:172.21.15.136:6834/3499967199,v1:172.21.15.136:6835/3499967199] compat {c=[1],r=[1],i=[fff]}]
[mds.c{2:24451} state up:active seq 7 export targets 0,1 addr [v2:172.21.15.181:6836/3909565630,v1:172.21.15.181:6837/3909565630] compat {c=[1],r=[1],i=[fff]}]
[mds.f{2:24439} state up:standby-replay seq 1 addr [v2:172.21.15.181:6834/3575479535,v1:172.21.15.181:6835/3575479535] compat {c=[1],r=[1],i=[fff]}]
Standby daemons:
[mds.j{-1:14574} state up:standby seq 1 addr [v2:172.21.15.105:6834/462434151,v1:172.21.15.105:6835/462434151] compat {c=[1],r=[1],i=[fff]}]
[mds.d{-1:14586} state up:standby seq 1 addr [v2:172.21.15.105:6836/2965410210,v1:172.21.15.105:6837/2965410210] compat {c=[1],r=[1],i=[fff]}]
[mds.g{-1:14649} state up:standby seq 1 addr [v2:172.21.15.105:6838/913925081,v1:172.21.15.105:6839/913925081] compat {c=[1],r=[1],i=[fff]}]
[mds.a{-1:14661} state up:standby seq 1 addr [v2:172.21.15.105:6840/2502623746,v1:172.21.15.105:6841/2502623746] compat {c=[1],r=[1],i=[fff]}]
[mds.l{-1:24391} state up:standby seq 1 addr [v2:172.21.15.181:6832/482210475,v1:172.21.15.181:6833/482210475] compat {c=[1],r=[1],i=[fff]}]
[mds.k{-1:24431} state up:standby seq 1 addr [v2:172.21.15.136:6832/572280608,v1:172.21.15.136:6833/572280608] compat {c=[1],r=[1],i=[fff]}]
And it stays like this: there are active MDSs, but none at rank 0, i.e. no rank failover takes place, which is strange. And since the ceph tell command targets rank 0, it always fails.
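The exact tell command the test runs isn't shown here, but anything addressed to rank 0 will fail while the rank has no daemon, e.g. (the subcommand is illustrative):

ceph tell mds.cephfs:0 flush journal

And if the journal turns out to be intact, the damaged flag can be cleared manually so a standby can claim rank 0:

ceph mds repaired cephfs:0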
The reason the active MDS at rank 0 goes silent is not yet clear from the MDS logs.
However, there is this trace in mds.i, the standby-replay daemon:
2023-12-07T00:26:44.960+0000 7f40f7222700 0 mds.24469.journaler.mdlog(ro) _finish_read got less than expected (4149413)
and its logs are filled with this line:
2023-12-07T00:26:25.387+0000 7f4100234700 1 -- [v2:172.21.15.181:6838/824276348,v1:172.21.15.181:6839/824276348] <== osd.8 v2:172.21.15.181:6816/602100546 173 ==== osd_op_reply(1141 200.00000003 [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 156+0+0 (crc 0 0 0) 0x555d6c9aafc0 con 0x555d6d7d0400
This is a stat on object 200.00000003 (one of rank 0's journal objects) returning -2 (ENOENT), which might indicate that something is wrong with the on-disk journal, and would explain the short read ("got less than expected") above. Can this lead to the MDS crashing?
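If the on-disk journal is the suspect, one hedged way to confirm from the rados side: rank 0's mdlog lives in inode 0x200, so its backing objects are named 200.<offset> in the metadata pool (pool id 2 per the fsmap; the pool name below is an assumption, substitute the real one):

rados -p cephfs_metadata stat 200.00000003          # ENOENT here confirms the hole the MDS hit
rados -p cephfs_metadata ls | grep '^200\.' | sort  # look for gaps in the object sequence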