Bug #53194: mds: opening connection to up:replay/up:creating daemon causes message drop - CephFS - Ceph

Bug #53194

closed

mds: opening connection to up:replay/up:creating daemon causes message drop

Added by Patrick Donnelly over 2 years ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Assignee: Patrick Donnelly
Category: Correctness/Safety
Target version: Ceph - v17.0.0
% Done:
Source: Q/A
Tags: backport_processed
Backport: pacific,octopus
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): multimds, qa, qa-failure, task(medium)
Pull request ID: 43850
Crash signature (v1):
Crash signature (v2):

Description

Found a QA run where an MDS was stuck in up:resolve:

https://pulpito.ceph.com/pdonnell-2021-11-05_19:13:39-fs:upgrade-wip-pdonnell-testing-20211105.172813-distro-basic-smithi/6488028/

This occurs in a multimds cluster. The cause is that the other active MDS drops the new MDS's messages:

2021-11-05T20:08:26.796+0000 7fb4eae68700  1 --2- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] >> [v2:172.21.15.125:6826/1162235965,v1:172.21.15.125:6827/1162235965] conn(0x562dc992c400 0x562dcc09f400 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2021-11-05T20:08:26.796+0000 7fb4eae68700  1 --2- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] >> [v2:172.21.15.125:6826/1162235965,v1:172.21.15.125:6827/1162235965] conn(0x562dc992c400 0x562dcc09f400 crc :-1 s=READY pgs=6 cs=0 l=0 rev1=1 rx=0 tx=0).ready entity=mds.? client_cookie=25cbe7aa447d9f35 server_cookie=33eddd17bae5e981 in_seq=0 out_seq=0
...
2021-11-05T20:08:31.634+0000 7fb4e8663700  1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 1 ==== mdsmap(e 21) v2 ==== 933+0+0 (crc 0 0 0) 0x562dc51424e0 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700  5 mds.cephfs.smithi098.pucypu handle_mds_map old map epoch 21 <= 21, discarding
2021-11-05T20:08:31.634+0000 7fb4e8663700  1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 2 ==== mds_table_request(snaptable server_ready) v1 ==== 16+0+0 (crc 0 0 0) 0x562dc95be300 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700  5 mds.1.6 got mds_table_request(snaptable server_ready) v1 from down/old/bad/imposter mds mds.?, dropping
2021-11-05T20:08:31.634+0000 7fb4e8663700  1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 3 ==== mds_resolve(2+0 subtrees +0 peer requests) v1 ==== 89+0+0 (crc 0 0 0) 0x562dcc33e5a0 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700  5 mds.1.6 got mds_resolve(2+0 subtrees +0 peer requests) v1 from down/old/bad/imposter mds mds.?, dropping

From: /ceph/teuthology-archive/pdonnell-2021-11-05_19:13:39-fs:upgrade-wip-pdonnell-testing-20211105.172813-distro-basic-smithi/6488028/remote/smithi098/log/cb9d093a-3e72-11ec-8c28-001a4aab830c/ceph-mds.cephfs.smithi098.pucypu.log-20211106.gz
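To check whether another MDS log shows the same symptom, the "dropping" lines can be pulled out with a small script. This is only a sketch: the regular expression is written against the line format in the excerpt above and may need adjusting for other Ceph versions or log settings, and `DROP_RE` and `find_dropped` are hypothetical helper names, not part of Ceph.

```python
import re

# Pattern modelled on the "dropping" lines in the excerpt above, e.g.
# "<timestamp> <thread>  5 mds.1.6 got <message> from down/old/bad/imposter mds mds.?, dropping"
# Assumption: other logs use the same layout (timestamp, thread id, debug level).
DROP_RE = re.compile(
    r"^(?P<ts>\S+)\s+\S+\s+\d+\s+(?P<receiver>mds\.\S+) got (?P<msg>.+?) "
    r"from down/old/bad/imposter mds (?P<peer>\S+), dropping$"
)


def find_dropped(lines):
    """Yield (timestamp, receiver, message, peer) for each dropped peer message."""
    for line in lines:
        match = DROP_RE.match(line.strip())
        if match:
            yield (
                match.group("ts"),
                match.group("receiver"),
                match.group("msg"),
                match.group("peer"),
            )


if __name__ == "__main__":
    # One line copied verbatim from the excerpt above.
    sample = [
        "2021-11-05T20:08:31.634+0000 7fb4e8663700  5 mds.1.6 got "
        "mds_resolve(2+0 subtrees +0 peer requests) v1 from "
        "down/old/bad/imposter mds mds.?, dropping",
    ]
    for ts, receiver, msg, peer in find_dropped(sample):
        print(f"{ts} {receiver} dropped {msg} from {peer}")
```

For a compressed log such as the one referenced above, the lines can be read with `gzip.open(path, "rt")` from the standard library and passed straight to `find_dropped`. A peer of `mds.?` in the result is the signature described in this report.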

Rank 1 opened a connection to rank 0 while rank 0 was still in up:replay. This happened before rank 0 had processed its state change from mdsmap e19 and updated its "myname" with the messenger:

https://github.com/ceph/ceph/blob/fb8671c5733dc4dfed79e42deafd33c46e78c519/src/mds/MDSRank.cc#L2250-L2257

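Messenger ProtocolV2 now associates the daemon type and rank with a connection when the connection is created, so any later update by rank 0 to its name is no longer propagated to its peers. Rank 1 therefore keeps seeing rank 0 as mds.? on that connection and drops its messages, as the log above shows.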
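The ordering problem can be illustrated with a toy model. This is not Ceph code: `Daemon`, `Connection`, `connect`, and `handle_peer_message` are hypothetical names, the name `mds.0` is illustrative, and the check is a simplified stand-in for the real validation of the sender. It only shows why a peer name captured at connection creation stays stale after the daemon renames itself.

```python
from dataclasses import dataclass

UNKNOWN = "mds.?"  # placeholder name, as seen in the log excerpt


@dataclass
class Daemon:
    # Toy daemon: starts with a placeholder name until it learns its rank.
    name: str = UNKNOWN


@dataclass
class Connection:
    # The peer's name is captured once, when the connection is created.
    peer_name: str


def connect(server: Daemon) -> Connection:
    """Toy handshake: record the server's current name and never refresh it."""
    return Connection(peer_name=server.name)


def handle_peer_message(conn: Connection, msg: str) -> str:
    """Simplified receiver check: drop messages from a peer with no known rank."""
    if conn.peer_name == UNKNOWN:
        return f"dropping {msg}"
    return f"handling {msg}"


if __name__ == "__main__":
    rank0 = Daemon()
    early = connect(rank0)   # rank 1 connects while rank 0 is still up:replay
    rank0.name = "mds.0"     # rank 0 later processes the map and sets its name
    late = connect(rank0)    # a connection opened after the rename

    print(handle_peer_message(early, "mds_resolve"))  # dropping mds_resolve
    print(handle_peer_message(late, "mds_resolve"))   # handling mds_resolve
```

In the model, only the connection opened before the rename drops messages, which matches the sequence described above: the early connection never learns rank 0's updated name.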

I have a reproducer working. I am in the process of finishing the fix.

Related issues 2 (0 open — 2 closed)