Bug #53194
Status: Closed
mds: opening connection to up:replay/up:creating daemon causes message drop
Description
Found a QA run where an MDS was stuck in up:resolve.
This occurs in a multimds cluster. The cause is that the other active MDS drops the new MDS's messages (a simplified sketch of the drop check follows the log excerpt):
2021-11-05T20:08:26.796+0000 7fb4eae68700 1 --2- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] >> [v2:172.21.15.125:6826/1162235965,v1:172.21.15.125:6827/1162235965] conn(0x562dc992c400 0x562dcc09f400 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2021-11-05T20:08:26.796+0000 7fb4eae68700 1 --2- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] >> [v2:172.21.15.125:6826/1162235965,v1:172.21.15.125:6827/1162235965] conn(0x562dc992c400 0x562dcc09f400 crc :-1 s=READY pgs=6 cs=0 l=0 rev1=1 rx=0 tx=0).ready entity=mds.? client_cookie=25cbe7aa447d9f35 server_cookie=33eddd17bae5e981 in_seq=0 out_seq=0
...
2021-11-05T20:08:31.634+0000 7fb4e8663700 1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 1 ==== mdsmap(e 21) v2 ==== 933+0+0 (crc 0 0 0) 0x562dc51424e0 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700 5 mds.cephfs.smithi098.pucypu handle_mds_map old map epoch 21 <= 21, discarding
2021-11-05T20:08:31.634+0000 7fb4e8663700 1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 2 ==== mds_table_request(snaptable server_ready) v1 ==== 16+0+0 (crc 0 0 0) 0x562dc95be300 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700 5 mds.1.6 got mds_table_request(snaptable server_ready) v1 from down/old/bad/imposter mds mds.?, dropping
2021-11-05T20:08:31.634+0000 7fb4e8663700 1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 3 ==== mds_resolve(2+0 subtrees +0 peer requests) v1 ==== 89+0+0 (crc 0 0 0) 0x562dcc33e5a0 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700 5 mds.1.6 got mds_resolve(2+0 subtrees +0 peer requests) v1 from down/old/bad/imposter mds mds.?, dropping
From: /ceph/teuthology-archive/pdonnell-2021-11-05_19:13:39-fs:upgrade-wip-pdonnell-testing-20211105.172813-distro-basic-smithi/6488028/remote/smithi098/log/cb9d093a-3e72-11ec-8c28-001a4aab830c/ceph-mds.cephfs.smithi098.pucypu.log-20211106.gz
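The "down/old/bad/imposter mds mds.?, dropping" lines are the message drop itself. Below is a minimal C++ sketch of the kind of gate that produces them, not the actual Ceph source; PeerName, MDSMapView, and accept_peer_message are hypothetical stand-ins. The point it illustrates: an MDS only accepts inter-MDS messages whose sender resolves to a known up rank in its mdsmap, so a sender whose connection is stuck as "mds.?" can never pass.

// Minimal sketch of the drop gate; hypothetical types, not Ceph's code.
#include <iostream>
#include <map>
#include <string>

struct PeerName {
    int rank = -1;  // -1 renders as "mds.?": no rank was ever associated
    bool resolved() const { return rank >= 0; }
};

struct MDSMapView {
    std::map<int, std::string> up;  // rank -> state, e.g. "up:resolve"
    bool is_up(int rank) const { return up.count(rank) > 0; }
};

bool accept_peer_message(const MDSMapView& mdsmap, const PeerName& from) {
    // Inter-MDS messages are only accepted when the sender resolves to a
    // known up rank; an unresolved "mds.?" sender always fails here.
    if (!from.resolved() || !mdsmap.is_up(from.rank)) {
        std::cout << "got message from down/old/bad/imposter mds, dropping\n";
        return false;
    }
    return true;
}

int main() {
    MDSMapView map{{{0, "up:resolve"}, {1, "up:resolve"}}};
    accept_peer_message(map, PeerName{});   // unresolved peer: dropped
    accept_peer_message(map, PeerName{0});  // known up rank: accepted
}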
Rank 1 opened a connection to rank 0 while rank 0 was in up:replay. This occurred before rank 0 was able to process its state change from mdsmap e19 and update its "myname" with the messenger.
The ProtocolV2 messenger now associates the daemon type/rank with the connection at creation time, so any later update by rank 0 to its name is no longer propagated to its peers.
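To make the staleness concrete, here is a hedged illustration of that creation-time association; Connection is a hypothetical stand-in, not Ceph's ProtocolV2 connection class.

// Hypothetical model of a connection that snapshots the peer name once.
#include <iostream>
#include <string>
#include <utility>

struct Connection {
    std::string peer_name;  // captured once, when the session becomes ready
    explicit Connection(std::string name) : peer_name(std::move(name)) {}
};

int main() {
    // Rank 1 connects while the peer is still up:replay, before the peer
    // has a rank to advertise, so the name is captured as "mds.?"
    // (compare "entity=mds.?" in the .ready log line above).
    Connection con("mds.?");

    // The peer later becomes rank 0, but the cached name on this end is
    // never refreshed, so every message over this connection still arrives
    // as "mds.?" and keeps failing the known-up-rank gate sketched earlier.
    std::cout << "message from " << con.peer_name << "\n";
}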
I have a reproducer working. I am in the process of finishing the fix.
Updated by Patrick Donnelly over 2 years ago
- Category set to Correctness/Safety
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 43850
Updated by Venky Shankar over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 2 years ago
- Copied to Backport #53445: pacific: mds: opening connection to up:replay/up:creating daemon causes message drop added
Updated by Backport Bot over 2 years ago
- Copied to Backport #53446: octopus: mds: opening connection to up:replay/up:creating daemon causes message drop added
Updated by Patrick Donnelly about 1 year ago
- Status changed from Pending Backport to Resolved