Bug #53194

closed

mds: opening connection to up:replay/up:creating daemon causes message drop

Added by Patrick Donnelly over 2 years ago. Updated about 1 year ago.

Status: Resolved
Priority: Urgent
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Q/A
Tags: backport_processed
Backport: pacific,octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): multimds, qa, qa-failure, task(medium)
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Found a QA run where an MDS was stuck in up:resolve:

https://pulpito.ceph.com/pdonnell-2021-11-05_19:13:39-fs:upgrade-wip-pdonnell-testing-20211105.172813-distro-basic-smithi/6488028/

This occurs in a multimds cluster. The cause is that the other active MDS is dropping the new MDS's messages:

2021-11-05T20:08:26.796+0000 7fb4eae68700  1 --2- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] >> [v2:172.21.15.125:6826/1162235965,v1:172.21.15.125:6827/1162235965] conn(0x562dc992c400 0x562dcc09f400 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2021-11-05T20:08:26.796+0000 7fb4eae68700  1 --2- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] >> [v2:172.21.15.125:6826/1162235965,v1:172.21.15.125:6827/1162235965] conn(0x562dc992c400 0x562dcc09f400 crc :-1 s=READY pgs=6 cs=0 l=0 rev1=1 rx=0 tx=0).ready entity=mds.? client_cookie=25cbe7aa447d9f35 server_cookie=33eddd17bae5e981 in_seq=0 out_seq=0
...
2021-11-05T20:08:31.634+0000 7fb4e8663700  1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 1 ==== mdsmap(e 21) v2 ==== 933+0+0 (crc 0 0 0) 0x562dc51424e0 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700  5 mds.cephfs.smithi098.pucypu handle_mds_map old map epoch 21 <= 21, discarding
2021-11-05T20:08:31.634+0000 7fb4e8663700  1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 2 ==== mds_table_request(snaptable server_ready) v1 ==== 16+0+0 (crc 0 0 0) 0x562dc95be300 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700  5 mds.1.6 got mds_table_request(snaptable server_ready) v1 from down/old/bad/imposter mds mds.?, dropping
2021-11-05T20:08:31.634+0000 7fb4e8663700  1 -- [v2:172.21.15.98:6826/647991324,v1:172.21.15.98:6827/647991324] <== mds.? v2:172.21.15.125:6826/1162235965 3 ==== mds_resolve(2+0 subtrees +0 peer requests) v1 ==== 89+0+0 (crc 0 0 0) 0x562dcc33e5a0 con 0x562dc992c400
2021-11-05T20:08:31.634+0000 7fb4e8663700  5 mds.1.6 got mds_resolve(2+0 subtrees +0 peer requests) v1 from down/old/bad/imposter mds mds.?, dropping

From: /ceph/teuthology-archive/pdonnell-2021-11-05_19:13:39-fs:upgrade-wip-pdonnell-testing-20211105.172813-distro-basic-smithi/6488028/remote/smithi098/log/cb9d093a-3e72-11ec-8c28-001a4aab830c/ceph-mds.cephfs.smithi098.pucypu.log-20211106.gz
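
To check another MDS log for the same symptom, a small filter along these lines may help. This is a minimal sketch: the regular expression is derived only from the two "dropping" lines quoted above, so other message formats may not match, and `find_dropped_messages` is a hypothetical helper name, not part of any Ceph tooling.

```python
import re

# Pattern inferred from the two "dropping" lines in the excerpt above (an assumption,
# not an official log format specification).
DROP_RE = re.compile(
    r"got (?P<type>\w+)\((?P<args>.*)\) v\d+ "
    r"from down/old/bad/imposter mds (?P<peer>\S+), dropping"
)


def find_dropped_messages(lines):
    """Return (message type, peer name) for each dropped-message log line."""
    dropped = []
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            dropped.append((match.group("type"), match.group("peer")))
    return dropped


# Lines copied from the log excerpt above; the first is not a drop and is skipped.
sample = [
    "2021-11-05T20:08:31.634+0000 7fb4e8663700  5 mds.cephfs.smithi098.pucypu handle_mds_map old map epoch 21 <= 21, discarding",
    "2021-11-05T20:08:31.634+0000 7fb4e8663700  5 mds.1.6 got mds_table_request(snaptable server_ready) v1 from down/old/bad/imposter mds mds.?, dropping",
    "2021-11-05T20:08:31.634+0000 7fb4e8663700  5 mds.1.6 got mds_resolve(2+0 subtrees +0 peer requests) v1 from down/old/bad/imposter mds mds.?, dropping",
]

for msg_type, peer in find_dropped_messages(sample):
    print(msg_type, peer)
```

A peer name of "mds.?" in the results indicates that the receiving MDS never learned the sender's rank on that connection.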

Rank 1 opened a connection to rank 0 while rank 0 was still in up:replay. This happened before rank 0 was able to process its state change from mdsmap e19 and update its "myname" with the messenger:

https://github.com/ceph/ceph/blob/fb8671c5733dc4dfed79e42deafd33c46e78c519/src/mds/MDSRank.cc#L2250-L2257

Messenger ProtocolV2 now associates the daemon type and rank with a connection when the connection is created, so any later updates by rank 0 to its name are no longer propagated to its peers.
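
The ordering problem can be illustrated with a toy model. This is only a sketch of the race as described above, not Ceph code: `ToyMessenger`, `ToyConnection`, `open_connection`, and `receive` are hypothetical names, the drop check is a simplification, and "mds.0" is used as an illustrative name for rank 0.

```python
from dataclasses import dataclass

UNKNOWN_NAME = "mds.?"  # entity name seen in the log before the rank is known


@dataclass
class ToyMessenger:
    """Hypothetical stand-in for a daemon's messenger (not Ceph's Messenger class)."""
    myname: str = UNKNOWN_NAME

    def set_myname(self, name: str) -> None:
        # Models the daemon updating its name after processing its state change.
        self.myname = name


@dataclass
class ToyConnection:
    """Hypothetical connection that records the peer's name once, at creation."""
    peer_name: str


def open_connection(peer: ToyMessenger) -> ToyConnection:
    # The name is copied here and never refreshed afterwards.
    return ToyConnection(peer_name=peer.myname)


def receive(conn: ToyConnection, msg: str) -> str:
    # Simplified receiver-side check: messages from an unidentified MDS are dropped.
    if conn.peer_name == UNKNOWN_NAME:
        return f"dropping {msg} from {conn.peer_name}"
    return f"handling {msg} from {conn.peer_name}"


rank0 = ToyMessenger()            # rank 0 is still in up:replay, name not yet set
early = open_connection(rank0)    # rank 1 connects before the name update
rank0.set_myname("mds.0")         # rank 0 processes its state change
late = open_connection(rank0)     # a connection opened after the update

print(receive(early, "mds_resolve"))
print(receive(late, "mds_resolve"))
```

In this model the early connection keeps dropping messages even after rank 0 has set its name, while a connection opened afterwards behaves normally, which matches the symptom in the log.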

I have a reproducer working. I am in the process of finishing the fix.


Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #53445: pacific: mds: opening connection to up:replay/up:creating daemon causes message drop (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #53446: octopus: mds: opening connection to up:replay/up:creating daemon causes message drop (Rejected, Patrick Donnelly)
#1

Updated by Patrick Donnelly over 2 years ago

  • Category set to Correctness/Safety
  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 43850
#2

Updated by Venky Shankar over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
#3

Updated by Backport Bot over 2 years ago

  • Copied to Backport #53445: pacific: mds: opening connection to up:replay/up:creating daemon causes message drop added
#4

Updated by Backport Bot over 2 years ago

  • Copied to Backport #53446: octopus: mds: opening connection to up:replay/up:creating daemon causes message drop added
#5

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
#6

Updated by Patrick Donnelly about 1 year ago

  • Status changed from Pending Backport to Resolved