Bug #42058
closed
OSD reconnected across map epochs, inconsistent pg logs created
0%
Description
Take the lossless cluster connection between osd.2 and osd.47 as an example.
Suppose osd.47 is restarted while osd.2 has an op that it needs to send to osd.47.
osd.2 then shuts down the current connection and begins reconnecting to osd.47.
Once osd.47 is up, osd.2 sends a connect message to osd.47, and osd.47 replies with a reset session message.
osd.2 then discards the messages in its out queue, and the op is lost.
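The sequence above can be sketched as a small toy model. This is not Ceph messenger code: `ClientEndpoint`, `ServerEndpoint`, the string reply tags and the op name are hypothetical names made up for illustration, and the only behaviour modelled is the one described in this report (a restarted peer with no session state answers a nonzero connect sequence with a reset, and the client drops its out queue on reset).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ClientEndpoint:
    """Toy stand-in for osd.2's side of the connection (hypothetical)."""
    connect_seq: int = 0
    out_queue: deque = field(default_factory=deque)

    def queue_op(self, op):
        self.out_queue.append(op)

    def on_reset_session(self):
        # The peer has no record of the session: the queued messages
        # are discarded and the connect sequence starts over.
        dropped = list(self.out_queue)
        self.out_queue.clear()
        self.connect_seq = 0
        return dropped


class ServerEndpoint:
    """Toy stand-in for a freshly restarted osd.47 with no session state."""

    def on_connect(self, client_connect_seq):
        # No existing session, but the client claims an established one.
        return "RESET_SESSION" if client_connect_seq > 0 else "READY"


osd2 = ClientEndpoint(connect_seq=1)   # session was established before the restart
osd2.queue_op("write-op-1")            # op queued while osd.47 is restarting

reply = ServerEndpoint().on_connect(osd2.connect_seq)
lost = osd2.on_reset_session() if reply == "RESET_SESSION" else []
print(reply, lost)
```

In this model the op queued during the restart ends up in `lost` and is never resent, which matches the behaviour reported here.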
In our luminous cluster we hit a rare case (3 replicas; the scenario is somewhat complex and hard to describe) in which one OSD in a placement group had one more op than the others. We only noticed this when scrubbing. Eventually we found that the op had been lost at the client endpoint when it received a reset session message.
All in all, I think that when the connection is lossless, the server endpoint should not send a reset session message even if the client endpoint's csq is not 0. It would be better to send a retry session command with csq set to 0.
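A minimal sketch of the difference between the current and the proposed reply, under the same assumptions as the description: the function names, the `proposed` flag and the tuple-shaped reply are hypothetical and do not correspond to real Ceph APIs. The only point is that a retry with csq set to 0 lets the client keep its out queue.

```python
from collections import deque


def server_reply(has_session, client_csq, lossless, proposed):
    """Reply of a server endpoint to a connect message (toy model)."""
    if has_session or client_csq == 0:
        return ("READY", client_csq)
    if lossless and proposed:
        # Proposed: ask the client to retry with csq 0 instead of resetting.
        return ("RETRY_SESSION", 0)
    # Behaviour described in this report: reset the session.
    return ("RESET_SESSION", 0)


def client_handle(reply, out_queue):
    """Apply the reply on the client side; only a reset drops queued ops."""
    tag, csq = reply
    if tag == "RESET_SESSION":
        out_queue.clear()
    return csq, list(out_queue)


def reconnect(proposed):
    queue = deque(["write-op-1"])      # op pending when the peer restarted
    reply = server_reply(has_session=False, client_csq=3,
                         lossless=True, proposed=proposed)
    csq, pending = client_handle(reply, queue)
    return reply[0], csq, pending


print(reconnect(proposed=False))
print(reconnect(proposed=True))
```

With `proposed=False` the pending list comes back empty; with `proposed=True` the op is still queued and can be resent once the session is re-established.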