Bug #42058

closed

OSD reconnected across map epochs, inconsistent pg logs created

Added by 相洋 于 over 4 years ago. Updated over 4 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: Peering
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Take the lossless cluster connection between osd.2 and osd.47 as an example.

osd.47 is restarted while osd.2 still has an op to send to osd.47.

osd.2 then shuts down the current connection and begins reconnecting to osd.47.

Once osd.47 is up, osd.2 sends it a connect message, and osd.47 replies with a reset session message.

osd.2 then discards the messages in its out queue, and the op is lost.
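
The sequence above can be sketched as a toy model. This is not Ceph's messenger code: the `Server` and `Client` classes, the `handle_connect` and `reconnect` methods, and the string tags are all hypothetical names chosen for illustration, and csq is taken here to mean the connect sequence number carried by the client. The only behaviour modelled is the one described in this report: a restarted server answers a nonzero csq with a reset session reply, and the client drops its out queue when it receives one.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical reply tags for this toy model (not Ceph's real constants).
RESET_SESSION = "RESETSESSION"
READY = "READY"


@dataclass
class Server:
    """Toy stand-in for the restarted osd.47; it remembers no old sessions."""
    received: list = field(default_factory=list)

    def handle_connect(self, csq: int) -> str:
        # A nonzero csq refers to a session this fresh server never saw.
        return RESET_SESSION if csq > 0 else READY


@dataclass
class Client:
    """Toy stand-in for osd.2, holding ops that are waiting to be sent."""
    csq: int = 0
    out_queue: deque = field(default_factory=deque)

    def reconnect(self, server: Server) -> None:
        reply = server.handle_connect(self.csq)
        if reply == RESET_SESSION:
            # Behaviour described in the report: queued messages are discarded.
            self.out_queue.clear()
            self.csq = 0
            reply = server.handle_connect(self.csq)
        # Deliver whatever is still queued once the session is ready.
        while reply == READY and self.out_queue:
            server.received.append(self.out_queue.popleft())


def demo() -> list:
    server = Server()
    client = Client(csq=3, out_queue=deque(["op-1"]))
    client.reconnect(server)
    return server.received
```

In this model `demo()` returns an empty list: the queued op never reaches the server, even though the connection was meant to be lossless.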

In our Luminous cluster we hit a rare case (3 replicas; somewhat complex and hard to describe) in which one OSD in a placement group had one more op than the others. We only discovered this when scrubbing. We eventually found that the op had been lost at the client endpoint when it received a reset session message.

All in all, I think that when the connection is lossless, the server endpoint should not send a reset session message even if the client endpoint's csq is not 0; it would be better to send a retry session command with csq set to 0.
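
A minimal sketch of this proposal, again as a toy model rather than Ceph's actual handshake code. The functions `server_reply` and `client_reconnect`, the `lossless` flag, and the string tags are hypothetical. The sketch assumes a freshly restarted server with no record of the old session, and compares the proposed retry reply on a lossless connection with the reset reply.

```python
from collections import deque


def server_reply(peer_csq: int, lossless: bool) -> tuple:
    """Reply of a restarted server with no record of the session (hypothetical)."""
    if peer_csq == 0:
        return ("READY", 0)
    if lossless:
        # Proposed behaviour: ask the client to retry with csq reset to 0.
        return ("RETRY_SESSION", 0)
    # Reset path: tell the client that the old session is gone.
    return ("RESETSESSION", 0)


def client_reconnect(csq: int, out_queue: deque, lossless: bool) -> list:
    """Run the toy handshake and return the ops that reach the server."""
    delivered = []
    tag, new_csq = server_reply(csq, lossless)
    while tag != "READY":
        if tag == "RESETSESSION":
            # A reset discards everything that was queued for sending.
            out_queue.clear()
        # Both reset and retry make the client adopt the csq from the reply.
        csq = new_csq
        tag, new_csq = server_reply(csq, lossless)
    while out_queue:
        delivered.append(out_queue.popleft())
    return delivered


def compare() -> dict:
    return {
        "lossless": client_reconnect(3, deque(["op-1"]), lossless=True),
        "lossy": client_reconnect(3, deque(["op-1"]), lossless=False),
    }
```

Here `compare()` returns `{"lossless": ["op-1"], "lossy": []}`: with the retry reply the client keeps its out queue and the op is delivered after the handshake, while the reset reply loses it.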


Related issues 1 (0 open, 1 closed)

Is duplicate of Messengers - Bug #36612: msg/async: connection stall (Resolved, 10/28/2018)
