Bug #42058

OSD reconnected across map epochs, inconsistent pg logs created

Added by 相洋 5 months ago. Updated 4 months ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: Peering
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature:

Description

Take the lossless cluster connection between osd.2 and osd.47 as an example.

Suppose osd.47 is restarted while osd.2 has an op to send to osd.47.

osd.2 then shuts down the current connection and begins to reconnect to osd.47.

When osd.47 is up, osd.2 sends a connect message to osd.47, and osd.47 replies with a reset-session message.

osd.2 then deletes the messages in its out queue, and the op is lost.

In our Luminous cluster we hit a rare case (3 replicas; somewhat complex and hard to describe) in which one OSD in a PG had one more op than the others. We only found this when scrubbing. In the end we found that the op was lost at the client endpoint when it received a reset-session message.

All in all, I think that when the connection is lossless, the server endpoint should not send a reset-session message even if the client endpoint's csq is not 0; it would be better to send a retry-session command with csq set to 0.
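
The sequence above can be sketched as a toy model. This is not Ceph's messenger code: the class names, the `Reply` values, and the `retry_on_lossless` switch are hypothetical and only mirror the behavior described in this report, where a restarted acceptor with no record of the peer answers a nonzero connect sequence with a session reset and the connector then drops its queued messages.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Reply(Enum):
    READY = auto()
    RESETSESSION = auto()
    RETRY_SESSION = auto()


@dataclass
class Acceptor:
    """Toy model of the restarted peer (osd.47); it has no prior session state."""
    retry_on_lossless: bool = False  # hypothetical switch for the proposed behavior
    known_peers: set = field(default_factory=set)

    def handle_connect(self, peer: str, connect_seq: int, lossless: bool) -> Reply:
        if peer not in self.known_peers and connect_seq > 0:
            if lossless and self.retry_on_lossless:
                # Proposed: ask the peer to retry with connect_seq 0 instead.
                return Reply.RETRY_SESSION
            # Reported behavior: tell the peer its session is gone.
            return Reply.RESETSESSION
        self.known_peers.add(peer)
        return Reply.READY


@dataclass
class Connector:
    """Toy model of the reconnecting side (osd.2)."""
    name: str
    connect_seq: int
    out_q: list = field(default_factory=list)

    def reconnect(self, acceptor: Acceptor, lossless: bool = True) -> None:
        while True:
            reply = acceptor.handle_connect(self.name, self.connect_seq, lossless)
            if reply is Reply.READY:
                return
            if reply is Reply.RESETSESSION:
                self.out_q.clear()  # queued ops are discarded here
            self.connect_seq = 0    # both replies restart the handshake from 0


# Reported behavior: the queued op is lost.
osd2 = Connector("osd.2", connect_seq=3, out_q=["op"])
osd2.reconnect(Acceptor())
print(osd2.out_q)  # []

# Proposed behavior: the queued op survives the reconnect.
osd2 = Connector("osd.2", connect_seq=3, out_q=["op"])
osd2.reconnect(Acceptor(retry_on_lossless=True))
print(osd2.out_q)  # ['op']
```

In this model the only difference between the two replies is whether the connector clears `out_q`, which is the point of the proposal.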


Related issues

Duplicates Messengers - Bug #36612: msg/async: connection stall Resolved 10/28/2018

History

#1 Updated by Greg Farnum 5 months ago

  • Status changed from New to In Progress
  • Pull request ID set to 30609

#2 Updated by 相洋 4 months ago

@Greg

Assume pg 1.1a maps to OSDs [1,5,9], with osd1 as the primary.

Time 1: osd1, osd5 and osd9 were online and could send messages to each other.

Time 2: osd5 and osd9 received a new osdmap that showed osd1 as down. At the same time, osd1's public network was taken down manually (physically), but osd1's cluster network was still online.

Time 3:
Because the new osdmap showed osd1 as down, osd5 and osd9 shut down their connections to osd1 (through mark_down()), so they had no existing connections to osd1.
On osd1, the connections to osd5/osd9 hit a failure (osd5/osd9 had disconnected explicitly) and were about to enter the STANDBY state. As a consequence, these connections still existed on osd1 (their cs_seq > 0).
After a short while, osd1 generated two scrub operations (deep-scrub was enabled) that updated the version info of some objects (scrub_snapshot_metadata()), and set out to reestablish its connections to osd5 and osd9. When osd1 sent the first operation, op1 (via send_message()), the cluster messenger started reconnecting to osd5/osd9 and placed op1 in out_q. Before the connection entered STATE_OPEN, a RESETSESSION was exchanged between osd1 and osd5/osd9, which led osd1 to discard the messages in out_q (via was_session_reset()). After the connection was established, osd1 sent the second operation, op2, to osd5/osd9.
Eventually, two pg log entries were recorded on osd1 (op1, op2), but only one (op2) on osd5/osd9.

Time 4: osd1's public network recovered soon afterwards. During pg peering, the primary (osd1) could not find any difference between its pg log and those of osd5 and osd9. When the deep scrub of pg 1.1a finished, it reported an inconsistent error on the object version info (the version info associated with op1).

This is a rare situation that we ran into. In some cases, I think this could also cause messages to be delivered out of order. If I have misdiagnosed it, please tell me.
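
As a rough illustration of why peering saw nothing wrong, here is a sketch that assumes peering only compares the newest log entry of each replica. That is a simplifying assumption, not something confirmed in this ticket, and the version numbers and helper names are made up.

```python
def heads_match(logs: dict) -> bool:
    """Compare only the newest entry of each log (assumed peering shortcut)."""
    heads = {log[-1] for log in logs.values()}
    return len(heads) == 1


def missing_entries(primary: list, replica: list) -> list:
    """Entries the primary logged that the replica never received."""
    return [v for v in primary if v not in replica]


# Hypothetical (epoch, version) pairs for op1 and op2 from Time 3.
op1, op2 = (100, 41), (100, 42)
pg_logs = {
    "osd1": [op1, op2],  # primary applied both scrub updates
    "osd5": [op2],       # op1 was dropped from out_q on RESETSESSION
    "osd9": [op2],
}

print(heads_match(pg_logs))                               # True
print(missing_entries(pg_logs["osd1"], pg_logs["osd5"]))  # [(100, 41)]
```

Under that assumption all three logs end at op2, so the gap left by op1 only surfaces when deep scrub compares the object version info itself.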

#3 Updated by 相洋 4 months ago

See PR: https://github.com/ceph/ceph/pull/25343, which also avoids triggering RESETSESSION.

#4 Updated by Greg Farnum 4 months ago

  • Project changed from Messengers to RADOS
  • Subject changed from 【msg/async】 bad ression at osd cluster messenger to OSD reconnected across map epochs, inconsistent pg logs created
  • Category changed from AsyncMessenger to Peering
  • Status changed from In Progress to New
  • Component(RADOS) OSD added

Okay, so the issue here is that osd.1 managed to reconnect to osd.5 and osd.9 without triggering a wider reset of the PG state that canceled the ongoing background (scrub) operations. osd.{5,9} should have detected the OSDMap version mismatches and rejected ops from osd.1.

This is running on a Luminous 12.2.13 cluster, right?

#5 Updated by Greg Farnum 4 months ago

  • Status changed from New to Duplicate

Oh sorry I didn't look at that PR. It is the correct fix; if we do another luminous point release it should show up or you can pull it in yourself. :)
(Note the distinction that it only changes the rules during the connect phase!)

#6 Updated by Greg Farnum 4 months ago

  • Duplicates Bug #36612: msg/async: connection stall added

#7 Updated by 相洋 4 months ago

Our cluster is running Luminous 12.2.12.
I do not think PR https://github.com/ceph/ceph/pull/25343 can solve the problem in our case.
So could you reconsider my PR: https://github.com/ceph/ceph/pull/30609?
Or can you suggest a way to resolve the problem in other modules?

#8 Updated by 相洋 4 months ago

Greg Farnum wrote:

Okay, so the issue here is that osd.1 managed to reconnect to osd.5 and osd.9 without triggering a wider reset of the PG state that canceled the ongoing background (scrub) operations. osd.{5,9} should have detected the OSDMap version mismatches and rejected ops from osd.1.

osd.{5,9} accepted the later ops because they had not yet committed the new osdmap, even though they had already dropped their connections to osd.1.

This is running on a Luminous 12.2.13 cluster, right?
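
The check described in the quoted comment, and the reason given in the reply for why it did not fire, can be sketched with a small model. Everything here is hypothetical (`ReplicaView`, `down_since`, `accept_op`, and the epoch numbers); it is not the OSD's actual map-checking code, only an illustration of a replica that compares an incoming op against the map epoch it has committed.

```python
from dataclasses import dataclass


@dataclass
class ReplicaView:
    """What a replica believes about the cluster (hypothetical model)."""
    committed_epoch: int
    down_since: dict  # osd id -> epoch at which it was marked down

    def accept_op(self, sender: int, op_epoch: int) -> bool:
        marked_down_at = self.down_since.get(sender)
        if marked_down_at is None or marked_down_at > self.committed_epoch:
            # The replica's committed map still shows the sender as up.
            return True
        # Sender was marked down at or before our committed map: only ops
        # stamped with a later epoch are acceptable.
        return op_epoch > marked_down_at


# The map marking osd.1 down (epoch 101) has arrived but is not committed yet.
view = ReplicaView(committed_epoch=100, down_since={1: 101})
print(view.accept_op(sender=1, op_epoch=100))  # True: the stale op slips through

# Once the map is committed, the same op is rejected.
view.committed_epoch = 101
print(view.accept_op(sender=1, op_epoch=100))  # False
```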

#9 Updated by 相洋 4 months ago

https://tracker.ceph.com/issues/22570

@Greg, my problem is related to this tracker issue.

The problem can be considered resolved and this issue closed.
