Bug #42058
closed
OSD reconnected across map epochs, inconsistent pg logs created
Description
Take the lossless cluster connection between osd.2 and osd.47 as an example.
Suppose osd.47 is restarted while osd.2 has an op that needs to be sent to osd.47.
osd.2 then shuts down the current connection and begins reconnecting to osd.47.
Once osd.47 is up, osd.2 sends a connect message to osd.47, and osd.47 replies with a reset session message.
osd.2 then deletes the messages in its out queue, and the op is lost.
In our Luminous cluster we hit a rare case (3 replicas; somewhat complex and hard to describe) in which one OSD in a placement group had one more op than the others. We only discovered this when scrubbing. Eventually we found that the op was lost at the client endpoint when it received a reset session message.
All in all, I think that when the connection is lossless, the server endpoint should not send a reset session message even if the client endpoint's csq is not 0; it would be better to send a retry session command with csq set to 0.
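The difference between the two replies can be sketched with a toy model. This is only an illustration under stated assumptions, not the AsyncMessenger code: the class and function names are hypothetical, "csq" is assumed to mean the connect sequence number, and the tag strings merely mirror the reset session and retry session replies mentioned above.

```python
from dataclasses import dataclass, field

# Toy reply tags; the strings only mirror the replies named in the report.
RESETSESSION = "RESETSESSION"
RETRY_SESSION = "RETRY_SESSION"
READY = "READY"


@dataclass
class Client:
    """Connecting endpoint (osd.2 in the description). Hypothetical model."""
    connect_seq: int
    out_q: list = field(default_factory=list)

    def handle_reply(self, tag):
        if tag == RESETSESSION:
            # The session is treated as new: queued messages are dropped.
            self.out_q.clear()
            self.connect_seq = 0
        elif tag == RETRY_SESSION:
            # Keep the queue and retry with connect_seq set to 0.
            self.connect_seq = 0


def server_reply(client_connect_seq, lossless, proposed):
    """Reply of a freshly restarted server that has no existing session."""
    if client_connect_seq == 0:
        return READY
    if proposed and lossless:
        return RETRY_SESSION  # behaviour suggested in this report
    return RESETSESSION       # behaviour observed in this report


def reconnect(client, lossless, proposed):
    """Run the handshake until the server accepts; return the tags seen."""
    tags = []
    while True:
        tag = server_reply(client.connect_seq, lossless, proposed)
        tags.append(tag)
        if tag == READY:
            return tags
        client.handle_reply(tag)


observed = Client(connect_seq=3, out_q=["op"])
print(reconnect(observed, lossless=True, proposed=False), observed.out_q)

suggested = Client(connect_seq=3, out_q=["op"])
print(reconnect(suggested, lossless=True, proposed=True), suggested.out_q)
```

In this model the only difference between the two paths is whether the client clears its out queue before retrying, which is the point of the suggestion above.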
Updated by Greg Farnum over 4 years ago
- Status changed from New to In Progress
- Pull request ID set to 30609
Updated by 相洋 于 over 4 years ago
Assume pg 1.1a maps to OSDs [1,5,9], with osd1 as the primary OSD.
Time 1: osd1, osd5, and osd9 were online and could send messages to each other.
Time 2: osd5 and osd9 received a new osdmap that showed osd.1 as down. At the same time, osd1's public network was taken down manually (physically down), but osd.1's cluster network was still online.
Time 3:
Because they received a new osdmap showing osd1 as down, osd5 and osd9 shut down their connections to osd1 (through mark_down()), so they had no existing connections for osd1.
On osd.1, the connections to osd.5/osd.9 encountered a failure (they were disconnected explicitly by osd.5/osd.9) and were about to enter the STANDBY state. As a consequence, these connections still existed (their cs_seq > 0).
After a short while, osd1 generated two scrub operations (deep-scrub enabled) that updated the version info of some objects (scrub_snapshot_metadata()), and it needed to re-establish connections to osd5 and osd9. When osd1 sent the first operation, op1 (via send_message()), the cluster messenger started reconnecting to osd5/osd9 and placed op1 in out_q. Before the connection entered STATE_OPEN, a RESETSESSION was exchanged between osd1 and osd5/osd9, which led osd1 to discard the messages in out_q (via was_session_reset()). After the connection was established, osd1 sent the second operation, op2, to osd5/osd9.
Eventually, two pg log entries were recorded on osd1 (op1, op2), but only one (op2) on osd5/osd9.
Time 4: osd1's public network recovered soon afterwards. During pg peering, the primary OSD (osd1) could not find any difference in the pg logs of osd5 and osd9. When the deep scrub of pg 1.1a finished, it reported an inconsistency error for the object version info associated with op1.
This is a rare situation that we ran into. In some cases, I think this could cause messages to go out of order. If I have misdiagnosed it, please tell me.
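The timeline above can be replayed with a small simulation. It is a rough sketch, not Ceph code: the function names are hypothetical, and it assumes (as a simplification) that peering only compares the newest log entry of each OSD, which would explain why the missing op1 went unnoticed until deep scrub.

```python
def replicate(ops, reset_during_connect):
    """Replay ops from the primary to one replica over a toy connection.

    Returns (primary_log, replica_log). If reset_during_connect is True,
    whatever sits in out_q during the first (re)connect is discarded,
    mimicking the RESETSESSION handling described above.
    """
    primary_log, replica_log, out_q = [], [], []
    connected = False
    for op in ops:
        primary_log.append(op)   # the primary records the op locally
        out_q.append(op)         # and queues it for the replica
        if not connected:
            if reset_during_connect:
                out_q.clear()    # session reset: queued messages are dropped
            connected = True
        replica_log.extend(out_q)  # everything still queued is delivered
        out_q.clear()
    return primary_log, replica_log


def heads_match(log_a, log_b):
    """Simplified peering check (assumption): compare only the newest entry."""
    return bool(log_a) and bool(log_b) and log_a[-1] == log_b[-1]


primary, replica = replicate(["op1", "op2"], reset_during_connect=True)
missing = [op for op in primary if op not in replica]
print(primary, replica, heads_match(primary, replica), missing)
```

With the reset enabled, the logs end with the same entry even though the replica never saw op1, so a head-only comparison reports no difference while a full comparison of the logs does.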
Updated by 相洋 于 over 4 years ago
See PR: https://github.com/ceph/ceph/pull/25343, which also avoids triggering RESETSESSION.
Updated by Greg Farnum over 4 years ago
- Project changed from Messengers to RADOS
- Subject changed from 【msg/async】 bad ression at osd cluster messenger to OSD reconnected across map epochs, inconsistent pg logs created
- Category changed from AsyncMessenger to Peering
- Status changed from In Progress to New
- Component(RADOS) OSD added
Okay, so the issue here is that osd.1 managed to reconnect to osd.5 and osd.9 without triggering a wider reset of the PG state that would have canceled the ongoing background (scrub) operations. osd.{5,9} should have detected the OSDMap version mismatches and rejected ops from osd.1.
This is running on a Luminous 12.2.13 cluster, right?
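The kind of epoch check described in this comment might look roughly like the sketch below. It is a hypothetical illustration only: the function and parameter names are invented for this example and do not correspond to the OSD's actual code path.

```python
def accept_replica_op(op_epoch, sender_marked_down_epoch):
    """Decide whether a replica should accept an op from a peer.

    op_epoch: OSDMap epoch the sender stamped on the op (hypothetical field).
    sender_marked_down_epoch: epoch of the map in which this replica saw the
        sender marked down, or None if the replica knows of no such map.
    """
    if sender_marked_down_epoch is None:
        # The replica has no record of the sender going down.
        return True
    # Ops stamped at or before the "down" epoch come from a stale view.
    return op_epoch > sender_marked_down_epoch


print(accept_replica_op(op_epoch=10, sender_marked_down_epoch=11))
print(accept_replica_op(op_epoch=12, sender_marked_down_epoch=11))
```

Note that such a check only helps if the replica already tracks the map that marked the sender down; when it does not, the first branch accepts the op.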
Updated by Greg Farnum over 4 years ago
- Status changed from New to Duplicate
Oh sorry I didn't look at that PR. It is the correct fix; if we do another luminous point release it should show up or you can pull it in yourself. :)
(Note the distinction that it only changes the rules during the connect phase!)
Updated by Greg Farnum over 4 years ago
- Is duplicate of Bug #36612: msg/async: connection stall added
Updated by 相洋 于 over 4 years ago
Our cluster is running on Luminous 12.2.12.
I do not think PR https://github.com/ceph/ceph/pull/25343 can solve the problem in our case.
Could you reconsider my PR, https://github.com/ceph/ceph/pull/30609?
Or can you suggest a way to resolve the problem in other modules?
Updated by 相洋 于 over 4 years ago
Greg Farnum wrote:
Okay, so the issue here is that osd.1 managed to reconnect to osd.5 and osd.9 without triggering a wider reset of the PG state that would have canceled the ongoing background (scrub) operations. osd.{5,9} should have detected the OSDMap version mismatches and rejected ops from osd.1.
osd.{5,9} accepted the later ops because they had not yet committed the new osdmap, even though they had already dropped their connections to osd.1.
This is running on a Luminous 12.2.13 cluster, right?
Updated by 相洋 于 over 4 years ago
https://tracker.ceph.com/issues/22570
@Greg Farnum, my problem is related to this tracker issue.
This issue can be marked as resolved and closed.