Project

General

Profile

Actions

Bug #65491

open

recover_missing: racing read got wrong version

Added by Matan Breizman 15 days ago. Updated 9 days ago.

Status:
In Progress
Priority:
Urgent
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2024-04-11T12:39:43.111 INFO:tasks.rados.rados.0.smithi112.stderr:Error: racing read on 65 returned version 12 rather than version 11

See requests 16777355 and 16777354 from osd.1 pg[3.2]:

12:39:31,074 v 18'9 active+clean           ClientRequest::with_pg_int: client_request(id=16777353).0: pg active, entering process[_pg]_op
12:39:31,074 v 18'9 active+clean           ClientRequest::with_pg_int: client_request(id=16777353).0: pg active, entering process[_pg]_op
12:39:31,074                               do_recover_missing check for recovery, 3:414dc9d4:::smithi11232367-65:head
12:39:31,077 v 18'9 active+clean           ClientRequest::with_pg_int: client_request(id=16777354).0: pg active, entering process[_pg]_op
12:39:31,077 v 18'9 active+clean           ClientRequest::with_pg_int: client_request(id=16777355).0: pg active, entering process[_pg]_op
12:39:31,077 v 18'9 active+clean           ClientRequest::with_pg_int: client_request(id=16777355).0: pg active, entering process[_pg]_op
12:39:31,090                               do_recover_missing check for recovery, 3:414dc9d4:::smithi11232367-65:head
12:39:31,090                               do_recover_missing check for recovery, 3:414dc9d4:::smithi11232367-65:head
12:39:31,152 v 18'10 active+clean          ClientRequest::process_op:  client_request(id=16777353).0: checking already_complete
12:39:31,153                               do_recover_missing check for recovery, 3:414dc9d4:::smithi11232367-65:head
12:39:31,153 v 18'10 active+clean          ClientRequest::process_op:  client_request(id=16777354).0: checking already_complete
12:39:31,153                               do_recover_missing check for recovery, 3:414dc9d4:::smithi11232367-65:head
12:39:31,153 v 18'10 active+clean          ClientRequest::process_op:  client_request(id=16777353).0: entering process stage
12:39:31,154 v 18'10 active+clean          ClientRequest::process_op:  client_request(id=16777353).0: in process stage, calling do_process
12:39:31,220 v 18'11 active+clean          ClientRequest::with_pg_int: client_request(id=16777353).0: process[_pg]_op complete, completing handle
12:39:31,220 v 18'11 active+clean          ClientRequest::with_pg_int: client_request(id=16777353).0: process[_pg]_op complete, completing handle
12:39:31,221 v 18'11 active+clean          ClientRequest::with_pg_int: client_request(id=16777353).0: exit
12:39:31,221 v 18'11 active+clean          ClientRequest::process_op:  client_request(id=16777355).0: checking already_complete
12:39:31,221 v 18'11 active+clean          ClientRequest::process_op:  client_request(id=16777355).0: checking already_complete
12:39:31,221 v 18'11 active+clean          ClientRequest::process_op:  client_request(id=16777354).0: entering process stage
12:39:31,222 v 18'11 active+clean          ClientRequest::process_op:  client_request(id=16777354).0: in process stage, calling do_process

***Version bump***
12:39:31,224 v 18'11 active+clean          PeeringState::append_log append_log log((0'0,18'11], crt=18'12) {18'12 (18'12) modify   3:414dc9d4:::smithi11232367-65:head
12:39:31,225 v 18'12 active+clean          PeeringState::add_log_entry add_log_entry 18'12 (18'12) modify   3:414dc9d4:::smithi11232367-65:head 
******************

***************************************************
***** OSD.3 (not this) was killed and revived *****

12:39:31.250 INFO:tasks.thrashosds.thrasher:Killing osd 3, live_osds are [0, 1, 2, 3]
12:39:36.322 INFO:tasks.thrashosds.thrasher:Reviving osd 3

***************************************************
***************************************************

12:39:40,928 v 18'12 unknown               ClientRequest::Orderer::requeue: requeueing client_request(id=16777354)
12:39:40,928 v 18'12 unknown               ClientRequest::Orderer::requeue: requeueing client_request(id=16777354)
12:39:40,929 v 18'12 unknown               ClientRequest::Orderer::requeue: requeueing client_request(id=16777355)
12:39:40,935 v 18'12 peering               ClientRequest::with_pg_int: client_request(id=16777355).0: interrupted due to crimson::common::actingset_changed (acting set changed)

12:39:40,937 v 18'12 peering               ClientRequest::with_pg_int: client_request(id=16777354).0: interrupted due to crimson::common::actingset_changed (acting set changed)
12:39:40,937 v 18'12 peering               ClientRequest::with_pg_int: client_request(id=16777354).0: interrupted due to crimson::common::actingset_changed (acting set changed)
12:39:40,956 v 18'12 peering               ClientRequest::with_pg_int: client_request(id=16777354).0: exit

12:39:41,838 v 18'12 unknown               ClientRequest::Orderer::requeue: requeueing client_request(id=16777354)
12:39:41,839 v 18'12 unknown               ClientRequest::Orderer::requeue: requeueing client_request(id=16777355)

12:39:41,857 v 18'12 unknown               ClientRequest::with_pg_int: client_request(id=16777354).1: interrupted due to crimson::common::actingset_changed (acting set changed)
12:39:41,858 v 18'12 unknown               ClientRequest::with_pg_int: client_request(id=16777354).1: exit

12:39:41,871 v 18'12 peering               ClientRequest::with_pg_int: client_request(id=16777355).1: interrupted due to crimson::common::actingset_changed (acting set changed)
12:39:41,872 v 18'12 peering               ClientRequest::with_pg_int: client_request(id=16777355).1: exit

12:39:42,985 v 18'12 active+degraded       ClientRequest::with_pg_int: client_request(id=16777354).2: pg active, entering process[_pg]_op
12:39:42,986                               do_recover_missing check for recovery, 3:414dc9d4:::smithi11232367-65:head
12:39:42,986                               do_recover_missing need to wait for recovery, 3:414dc9d4:::smithi11232367-65:head
12:39:42,986 v 18'12 active+degraded       ClientRequest::with_pg_int: client_request(id=16777355).2: pg active, entering process[_pg]_op
12:39:42,986 v 18'12 active+degraded       BackgroundRecoveryT<T>::start: start
12:39:42,986 v 18'12 active+degraded       UrgentRecovery::do_recovery: do_recovery
12:39:43,104 v 18'12 active+recovery_wait  ClientRequest::process_op:  client_request(id=16777354).2: checking already_complete
12:39:43,105 v 18'12 active+recovery_wait  ClientRequest::with_pg_int: client_request(id=16777354).2: process[_pg]_op complete, completing handle
12:39:43,105                               do_recover_missing check for recovery, 3:414dc9d4:::smithi11232367-65:head
12:39:43,106 v 18'12 active+recovery_wait  ClientRequest::process_op:  client_request(id=16777355).2: checking already_complete
12:39:43,106 v 18'12 active+recovery_wait  ClientRequest::with_pg_int: client_request(id=16777354).2: exit
12:39:43,106 v 18'12 active+recovery_wait  ClientRequest::process_op:  client_request(id=16777355).2: entering process stage
12:39:43,107 v 18'12 active+recovery_wait  ClientRequest::process_op:  client_request(id=16777355).2: in process stage, calling do_process
12:39:43,112 v 18'12 active+recovery_wait  ClientRequest::with_pg_int: client_request(id=16777355).2: process[_pg]_op complete, completing handle
12:39:43,113 v 18'12 active+recovery_wait  ClientRequest::with_pg_int: client_request(id=16777355).2: exit
12:39:43,117 v 18'12 active+recovering     BackgroundRecoveryT<T>::start: start
12:39:43,117 v 18'12 active+recovering     BackgroundRecoveryT<T>::start: start
12:39:43,119 v 18'12 active+recovering     PeeringState::needs_recovery is recovered
12:39:43,119 v 18'12 active+recovering     PeeringState::needs_backfill does not need backfill
12:39:43,121 v 18'12 active                PeeringState::needs_recovery is recovered

Hypothesis 1:
Note: 'entering process[_pg]_op' is called before 'recover_missing'.
16777354 and 16777355 are in 'entering process[_pg]_op' concurrently, UrgentRecovery::do_recovery may have been scheduled by 16777355 and not by 16777354 (reordered) was in 'wait for recovery'.

16777354: entering process[_pg]_op
16777355: entering process[_pg]_op
_Unkownid_ : UrgentRecovery::do_recovery: do_recovery
16777354: process[_pg]_op complete, completing handle
16777355: process[_pg]_op complete, completing handle

Unkownid: reqid is not printed (Added logs in first comment).

Hypothesis 2:
See: 'Version bump'. Version was bumped to 12 and then both requests were requeued (requeueing client_request). When both started originally started with 11.
Need more info..

More info:

Sanitized log notes:
  • client_request(id=16777354, detail=m=[osd_op(client.4281.0:268 3.2 3:414dc9d4:::smithi11232367-65:head {setxattr _header (56) in=63b, truncate 3075648} snapc 0={} ondisk+write+known_if_redirected+supports_pool_eio e17)])
  • client_request(id=16777355, detail=m=[osd_op(client.4281.0:269 3.2 3:414dc9d4:::smithi11232367-65:head {read 0~1} snapc 0={} ondisk+read+rwordered+known_if_redirected+supports_pool_eio e17)])

osd.1 handling the read req:

DEBUG 2024-04-11 12:39:43,107 [shard 2:main] osd - do_osd_ops_execute: object 3:414dc9d4:::smithi11232367-65:head - handling op read

See: https://gist.github.com/Matan-B/02e88c375760c1cb29f87d39de9122c9

https://pulpito.ceph.com/matan-2024-04-11_11:23:27-crimson-rados-wip-matanb-crimson-replicated-testing-distro-crimson-smithi/7652230
https://pulpito.ceph.com/matan-2024-04-11_11:23:27-crimson-rados-wip-matanb-crimson-replicated-testing-distro-crimson-smithi/7652234/

Actions #1

Updated by Matan Breizman 15 days ago

  • Subject changed from UrgentRecovery: racing read got wrong version to UrgentRecovery: racing read got wrong version (Request reorder)
  • Description updated (diff)
  • Priority changed from Normal to High
Actions #2

Updated by Matan Breizman 15 days ago

  • Description updated (diff)
Actions #3

Updated by Matan Breizman 15 days ago

  • Pull request ID deleted (56832)

WIP

Actions #4

Updated by Matan Breizman 15 days ago

  • Subject changed from UrgentRecovery: racing read got wrong version (Request reorder) to recover_missing: racing read got wrong version (Request reorder)
  • Priority changed from High to Urgent
Actions #5

Updated by Matan Breizman 14 days ago

  • Description updated (diff)
Actions #6

Updated by Matan Breizman 14 days ago

  • Description updated (diff)
Actions #7

Updated by Matan Breizman 14 days ago

  • Description updated (diff)
Actions #8

Updated by Matan Breizman 14 days ago

Not a fix yet, bug I added few missing log lines that may help here:
https://github.com/ceph/ceph/pull/56916/commits/92a97ecd4eecd43f785dbd7c2b581b7316a35ca9

Actions #9

Updated by Matan Breizman 14 days ago

  • Description updated (diff)
Actions #10

Updated by Matan Breizman 14 days ago

  • Description updated (diff)
Actions #11

Updated by Matan Breizman 9 days ago

  • Subject changed from recover_missing: racing read got wrong version (Request reorder) to recover_missing: racing read got wrong version

Hypothesis 2:
See: 'Version bump'. Version was bumped to 12 and then both requests were requeued (requeueing client_request). When both started originally started with 11.

16777354 and 16777355 are sent from different clients (client.4281.0:268 and client.4281.0:269).
Request 16777355 was in `ClientRequest::process_op` when the last_update of the pg was 11 .
It looks like the request was requeued and the replied version was already 12 .
Looking whether requeueing may cause wrong version to be returned.

Actions

Also available in: Atom PDF