Project

General

Profile

Actions

Bug #57698

closed

osd/scrub: "scrub a chunk" requests are sent to the wrong set of replicas

Added by Ronen Friedman over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The Primary registers its intent to scrub with the 'get_actingset()', as it should.
But the actual chunk requests are sent to 'get_acting_recovery_backfill()'.
Now - if there is (or was during this interval) a replica that required backfilling -
the two sets do not match.


Related issues 2 (0 open2 closed)

Copied to RADOS - Backport #58636: quincy: osd/scrub: "scrub a chunk" requests are sent to the wrong set of replicasResolvedRonen FriedmanActions
Copied to RADOS - Backport #58637: pacific: osd/scrub: "scrub a chunk" requests are sent to the wrong set of replicasResolvedRonen FriedmanActions
Actions #1

Updated by Ronen Friedman over 1 year ago

  • Backport set to Pacific, Quincy
Actions #2

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to In Progress

IIRC Ronen has mentioned the scrub code interchanges get_acting_set() and get_acting_recovery_backfill().

Actions #3

Updated by Radoslaw Zarzynski over 1 year ago

What are symptoms? How bad is it? A hang maybe? I'm asking to understand the impact.

Actions #4

Updated by Ronen Friedman over 1 year ago

Following some discussions: here are excerpts from a run demonstrating this issue.
Test run rfriedma-2022-09-28_15:55:59-rados:thrash-wip-rf-tout-distro-default-smithi/7046786.

OSD 2 becomes the upset primary, but is then backfilled. OSD 0 is the acting primary, and initiates
a scrub.

osd 2:

2022-09-28T16:50:35.104+0000 7f806ec78700 10 osd.2 pg_epoch: 131 pg[3.15( v 128'285 (121'281,128'285] lb MIN local-lis/les=110/111 n=37 ec=52/20 lis/c=110/110 les/c/f=111/128/0 sis=129) [7,3,4] r=-1 lpr=129 DELETING pi=[110,129)/1 crt=128'285 lcod 128'283 mlcod 0'0 unknown NOTIFY mbc={} ps=[2~2,6~3,a~5,10~1,15~1,17~2,1b~1,1f~1,23~2]] handle_advance_map [7,3,4]/[7,3,4] -- 7/7
2022-09-28T16:50:35.104+0000 7f806ec78700 10 osd.2 pg_epoch: 132 pg[3.15( v 128'285 (121'281,128'285] lb MIN local-lis/les=110/111 n=37 ec=52/20 lis/c=110/110 les/c/f=111/128/0 sis=129) [7,3,4] r=-1 lpr=129 DELETING pi=[110,129)/1 crt=128'285 lcod 128'283 mlcod 0'0 unknown NOTIFY mbc={} ps=[2~2,6~3,a~5,10~1,15~1,17~2,1b~1,1f~1,23~2]] state<Started>: Started advmap
2022-09-28T16:50:35.104+0000 7f806ec78700 10 osd.2 pg_epoch: 132 pg[3.15( v 128'285 (121'281,128'285] lb MIN local-lis/les=110/111 n=37 ec=52/20 lis/c=110/110 les/c/f=111/128/0 sis=129) [7,3,4] r=-1 lpr=129 DELETING pi=[110,129)/1 crt=128'285 lcod 128'283 mlcod 0'0 unknown NOTIFY mbc={} ps=[2~2,6~3,a~5,10~1,15~1,17~2,1b~1,1f~1,23~2]] check_recovery_sources no source osds () went down

-- after repeeringm with a nwe acting set:

2022-09-28T16:53:29.143+0000 7f806ac70700  1 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2022-09-28T16:53:29.143+0000 7f806ac70700  5 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 unknown mbc={}] exit Start 0.000009 0 0.000000
2022-09-28T16:53:29.143+0000 7f806ac70700  5 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 unknown mbc={}] enter Started/Primary
2022-09-28T16:53:29.143+0000 7f806ac70700  5 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 unknown mbc={}] enter Started/Primary/Peering
2022-09-28T16:53:29.143+0000 7f806ac70700  5 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] enter Started/Primary/Peering/GetInfo
2022-09-28T16:53:29.143+0000 7f806ac70700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] build_prior all_probe 0,3,7
2022-09-28T16:53:29.143+0000 7f806ac70700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] build_prior maybe_rw interval:248, acting: 0,3,7
2022-09-28T16:53:29.143+0000 7f806ac70700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] build_prior final: probe 0,2,3,7 down  blocked_by {}
2022-09-28T16:53:29.143+0000 7f806ac70700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] up_thru 300 < same_since 306, must notify monitor

--

2022-09-28T16:53:29.143+0000 7f806ac70700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] scrubber<NotActive>: on_primary_change (from on_info_history_change (plnd:)): Primary. <in-qu>&<registered> --> <in-qu>&<registered>
2022-09-28T16:53:29.143+0000 7f806ac70700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}]  osd.7 has stray content: 3.15( v 245'389 (230'383,245'389] local-lis/les=248/249 n=17 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306 pruub=15.002688408s)
2022-09-28T16:53:29.143+0000 7f806ac70700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] update_heartbeat_peers 0,2,3 -> 0,2,3,7
2022-09-28T16:53:29.143+0000 7f806ac70700 20 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetInfo>: Adding osd: 7 peer features: 3f01cfbf7ffdffff

--
2022-09-28T16:53:29.146+0000 7f806ec78700  5 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] enter Started/Primary/Peering/GetLog
2022-09-28T16:53:29.146+0000 7f806ec78700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.0 3.15( v 245'389 (230'383,245'389] local-lis/les=248/249 n=15 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306 pruub=15.000658989s)
2022-09-28T16:53:29.146+0000 7f806ec78700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.2 3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306)
2022-09-28T16:53:29.146+0000 7f806ec78700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.3 3.15( v 245'389 (230'383,245'389] local-lis/les=248/249 n=15 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306 pruub=15.001117706s)
2022-09-28T16:53:29.146+0000 7f806ec78700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting all_info osd.7 3.15( v 245'389 (230'383,245'389] local-lis/les=248/249 n=17 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306 pruub=15.002688408s)
2022-09-28T16:53:29.146+0000 7f806ec78700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] select_replicated_primary newest update on osd.0 with 3.15( v 245'389 (230'383,245'389] local-lis/les=248/249 n=15 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306 pruub=15.000658989s)
2022-09-28T16:53:29.146+0000 7f806ec78700 20 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_async_recovery_replicated candidates by cost are:
2022-09-28T16:53:29.146+0000 7f806ec78700 20 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_async_recovery_replicated result want=[0,3,7] async_recovery=
2022-09-28T16:53:29.146+0000 7f806ec78700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] choose_acting want [0,3,7] != acting [2,3,0], requesting pg_temp change
2022-09-28T16:53:29.146+0000 7f806ec78700  5 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering/GetLog 0.000089 0 0.000000
2022-09-28T16:53:29.146+0000 7f806ec78700 10 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Leaving Peering
2022-09-28T16:53:29.146+0000 7f806ec78700  5 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering 0.003240 0 0.000000
2022-09-28T16:53:29.146+0000 7f806ec78700  5 osd.2 pg_epoch: 306 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=306 pi=[248,306)/1 crt=0'0 mlcod 0'0 unknown mbc={}] enter Started/Primary/WaitActingChange

--

2022-09-28T16:53:30.142+0000 7f806ac70700 10 osd.2 pg_epoch: 307 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0]/[0,3,7] r=-1 lpr=307 pi=[248,307)/1 crt=0'0 mlcod 0'0 remapped mbc={}]  noting past ([248,306] all_participants=0,2,3,7 intervals=([248,305] acting 0,3,7))
2022-09-28T16:53:30.142+0000 7f806ac70700 20 osd.2 pg_epoch: 307 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=307) [2,3,0]/[0,3,7] r=-1 lpr=307 pi=[248,307)/1 crt=0'0 mlcod 0'0 remapped mbc={}] on_new_interval
2022-09-28T16:53:30.142+0000 7f806ac70700 20 osd.2 pg_epoch: 307 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=307) [2,3,0]/[0,3,7] r=-1 lpr=307 pi=[248,307)/1 crt=0'0 mlcod 0'0 remapped mbc={}] on_new_interval upacting_features 0x3f01cfbf7ffdffff from [0,3,7]+[2,3,0]

--

on primary change (no longer the Primary)

2022-09-28T16:53:30.142+0000 7f806ac70700  1 osd.2 pg_epoch: 307 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=307) [2,3,0]/[0,3,7] r=-1 lpr=307 pi=[248,307)/1 crt=0'0 mlcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray

2022-09-28T16:53:30.142+0000 7f806ac70700 20 osd.2 pg_epoch: 307 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0] r=0 lpr=307 pi=[248,306)/1 crt=0'0 mlcod 0'0 unknown mbc={}] new interval newup [2,3,0] newacting [0,3,7]

2022-09-28T16:53:30.143+0000 7f806ac70700 10 osd.2 pg_epoch: 307 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=307) [2,3,0]/[0,3,7] r=-1 lpr=307 pi=[248,307)/1 crt=0'0 mlcod 0'0 remapped NOTIFY mbc={}] do_peering_event: epoch_sent: 307 epoch_requested: 307 MQuery 3.15 from 0 query_epoch 307 query: query(info 0'0 epoch_sent 307)
2022-09-28T16:53:30.143+0000 7f806ac70700 15 osd.2 scrub-queue::remove_from_osd_queue removing pg[3.15] from OSD scrub queue
2022-09-28T16:53:30.143+0000 7f806ac70700  5 osd.2 scrub-queue::remove_from_osd_queue removing pg[3.15] failed. State was: unregistering
2022-09-28T16:53:30.143+0000 7f806ac70700 10 osd.2 pg_epoch: 307 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=307) [2,3,0]/[0,3,7] r=-1 lpr=307 pi=[248,307)/1 crt=0'0 mlcod 0'0 remapped NOTIFY mbc={}] scrubber<NotActive>: on_primary_change (from on_info_history_change (plnd:)): Replica/other. <in-qu>&<unregister> --> <in-qu>&<unregistering>
2022-09-28T16:53:30.143+0000 7f806ac70700 10 osd.2 pg_epoch: 307 pg[3.15( empty local-lis/les=0/0 n=0 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=307) [2,3,0]/[0,3,7] r=-1 lpr=307 pi=[248,307)/1 crt=0'0 mlcod 0'0 remapped NOTIFY mbc={}] sending info

-----------

2022-09-28 16:53:31.149871800+00:00 -> 1664384011.1498718 osd.0 (osd.0) 63 : cluster [DBG] 3.15 starting backfill to osd.2 from (0'0,0'0] MAX to 245'389

2022-09-28 16:53:36.863501000+00:00 -> 1664384016.863501 osd.0 (osd.0) 65 : cluster [DBG] 3.15 scrub starts

OSD 0 - the scrubbing primary
-----------------------------

2022-09-28T16:53:30.141+0000 7fdd109f4700 10 osd.0 pg_epoch: 307 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=248/249 n=15 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=306) [2,3,0]/[0,3,7] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 remapped NOTIFY mbc={} ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] start_peering_interval: check_new_interval output: check_new_interval interval(306-306 up [2,3,0](2) acting [2,3,0](2)) up_thru 300 up_from 12 l\
ast_epoch_clean 249 interval(306-306 up [2,3,0](2) acting [2,3,0](2)) : primary up 12-300 does not include interval

--
2022-09-28T16:53:30.142+0000 7fdd109f4700 10 osd.0 pg_epoch: 307 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=248/249 n=15 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 remapped NOTIFY mbc={} ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] scrubber<NotActive>: on_primary_change (from on_info_history_change (plnd:)): Primary. <in-qu>&<registered> --> <in-qu>&<registered>
2022-09-28T16:53:30.142+0000 7fdd109f4700  1 osd.0 pg_epoch: 307 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=248/249 n=15 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 remapped NOTIFY mbc={} ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] start_peering_interval up [2,3,0] -> [2,3,0], acting [2,3,0] -> [0,3,7], acting_primary 2 -> 0, up_primary 2 -> 2, role 2 -> 0, features acting \
4540138320759226367 upacting 4540138320759226367
--

pool change?

2022-09-28T16:53:30.142+0000 7fdd109f4700 10 osd.0 pg_epoch: 307 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=248/249 n=15 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 remapped mbc={} ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] plpg_on_pool_change

--
2022-09-28T16:53:31.147+0000 7fdd109f4700 10 osd.0 pg_epoch: 308 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=248/249 n=15 ec=52/20 lis/c=248/248 les/c/f=249/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 remapped mbc={} ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] state<Started/Primary/Active>: In Active, about to call activate
--
2022-09-28T16:53:31.167+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 308 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped mbc={} trimq=[e2~1] ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] update_calc_stats actingset 0,3,7 upset 0,2,3 acting_recovery_backfill 0,2,3,7
2022-09-28T16:53:31.167+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 308 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped mbc={} trimq=[e2~1] ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] update_calc_stats shard 0 primary objects 15 missing 0
2022-09-28T16:53:31.167+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 308 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped mbc={} trimq=[e2~1] ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] update_calc_stats shard 2 objects 0 missing 15
2022-09-28T16:53:31.167+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 308 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped mbc={} trimq=[e2~1] ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] update_calc_stats shard 3 objects 15 missing 0
2022-09-28T16:53:31.167+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 308 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped mbc={} trimq=[e2~1] ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] update_calc_stats shard 7 objects 17 missing 0
2022-09-28T16:53:31.167+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 308 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped mbc={} trimq=[e2~1] ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] update_calc_stats object_location_counts {0,3,7=15}
--
2022-09-28T16:53:31.167+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 308 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped mbc={} trimq=[e2~1] ps=[c8~1,cf~1,d1~1,d4~1,d7~1,d9~1,dc~1,e0~1]] update_calc_stats actingset 0,3,7 upset 0,2,3 acting_recovery_backfill 0,2,3,7
--

2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats actingset 0,3,7 upset 0,2,3 acting_recovery_backfill 0,2,3,7
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats shard 0 primary objects 15 missing 0
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats shard 2 objects 0 missing 15
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats shard 3 objects 15 missing 0
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats shard 7 objects 17 missing 0
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats object_location_counts {0,3,7=15}
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats missing shard 0 missing= 0
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats missing shard 3 missing= 0
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats missing shard 2 missing= 15
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats acting shard 7 missing= 0
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats degraded 0
2022-09-28T16:53:33.168+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 310 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling rops=1 mbc={} trimq=[e2~1] ps=[c8~1,dc~1,e0~1]] update_calc_stats misplaced 15
--

2022-09-28T16:53:36.394+0000 7fdd109f4700 10 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped+backfilling mbc={255={}} trimq=[e2~1,e4~3] ps=[c8~1,dc~1,e0~1]] needs_recovery is recovered
2022-09-28T16:53:36.394+0000 7fdd109f4700 10 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped mbc={255={}} trimq=[e2~1,e4~3] ps=[c8~1,dc~1,e0~1]] recovery done, backfill done
2022-09-28T16:53:36.394+0000 7fdd109f4700 10 osd.0 313 do_recovery started 0/1 on pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/248 les/c/f=308/249/0 sis=307) [2,3,0]/[0,3,7] backfill=[2] r=0 lpr=307 pi=[248,307)/1 crt=245'389 lcod 234'387 mlcod 0'0 active+remapped mbc={255={}} trimq=[e2~1,e4~3] ps=[c8~1,dc~1,e0~1]]

--

2022-09-28T16:53:36.395+0000 7fdd161ff700 20 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped trimq=[e2~1,e4~3] ps=[c8~1,dc~1,e0~1]] update_calc_stats actingset 0,3,7 upset 0,2,3 acting_recovery_backfill 0,2,3,7
--

and now we are clean enough to scrub:

2022-09-28T16:53:36.399+0000 7fdd0c9ec700 10 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] snap_trimmer complete
2022-09-28T16:53:36.862+0000 7fdd2ac3b700 15 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] sched_scrub pg(3.15) <active> <clean>
2022-09-28T16:53:36.862+0000 7fdd2ac3b700 10 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] validate_scrub_mode pg: 3.15 allow: 1/0 deep errs: 0 auto-repair: 0 (0)
2022-09-28T16:53:36.862+0000 7fdd2ac3b700 10 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] is_time_for_deep: need-auto? false allowed? true/false deep-errors? false last_deep_scrub_stamp 2022-09-28T16:47:52.198294+0000
2022-09-28T16:53:36.862+0000 7fdd2ac3b700 20 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] validate_periodic_mode: allowed:true/false t.f.d:false req:(plnd:)
2022-09-28T16:53:36.862+0000 7fdd2ac3b700 20 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] validate_periodic_mode will do shallow scrub (time_for_deep = 0)

--

404650:2022-09-28T16:53:36.862+0000 7fdd109f4700 10 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped+scrubbing [ 3.15:  ]  ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] scrubber<ReservingReplicas>: reserve_replicas
2404652:2022-09-28T16:53:36.862+0000 7fdd109f4700 10 osd.0 ep: 313 scrubber::ReplicaReservations pg[3.15]: ReplicaReservations: reserve 3
2404663:2022-09-28T16:53:36.862+0000 7fdd109f4700 10 osd.0 ep: 313 scrubber::ReplicaReservations pg[3.15]: ReplicaReservations: reserve 7

2022-09-28T16:53:36.867+0000 7fdd0c9ec700 15 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped+scrubbing [ 3.15:  ]  ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] scrubber<Act/WaitReplicas>:  b.e.: match_in_shards: auth_list: [0, 3, 7] #: 3; obj-errs#: 0
2022-09-28T16:53:36.867+0000 7fdd0c9ec700 15 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped+scrubbing [ 3.15:  ]  ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] scrubber<Act/WaitReplicas>:  b.e.: inconsistents: object errors #: 0  auth list #: 3  cur_missing #: 1  cur_incon #: 0
2022-09-28T16:53:36.867+0000 7fdd0c9ec700 -1 log_channel(cluster) log [ERR] : 3.15 shard 2 3:a85a0636:::smithi110102413-602:head : missing
2022-09-28T16:53:36.867+0000 7fdd0c9ec700 20 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped+scrubbing [ 3.15:  ]  ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] scrubber<Act/WaitReplicas>:  b.e.: select_auth_object: 3:a88a6498:::smithi110102413-164:head shard 0 got:{shard-usable: soid:3:a88a6498:::smithi110102413-164:head {txt:} }

we did not ask from [2]!
2022-09-28T16:53:36.864+0000 7fdd0c9ec700 10 osd.0 pg_epoch: 313 pg[3.15( v 245'389 (230'383,245'389] local-lis/les=307/308 n=15 ec=52/20 lis/c=307/307 les/c/f=308/313/0 sis=307) [2,3,0]/[0,3,7] r=0 lpr=307 crt=245'389 lcod 234'387 mlcod 0'0 active+clean+remapped+scrubbing [ 3.15:  ]  ps=[c8~1,dc~1,e0~1,e2~1,e4~3]] scrubber<Act/WaitLastUpdate>: _request_scrub_map scrubmap from osd.3 shallow
2

Actions #5

Updated by Neha Ojha over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 48347
Actions #6

Updated by Ronen Friedman over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #7

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #8

Updated by Neha Ojha over 1 year ago

  • Backport changed from Pacific, Quincy to pacific,quincy
Actions #9

Updated by Neha Ojha about 1 year ago

  • Copied to Backport #58636: quincy: osd/scrub: "scrub a chunk" requests are sent to the wrong set of replicas added
Actions #10

Updated by Neha Ojha about 1 year ago

  • Copied to Backport #58637: pacific: osd/scrub: "scrub a chunk" requests are sent to the wrong set of replicas added
Actions #11

Updated by Neha Ojha about 1 year ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF