Bug #59049


WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event

Added by Laura Flores about 1 year ago. Updated 11 months ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203358

2023-03-11T10:28:47.074 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.263 INFO:teuthology.orchestra.run.smithi149.stdout:1678530338.9736485 osd.3 (osd.3) 104 : cluster [WRN] WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:28:47.263 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2023-03-11T10:28:47.264 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.280 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.338 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.392 INFO:teuthology.orchestra.run.smithi149.stdout:1678530338.9736485 osd.3 (osd.3) 104 : cluster [WRN] WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event

/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203358/remote/smithi149/log/ceph-osd.3.log.gz

2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258) queued
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258)> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258) pg 0x55e1ee71d000
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] forward_scrub_event: DigestUpdate queued at: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> digest_update_notification epoch: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: FSM: WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:25:38.967+0000 7f90a9fb1640  0 log_channel(cluster) log [WRN] : WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event --<< digest_update_notification
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258) queued
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258)> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258) pg 0x55e1ee71d000
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] forward_scrub_event: MapsCompared queued at: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> send_maps_compared epoch: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20  scrubberFSM  event: --vvvv---- MapsCompared
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitDigestUpdate>: FSM: -- state -->> Act/WaitDigestUpdate


Related issues 1 (0 open, 1 closed)

Copied to RADOS - Backport #61345: reef: WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event (Resolved, Samuel Just)
Actions #1

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-03-10_15:57:59-rados-wip-yuri4-testing-2023-03-09-1458-reef-distro-default-smithi/7201721

This was scheduled on a different wip branch, but with https://github.com/ceph/ceph/pull/49443 as the suite branch (same as the example above). It also failed on Ubuntu 22.04 tests, so I'm thinking it could somehow be related.

Actions #2

Updated by Radoslaw Zarzynski about 1 year ago

IIRC you were looking into jammy (Ubuntu 22.04) issues. If you came across any other observations, it might be worth adding them here as well.

Actions #3

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-04-04_15:24:40-rados-wip-yuri4-testing-2023-03-31-1237-distro-default-smithi/7231129

Actions #4

Updated by Radoslaw Zarzynski about 1 year ago

  • Assignee set to Nitzan Mordechai

Not a high priority; good opportunity to learn.

Actions #5

Updated by Nitzan Mordechai about 1 year ago

According to PR https://github.com/ceph/ceph/pull/44050 we can ignore that warning; I'll add it to the log ignorelist.
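
For illustration, such an entry would go into the suite's log-ignorelist override; the snippet below is a hypothetical fragment (the exact file placement and regex are assumptions, not an actual change):

overrides:
  ceph:
    log-ignorelist:
      # hypothetical entry: silence the benign scrubber warning
      - 'WaitReplicas::react\(const DigestUpdate&\): Unexpected DigestUpdate event'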

Actions #6

Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from New to In Progress
Actions #7

Updated by Neha Ojha 12 months ago

Nitzan Mordechai wrote:

According to PR https://github.com/ceph/ceph/pull/44050 we can ignore that warning; I'll add it to the log ignorelist.

Is there a PR for it already?

Actions #8

Updated by Nitzan Mordechai 12 months ago

Neha Ojha wrote:

Nitzan Mordechai wrote:

According to PR https://github.com/ceph/ceph/pull/44050 we can ignore that warning; I'll add it to the log ignorelist.

Is there a PR for it already?

No, I'm still looking at the logs; there are a few comments in the mentioned PR explaining that those warnings are intentional.

Actions #9

Updated by Laura Flores 12 months ago

/a/lflores-2023-04-28_19:31:46-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7257789
/a/yuriw-2023-04-24_22:53:22-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7250370
/a/yuriw-2023-04-24_22:53:22-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7250382

Actions #10

Updated by Laura Flores 12 months ago

/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253199
/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253562

Actions #11

Updated by Laura Flores 12 months ago

/a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255789
/a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255764

Actions #12

Updated by Ronen Friedman 12 months ago

This is an actual bug in the scrub code:

Working with Nitzan, here is what we've found out:

(based on logs from /a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255764)

The relevant part of the Scrubber state machine: WaitReplicas ----> WaitDigestUpdate

In the WaitReplicas state: when all replicas' maps have been received, we perform the sequence of operations that
is the heart of "scrubbing" - looking for problems in the scrub maps, and fixing them.
One of the first steps performed is sending (if needed) a set of digest writes. We then follow
with (again - if needed) other write operations. When all fixes to the current chunk are done, we
queue a MapsCompared scrub state-machine event.

The MapsCompared event, when it reaches the Scrubber, causes a state change into WaitDigestUpdate,
where we ... well ... wait for the digest update operations to conclude. This is signalled by
another FSM event: DigestUpdate.

This is how things should work. The problem is that the DigestUpdate is queued with a possibly
higher priority than the MapsCompared event (due to a recent change). This can result in the DigestUpdate
reaching us while we are still in WaitReplicas - triggering a cluster warning (but not causing any other
problem).

All this is evident in OSD3 logs:

2023-04-30T04:11:59.717+0000 7fcea8a2a640 15 osd.3 91 queue a scrub event (PGScrubMapsCompared(pgid=2.13epoch_queued=91 scrub-token=0)) for pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 luod=91'4991 crt=91'5016 lcod 91'4990 mlcod 91'4990 active+clean+scrubbing+deep [ 2.13:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]. Epoch: 91

...

2023-04-30T04:11:59.721+0000 7fcea4a22640 15 osd.3 91 queue a scrub event (PGScrubDigestUpdate(pgid=2.13epoch_queued=91 scrub-token=0)) for pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]. Epoch: 91

...

2023-04-30T04:11:59.725+0000 7fcea4a22640 10 osd.3 pg_epoch: 91 pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> digest_update_notification epoch: 91
...

2023-04-30T04:11:59.725+0000 7fcea8a2a640 10 osd.3 pg_epoch: 91 pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> send_maps_compared epoch: 91
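
To make the ordering issue concrete, here is a minimal, self-contained boost::statechart sketch (not the actual Ceph scrubber FSM; the state and event names are only borrowed for illustration) showing how a DigestUpdate delivered before MapsCompared is merely logged and discarded while still in WaitReplicas:

// Minimal illustration of the event-ordering race described above.
// NOT the Ceph scrubber code; names are reused for readability only.
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/statechart/custom_reaction.hpp>
#include <boost/mpl/list.hpp>
#include <iostream>

namespace sc = boost::statechart;
namespace mpl = boost::mpl;

// FSM events (mirroring the scrubber events for illustration)
struct MapsCompared : sc::event<MapsCompared> {};
struct DigestUpdate : sc::event<DigestUpdate> {};

struct WaitReplicas;  // initial state
struct ScrubFsm : sc::state_machine<ScrubFsm, WaitReplicas> {};

struct WaitDigestUpdate;

// WaitReplicas expects MapsCompared first. If a DigestUpdate is delivered
// early (e.g. queued at a higher priority), it is only logged and discarded,
// which corresponds to the benign [WRN] seen in the cluster log.
struct WaitReplicas : sc::simple_state<WaitReplicas, ScrubFsm> {
  typedef mpl::list<
      sc::transition<MapsCompared, WaitDigestUpdate>,
      sc::custom_reaction<DigestUpdate>> reactions;

  sc::result react(const DigestUpdate&) {
    std::cout << "WaitReplicas::react(const DigestUpdate&): "
                 "Unexpected DigestUpdate event\n";
    return discard_event();  // harmless: the early event is simply dropped
  }
};

struct WaitDigestUpdate : sc::simple_state<WaitDigestUpdate, ScrubFsm> {
  WaitDigestUpdate() { std::cout << "-- state -->> WaitDigestUpdate\n"; }
};

int main() {
  ScrubFsm fsm;
  fsm.initiate();
  fsm.process_event(DigestUpdate());   // arrives "too early" -> warning
  fsm.process_event(MapsCompared());   // normal transition to WaitDigestUpdate
}

In boost::statechart terms, one way to tolerate such reordering would be to list sc::deferral<DigestUpdate> among WaitReplicas's reactions so the early event is replayed after the transition; that is only an illustration of the mechanism, not necessarily the approach taken in the fix.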

Actions #13

Updated by Radoslaw Zarzynski 12 months ago

  • Priority changed from Normal to Urgent

Raising to Urgent so we don't lose sight of it for Reef.

Actions #14

Updated by Neha Ojha 12 months ago

  • Status changed from In Progress to Fix Under Review
  • Assignee changed from Nitzan Mordechai to Ronen Friedman
  • Pull request ID set to 51426
Actions #15

Updated by Samuel Just 12 months ago

It's not obvious to me from the above why this started popping up in the last few weeks -- have you been able to identify the commit which introduced the regression?

Actions #16

Updated by Neha Ojha 11 months ago

  • Assignee changed from Ronen Friedman to Samuel Just
  • Pull request ID changed from 51426 to 51498
Actions #17

Updated by Laura Flores 11 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #18

Updated by Backport Bot 11 months ago

  • Copied to Backport #61345: reef: WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event added
Actions #19

Updated by Backport Bot 11 months ago

  • Tags set to backport_processed