Bug #59049


WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event

Added by Laura Flores about 1 year ago. Updated 11 months ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203358

2023-03-11T10:28:47.074 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.263 INFO:teuthology.orchestra.run.smithi149.stdout:1678530338.9736485 osd.3 (osd.3) 104 : cluster [WRN] WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:28:47.263 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2023-03-11T10:28:47.264 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.280 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.338 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.392 INFO:teuthology.orchestra.run.smithi149.stdout:1678530338.9736485 osd.3 (osd.3) 104 : cluster [WRN] WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event

/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203358/remote/smithi149/log/ceph-osd.3.log.gz

2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258) queued
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258)> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258) pg 0x55e1ee71d000
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] forward_scrub_event: DigestUpdate queued at: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> digest_update_notification epoch: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: FSM: WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:25:38.967+0000 7f90a9fb1640  0 log_channel(cluster) log [WRN] : WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event --<< digest_update_notification
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258) queued
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258)> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258) pg 0x55e1ee71d000
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] forward_scrub_event: MapsCompared queued at: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> send_maps_compared epoch: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20  scrubberFSM  event: --vvvv---- MapsCompared
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitDigestUpdate>: FSM: -- state -->> Act/WaitDigestUpdate


Related issues 1 (0 open, 1 closed)

Copied to RADOS - Backport #61345: reef: WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event (Resolved, Samuel Just)
Actions #1

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-03-10_15:57:59-rados-wip-yuri4-testing-2023-03-09-1458-reef-distro-default-smithi/7201721

This was scheduled on a different wip branch, but with https://github.com/ceph/ceph/pull/49443 as the suite branch (same as the example above). It also failed on Ubuntu 22.04 tests, so I'm thinking it could somehow be related.

Actions #2

Updated by Radoslaw Zarzynski about 1 year ago

IIRC you were looking into jammy (Ubuntu 22.04) issues. If you came across any other observations, it might be worth adding them here as well.

Actions #3

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-04-04_15:24:40-rados-wip-yuri4-testing-2023-03-31-1237-distro-default-smithi/7231129

Actions #4

Updated by Radoslaw Zarzynski about 1 year ago

  • Assignee set to Nitzan Mordechai

Not a high priority; good opportunity to learn.

Actions #5

Updated by Nitzan Mordechai about 1 year ago

According to PR https://github.com/ceph/ceph/pull/44050 we can ignore that warning; I'll add it to the log ignorelist.
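
For illustration, such an entry would go into the suite's log-ignorelist override; the snippet below is a hypothetical fragment (the exact file placement and regex are assumptions, not an actual change):

overrides:
  ceph:
    log-ignorelist:
      # hypothetical entry: silence the benign scrubber warning
      - 'WaitReplicas::react\(const DigestUpdate&\): Unexpected DigestUpdate event'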

Actions #6

Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from New to In Progress
Actions #7

Updated by Neha Ojha 12 months ago

Nitzan Mordechai wrote:

According to PR https://github.com/ceph/ceph/pull/44050 we can ignore that warning; I'll add it to the log ignorelist.

Is there a PR for it already?

Actions #8

Updated by Nitzan Mordechai 12 months ago

Neha Ojha wrote:

Nitzan Mordechai wrote:

According to PR https://github.com/ceph/ceph/pull/44050 we can ignore that warning; I'll add it to the log ignorelist.

Is there a PR for it already?

No, I'm still looking at the logs; there are a few comments in the mentioned PR explaining that those warnings are intentional.

Actions #9

Updated by Laura Flores 12 months ago

/a/lflores-2023-04-28_19:31:46-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7257789
/a/yuriw-2023-04-24_22:53:22-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7250370
/a/yuriw-2023-04-24_22:53:22-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7250382

Actions #10

Updated by Laura Flores 12 months ago

/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253199
/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253562

Actions #11

Updated by Laura Flores 12 months ago

/a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255789
/a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255764

Actions #12

Updated by Ronen Friedman 12 months ago

This is an actual bug in the scrub code:

Working with Nitzan, here is what we've found out:

(based on logs from /a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255764)

The relevant part of the Scrubber state machine: WaitReplicas ----> WaitDigestUpdate

In the WaitReplicas state: when all replicas' maps have been received, we perform the sequence of operations that
is the heart of "scrubbing" - looking for problems in the scrub maps, and fixing them.
One of the first steps performed is sending (if needed) a set of digest writes. We then follow
with (again - if needed) other write operations. When all fixes to the current chunk are done, we
queue a MapsCompared scrub state-machine event.

The MapsCompared event, when it reaches the Scrubber, causes a state change into WaitDigestUpdate,
where we ... well ... wait for the digest update operations to conclude. This is signalled by
another FSM event: DigestUpdate.

This is how things should work. The problem is that the DigestUpdate is queued with a possibly
higher priority than the MapsCompared event (due to a recent change). This can result in the DigestUpdate
reaching us while we are still in WaitReplicas - triggering a cluster warning (but not causing any other
problem).

All this is evident in OSD3 logs:

2023-04-30T04:11:59.717+0000 7fcea8a2a640 15 osd.3 91 queue a scrub event (PGScrubMapsCompared(pgid=2.13epoch_queued=91 scrub-token=0)) for pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 luod=91'4991 crt=91'5016 lcod 91'4990 mlcod 91'4990 active+clean+scrubbing+deep [ 2.13:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]. Epoch: 91

...

2023-04-30T04:11:59.721+0000 7fcea4a22640 15 osd.3 91 queue a scrub event (PGScrubDigestUpdate(pgid=2.13epoch_queued=91 scrub-token=0)) for pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]. Epoch: 91

...

2023-04-30T04:11:59.725+0000 7fcea4a22640 10 osd.3 pg_epoch: 91 pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> digest_update_notification epoch: 91
...

2023-04-30T04:11:59.725+0000 7fcea8a2a640 10 osd.3 pg_epoch: 91 pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13:  REQ_SCRUB ]  MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> send_maps_compared epoch: 91
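
To make the ordering issue concrete, here is a minimal, self-contained boost::statechart sketch (not the actual Ceph scrubber FSM; the state and event names are only borrowed for illustration) showing how a DigestUpdate delivered before MapsCompared is merely logged and discarded while still in WaitReplicas:

// Minimal illustration of the event-ordering race described above.
// NOT the Ceph scrubber code; names are reused for readability only.
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/statechart/custom_reaction.hpp>
#include <boost/mpl/list.hpp>
#include <iostream>

namespace sc = boost::statechart;
namespace mpl = boost::mpl;

// FSM events (mirroring the scrubber events for illustration)
struct MapsCompared : sc::event<MapsCompared> {};
struct DigestUpdate : sc::event<DigestUpdate> {};

struct WaitReplicas;  // initial state
struct ScrubFsm : sc::state_machine<ScrubFsm, WaitReplicas> {};

struct WaitDigestUpdate;

// WaitReplicas expects MapsCompared first. If a DigestUpdate is delivered
// early (e.g. queued at a higher priority), it is only logged and discarded,
// which corresponds to the benign [WRN] seen in the cluster log.
struct WaitReplicas : sc::simple_state<WaitReplicas, ScrubFsm> {
  typedef mpl::list<
      sc::transition<MapsCompared, WaitDigestUpdate>,
      sc::custom_reaction<DigestUpdate>> reactions;

  sc::result react(const DigestUpdate&) {
    std::cout << "WaitReplicas::react(const DigestUpdate&): "
                 "Unexpected DigestUpdate event\n";
    return discard_event();  // harmless: the early event is simply dropped
  }
};

struct WaitDigestUpdate : sc::simple_state<WaitDigestUpdate, ScrubFsm> {
  WaitDigestUpdate() { std::cout << "-- state -->> WaitDigestUpdate\n"; }
};

int main() {
  ScrubFsm fsm;
  fsm.initiate();
  fsm.process_event(DigestUpdate());   // arrives "too early" -> warning
  fsm.process_event(MapsCompared());   // normal transition to WaitDigestUpdate
}

In boost::statechart terms, one way to tolerate such reordering would be to list sc::deferral<DigestUpdate> among WaitReplicas's reactions so the early event is replayed after the transition; that is only an illustration of the mechanism, not necessarily the approach taken in the fix.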

Actions #13

Updated by Radoslaw Zarzynski 12 months ago

  • Priority changed from Normal to Urgent

Raising to Urgent so we don't lose sight of it for Reef.

Actions #14

Updated by Neha Ojha 12 months ago

  • Status changed from In Progress to Fix Under Review
  • Assignee changed from Nitzan Mordechai to Ronen Friedman
  • Pull request ID set to 51426
Actions #15

Updated by Samuel Just 12 months ago

It's not obvious to me from the above why this started popping up in the last few weeks -- have you been able to identify the commit which introduced the regression?

Actions #16

Updated by Neha Ojha 11 months ago

  • Assignee changed from Ronen Friedman to Samuel Just
  • Pull request ID changed from 51426 to 51498
Actions #17

Updated by Laura Flores 11 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #18

Updated by Backport Bot 11 months ago

  • Copied to Backport #61345: reef: WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event added
Actions #19

Updated by Backport Bot 11 months ago

  • Tags set to backport_processed