Bug #59049
openWaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
Description
/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203358
2023-03-11T10:28:47.074 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.263 INFO:teuthology.orchestra.run.smithi149.stdout:1678530338.9736485 osd.3 (osd.3) 104 : cluster [WRN] WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:28:47.263 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2023-03-11T10:28:47.264 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.280 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.338 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.392 INFO:teuthology.orchestra.run.smithi149.stdout:1678530338.9736485 osd.3 (osd.3) 104 : cluster [WRN] WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203358/remote/smithi149/log/ceph-osd.3.log.gz
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258) queued
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258)> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258) pg 0x55e1ee71d000
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] forward_scrub_event: DigestUpdate queued at: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> digest_update_notification epoch: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: FSM: WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:25:38.967+0000 7f90a9fb1640 0 log_channel(cluster) log [WRN] : WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event --<< digest_update_notification
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258) queued
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258)> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258) pg 0x55e1ee71d000
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] forward_scrub_event: MapsCompared queued at: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> send_maps_compared epoch: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 scrubberFSM event: --vvvv---- MapsCompared
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitDigestUpdate>: FSM: -- state -->> Act/WaitDigestUpdate
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-03-10_15:57:59-rados-wip-yuri4-testing-2023-03-09-1458-reef-distro-default-smithi/7201721
This was scheduled on a different wip, but with https://github.com/ceph/ceph/pull/49443 as the suite branch (same as the example above). It also failed on Ubuntu 22.04 tests. I'm thinking it could somehow be related.
Updated by Radoslaw Zarzynski about 1 year ago
IIRC you were looking into jammy things. If you came across any other observations, it might be worth adding them here as well.
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-04-04_15:24:40-rados-wip-yuri4-testing-2023-03-31-1237-distro-default-smithi/7231129
Updated by Radoslaw Zarzynski about 1 year ago
- Assignee set to Nitzan Mordechai
Not a high priority; good opportunity to learn.
Updated by Nitzan Mordechai about 1 year ago
According to PR https://github.com/ceph/ceph/pull/44050, we can ignore that warning; I'll add it to the log ignore list.
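For reference, silencing a message like this in teuthology means adding it to the suite's log ignore list. A sketch of what that might look like (the exact yaml file and placement vary by suite, so treat this as illustrative):

```yaml
log-ignorelist:
  - 'WaitReplicas::react\(const DigestUpdate&\): Unexpected DigestUpdate event'
```

The entries are regular expressions matched against the cluster log, hence the escaped parentheses.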
Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from New to In Progress
Updated by Neha Ojha about 1 year ago
Nitzan Mordechai wrote:
According to PR https://github.com/ceph/ceph/pull/44050, we can ignore that warning; I'll add it to the log ignore list.
Is there a PR for it already?
Updated by Nitzan Mordechai about 1 year ago
Neha Ojha wrote:
Nitzan Mordechai wrote:
According to PR https://github.com/ceph/ceph/pull/44050, we can ignore that warning; I'll add it to the log ignore list.
Is there a PR for it already?
No, I'm still looking at the logs. There are a few comments in the mentioned PR explaining that those warnings are intentional.
Updated by Laura Flores about 1 year ago
/a/lflores-2023-04-28_19:31:46-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7257789
/a/yuriw-2023-04-24_22:53:22-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7250370
/a/yuriw-2023-04-24_22:53:22-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7250382
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253199
/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253562
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255789
/a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255764
Updated by Ronen Friedman about 1 year ago
This is an actual bug in the scrub code. Working with Nitzan, here is what we've found out
(based on logs from /a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255764):
The relevant part of the Scrubber state machine: WaitReplicas ----> WaitDigestUpdate
In the WaitReplicas state, once all replicas' maps have been received, we perform the sequence
of operations that is the heart of "scrubbing": looking for problems in the scrub maps, and fixing them.
One of the first steps performed is sending (if needed) a set of digest writes. We then follow
with (again, if needed) other write operations. When all fixes to the current chunk are done, we
queue a MapsCompared scrub state-machine event.
The MapsCompared event, when it reaches the Scrubber, causes a state change into WaitDigestUpdate,
where we ... well ... wait for the digest update operations to conclude. This is signalled by
another FSM event: DigestUpdate.
This is how things should work. The problem is that DigestUpdate is queued with a possibly
higher priority than the MapsCompared event (due to a recent change). This can result in DigestUpdate
reaching us while still in WaitReplicas, triggering a cluster warning (but not causing any other
problem).
All this is evident in osd.3's logs:
2023-04-30T04:11:59.717+0000 7fcea8a2a640 15 osd.3 91 queue a scrub event (PGScrubMapsCompared(pgid=2.13epoch_queued=91 scrub-token=0)) for pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 luod=91'4991 crt=91'5016 lcod 91'4990 mlcod 91'4990 active+clean+scrubbing+deep [ 2.13: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]. Epoch: 91
...
2023-04-30T04:11:59.721+0000 7fcea4a22640 15 osd.3 91 queue a scrub event (PGScrubDigestUpdate(pgid=2.13epoch_queued=91 scrub-token=0)) for pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]. Epoch: 91
...
2023-04-30T04:11:59.725+0000 7fcea4a22640 10 osd.3 pg_epoch: 91 pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> digest_update_notification epoch: 91
...
2023-04-30T04:11:59.725+0000 7fcea8a2a640 10 osd.3 pg_epoch: 91 pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> send_maps_compared epoch: 91
Updated by Radoslaw Zarzynski about 1 year ago
- Priority changed from Normal to Urgent
Raising to urgent so we don't lose sight of it for Reef.
Updated by Neha Ojha about 1 year ago
- Status changed from In Progress to Fix Under Review
- Assignee changed from Nitzan Mordechai to Ronen Friedman
- Pull request ID set to 51426
Updated by Samuel Just about 1 year ago
It's not obvious to me from the above why this started popping up in the last few weeks -- have you been able to identify the commit which introduced the regression?
Updated by Laura Flores 12 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 12 months ago
- Copied to Backport #61345: reef: WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event added