Bug #59049
WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
Description
/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203358
2023-03-11T10:28:47.074 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.263 INFO:teuthology.orchestra.run.smithi149.stdout:1678530338.9736485 osd.3 (osd.3) 104 : cluster [WRN] WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:28:47.263 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2023-03-11T10:28:47.264 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.280 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.338 DEBUG:teuthology.orchestra.run.smithi149:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v '\(POOL_APP_NOT_ENABLED\)' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(OBJECT_' | egrep -v '\(PG_' | egrep -v '\(SLOW_OPS\)' | egrep -v 'overall HEALTH' | egrep -v 'slow request' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-03-11T10:28:47.392 INFO:teuthology.orchestra.run.smithi149.stdout:1678530338.9736485 osd.3 (osd.3) 104 : cluster [WRN] WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203358/remote/smithi149/log/ceph-osd.3.log.gz
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258) queued
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258)> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubDigestUpdate(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 cost 52428800 e258) pg 0x55e1ee71d000
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] forward_scrub_event: DigestUpdate queued at: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> digest_update_notification epoch: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: FSM: WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:25:38.967+0000 7f90a9fb1640 0 log_channel(cluster) log [WRN] : WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event --<< digest_update_notification
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258) queued
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process 2.23 to_process <OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258)> waiting <> waiting_peering {}
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 op_wq(3) _process OpSchedulerItem(2.23 PGScrubMapsCompared(pgid=2.23epoch_queued=258 scrub-token=0) class_id 1 qos_cost 577 cost 52428800 e258) pg 0x55e1ee71d000
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] forward_scrub_event: MapsCompared queued at: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> send_maps_compared epoch: 258
2023-03-11T10:25:38.967+0000 7f90a9fb1640 20 scrubberFSM event: --vvvv---- MapsCompared
2023-03-11T10:25:38.967+0000 7f90a9fb1640 10 osd.3 pg_epoch: 258 pg[2.23( v 258'779 (258'754,258'779] local-lis/les=89/90 n=605 ec=13/13 lis/c=89/89 les/c/f=90/90/0 sis=89) [3,1] r=0 lpr=89 crt=258'779 lcod 258'778 mlcod 258'778 active+clean+scrubbing+deep [ 2.23: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitDigestUpdate>: FSM: -- state -->> Act/WaitDigestUpdate
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-03-10_15:57:59-rados-wip-yuri4-testing-2023-03-09-1458-reef-distro-default-smithi/7201721
This was scheduled on a different wip branch, but with https://github.com/ceph/ceph/pull/49443 as the suite branch (same as the example above). It also failed on Ubuntu 22.04 tests, so it could somehow be related.
Updated by Radoslaw Zarzynski about 1 year ago
IIRC you were looking into jammy-related issues. If you came to any other observations, it might be worth adding them here as well.
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-04-04_15:24:40-rados-wip-yuri4-testing-2023-03-31-1237-distro-default-smithi/7231129
Updated by Radoslaw Zarzynski about 1 year ago
- Assignee set to Nitzan Mordechai
Not a high priority; good opportunity to learn.
Updated by Nitzan Mordechai about 1 year ago
According to PR https://github.com/ceph/ceph/pull/44050, we can ignore that warning; I'll add it to the log ignore list.
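For illustration, the change would be a teuthology override along these lines (a hedged sketch; the exact suite file placement and regex are assumptions, only the log-ignorelist key is standard teuthology):

# illustrative teuthology suite fragment, not the actual patch
overrides:
  ceph:
    log-ignorelist:
      - 'WaitReplicas::react\(const DigestUpdate&\): Unexpected DigestUpdate event'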
Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from New to In Progress
Updated by Neha Ojha 12 months ago
Nitzan Mordechai wrote:
According to PR https://github.com/ceph/ceph/pull/44050, we can ignore that warning; I'll add it to the log ignore list.
Is there a PR for it already?
Updated by Nitzan Mordechai 12 months ago
Neha Ojha wrote:
Nitzan Mordechai wrote:
According to PR https://github.com/ceph/ceph/pull/44050, we can ignore that warning; I'll add it to the log ignore list.
Is there a PR for it already?
No, I'm still looking at the logs. There are a few comments in the mentioned PR explaining that those warnings are intentional.
Updated by Laura Flores 12 months ago
/a/lflores-2023-04-28_19:31:46-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7257789
/a/yuriw-2023-04-24_22:53:22-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7250370
/a/yuriw-2023-04-24_22:53:22-rados-wip-yuri10-testing-2023-04-18-0735-reef-distro-default-smithi/7250382
Updated by Laura Flores 12 months ago
/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253199
/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253562
Updated by Laura Flores 12 months ago
/a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255789
/a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255764
Updated by Ronen Friedman 12 months ago
This is an actual bug in the scrub code. Working with Nitzan, here is what we've found out (based on logs from /a/yuriw-2023-04-27_14:24:15-rados-wip-yuri6-testing-2023-04-26-1247-reef-distro-default-smithi/7255764):
The relevant part of the Scrubber state machine: WaitReplicas ----> WaitDigestUpdates
In the WaitReplicas state, once all replicas' maps have been received, we perform the sequence of operations that is the heart of "scrubbing": looking for problems in the scrub maps and fixing them. One of the first steps is sending (if needed) a set of digest writes, followed (again, if needed) by other write operations. When all fixes to the current chunk are done, we queue a MapsCompared scrub state-machine event.
The MapsCompared event, when it reaches the Scrubber, causes a state change into WaitDigestUpdates, where we ... well ... wait for the digest update operations to conclude. This is signalled by another FSM event: DigestUpdate.
This is how things should work. The problem is that DigestUpdate is queued with a possibly higher priority than the MapsCompared event (due to a recent change). This can result in DigestUpdate reaching the scrubber while it is still in WaitReplicas, triggering a cluster warning (but not causing any other problem).
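To make the ordering concrete, here is a minimal, self-contained sketch using boost::statechart (the library the scrubber FSM is built on). All names are simplified stand-ins rather than the actual Ceph code, and the deferral shown is only one possible way to tolerate the early event:

// Minimal boost::statechart sketch of the WaitReplicas -> WaitDigestUpdates
// race. Hypothetical names; not the actual Ceph scrubber FSM.
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/statechart/deferral.hpp>
#include <boost/statechart/custom_reaction.hpp>
#include <boost/mpl/list.hpp>
#include <iostream>

namespace sc = boost::statechart;
namespace mpl = boost::mpl;

struct MapsCompared : sc::event<MapsCompared> {};
struct DigestUpdate : sc::event<DigestUpdate> {};

struct WaitReplicas;
struct ScrubMachine : sc::state_machine<ScrubMachine, WaitReplicas> {};

struct WaitDigestUpdates;

struct WaitReplicas : sc::simple_state<WaitReplicas, ScrubMachine> {
  // Defer an early DigestUpdate instead of warning about it; deferred
  // events are replayed automatically after the next state transition.
  typedef mpl::list<
      sc::transition<MapsCompared, WaitDigestUpdates>,
      sc::deferral<DigestUpdate> >
      reactions;
};

struct WaitDigestUpdates : sc::simple_state<WaitDigestUpdates, ScrubMachine> {
  typedef sc::custom_reaction<DigestUpdate> reactions;
  sc::result react(const DigestUpdate&) {
    std::cout << "digest updates concluded; chunk done\n";
    return discard_event();
  }
};

int main() {
  ScrubMachine fsm;
  fsm.initiate();
  // Reproduce the reported ordering: DigestUpdate overtakes MapsCompared.
  fsm.process_event(DigestUpdate());  // deferred, not "unexpected"
  fsm.process_event(MapsCompared());  // transition; deferred event replayed
}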
All this is evident in OSD3 logs:
2023-04-30T04:11:59.717+0000 7fcea8a2a640 15 osd.3 91 queue a scrub event (PGScrubMapsCompared(pgid=2.13epoch_queued=91 scrub-token=0)) for pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 luod=91'4991 crt=91'5016 lcod 91'4990 mlcod 91'4990 active+clean+scrubbing+deep [ 2.13: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]. Epoch: 91
...
2023-04-30T04:11:59.721+0000 7fcea4a22640 15 osd.3 91 queue a scrub event (PGScrubDigestUpdate(pgid=2.13epoch_queued=91 scrub-token=0)) for pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB]. Epoch: 91
...
2023-04-30T04:11:59.725+0000 7fcea4a22640 10 osd.3 pg_epoch: 91 pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> digest_update_notification epoch: 91
...
2023-04-30T04:11:59.725+0000 7fcea8a2a640 10 osd.3 pg_epoch: 91 pg[2.13( v 91'5016 (91'4900,91'5016] local-lis/les=86/87 n=2842 ec=17/17 lis/c=86/86 les/c/f=87/87/0 sis=86) [3,2] r=0 lpr=86 crt=91'5016 lcod 91'5015 mlcod 91'5015 active+clean+scrubbing+deep [ 2.13: REQ_SCRUB ] MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber<Act/WaitReplicas>: scrubber event -->> send_maps_compared epoch: 91
Updated by Radoslaw Zarzynski 12 months ago
- Priority changed from Normal to Urgent
Raising to urgent so we don't lose sight of it for Reef.
Updated by Samuel Just 12 months ago
It's not obvious to me from the above why this started popping up in the last few weeks -- have you been able to identify the commit which introduced the regression?
Updated by Laura Flores 11 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 11 months ago
- Copied to Backport #61345: reef: WaitReplicas::react(const DigestUpdate&): Unexpected DigestUpdate event added