Bug #55009 (open): Scrubbing exits due to error reading object head

Added by Laura Flores about 2 years ago. Updated 2 months ago.

Status: Fix Under Review
Priority: High
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: rados/thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/short_pg_log 2-recovery-overrides/{more-partial-recovery} 3-scrub-overrides/{max-simultaneous-scrubs-2} backoff/peering ceph clusters/{fixed-2 openstack} crc-failures/default d-balancer/on mon_election/connectivity msgr-failures/osd-delay msgr/async-v2only objectstore/bluestore-comp-zstd rados supported-random-distro$/{ubuntu_latest} thrashers/default thrashosds-health workloads/cache-agent-big}

/a/yuriw-2022-03-19_14:37:23-rados-wip-yuri6-testing-2022-03-18-1104-distro-default-smithi/6747129

2022-03-19T19:53:07.393 INFO:tasks.ceph:Still waiting for all pgs to be scrubbed.
2022-03-19T19:53:10.548 INFO:tasks.ceph.osd.3.smithi123.stderr:2022-03-19T19:53:10.545+0000 7f2cf3076700 -1 osd.3 pg_epoch: 816 pg[4.0( v 816'11798 (106'8797,816'11798] local-lis/les=801/802 n=1002 ec=24/24 lis/c=801/801 les/c/f=802/802/0 sis=801) [3,7,0] r=0 lpr=801 crt=816'11798 lcod 816'11797 mlcod 816'11797 active+clean MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber <NotActive/>: handle_scrub_reserve_grant: received unsolicited reservation grant from osd 7 (0x56314438d080)
2022-03-19T19:53:13.480 INFO:tasks.ceph.osd.3.smithi123.stderr:2022-03-19T19:53:13.477+0000 7f2cf2074700 -1 osd.3 pg_epoch: 818 pg[3.2s0( v 640'1548 (0'0,640'1548] local-lis/les=771/772 n=1321 ec=21/21 lis/c=771/771 les/c/f=772/772/0 sis=771) [3,5,6,4]p3(0) r=0 lpr=771 crt=640'1548 lcod 637'1547 mlcod 0'0 active+clean MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber <NotActive/>: handle_scrub_reserve_grant: received unsolicited reservation grant from osd 6(2) (0x5631425969a0)
2022-03-19T19:53:13.481 INFO:tasks.ceph.osd.3.smithi123.stderr:2022-03-19T19:53:13.477+0000 7f2cf2074700 -1 osd.3 pg_epoch: 818 pg[3.2s0( v 640'1548 (0'0,640'1548] local-lis/les=771/772 n=1321 ec=21/21 lis/c=771/771 les/c/f=772/772/0 sis=771) [3,5,6,4]p3(0) r=0 lpr=771 crt=640'1548 lcod 637'1547 mlcod 0'0 active+clean MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber <NotActive/>: handle_scrub_reserve_grant: received unsolicited reservation grant from osd 4(3) (0x563143a1b8c0)
2022-03-19T19:53:14.429 INFO:tasks.ceph.osd.3.smithi123.stderr:2022-03-19T19:53:14.429+0000 7f2cf707e700 -1 osd.3 pg_epoch: 818 pg[4.0( v 816'11798 (106'8797,816'11798] local-lis/les=801/802 n=1002 ec=24/24 lis/c=801/801 les/c/f=802/802/0 sis=801) [3,7,0] r=0 lpr=801 crt=816'11798 lcod 816'11797 mlcod 816'11797 active+clean MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber <NotActive/>: handle_scrub_reserve_grant: received unsolicited reservation grant from osd 7 (0x5631436badc0)
2022-03-19T19:53:14.781 INFO:tasks.ceph.osd.2.smithi123.stderr:2022-03-19T19:53:14.777+0000 7f0ef4cca700 -1 osd.2 pg_epoch: 818 pg[2.1( empty local-lis/les=725/726 n=0 ec=16/16 lis/c=725/725 les/c/f=726/726/0 sis=725) [2,3,0] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber <NotActive/>: handle_scrub_reserve_grant: received unsolicited reservation grant from osd 3 (0x55b0c21691e0)
2022-03-19T19:53:16.483 INFO:tasks.ceph.osd.4.smithi150.stderr:2022-03-19T19:53:16.480+0000 7f0d5e64c700 -1 osd.4 pg_epoch: 818 pg[3.3s0( v 642'1500 (0'0,642'1500] local-lis/les=787/788 n=1242 ec=21/21 lis/c=787/787 les/c/f=788/788/0 sis=787) [4,5,0,6]p4(0) r=0 lpr=787 crt=642'1500 mlcod 0'0 active+clean MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber <NotActive/>: handle_scrub_reserve_grant: received unsolicited reservation grant from osd 6(3) (0x55b3ca33c000)

long pg dump...

2022-03-19T19:53:48.241 DEBUG:teuthology.orchestra.run.smithi123:> sudo ceph --cluster ceph config set global mon_health_to_clog false
2022-03-19T19:53:48.546 INFO:teuthology.misc:Shutting down mds daemons...
2022-03-19T19:53:48.547 INFO:teuthology.misc:Shutting down osd daemons...
2022-03-19T19:53:48.547 DEBUG:tasks.ceph.osd.0:waiting for process to exit
2022-03-19T19:53:48.547 INFO:teuthology.orchestra.run:waiting for 300

...

2022-03-19T19:53:55.366 INFO:tasks.ceph:Checking cluster log for badness...
2022-03-19T19:53:55.366 DEBUG:teuthology.orchestra.run.smithi123:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'but it is still running' | egrep -v 'objects unfound and apparently lost' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(POOL_' | egrep -v '\(CACHE_POOL_' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(OBJECT_' | egrep -v '\(SLOW_OPS\)' | egrep -v '\(REQUEST_SLOW\)' | egrep -v '\(TOO_FEW_PGS\)' | egrep -v 'slow request' | egrep -v 'must scrub before tier agent can activate' | head -n 1
2022-03-19T19:53:55.449 INFO:teuthology.orchestra.run.smithi123.stdout:2022-03-19T19:40:18.710528+0000 osd.5 (osd.5) 3 : cluster [ERR] Error -2 reading object 3:fbf04b8a:::smithi12323554-4688:head
2022-03-19T19:53:55.450 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2022-03-19T19:53:55.451 DEBUG:teuthology.orchestra.run.smithi123:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'but it is still running' | egrep -v 'objects unfound and apparently lost' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(POOL_' | egrep -v '\(CACHE_POOL_' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(OBJECT_' | egrep -v '\(SLOW_OPS\)' | egrep -v '\(REQUEST_SLOW\)' | egrep -v '\(TOO_FEW_PGS\)' | egrep -v 'slow request' | egrep -v 'must scrub before tier agent can activate' | head -n 1
2022-03-19T19:53:55.473 DEBUG:teuthology.orchestra.run.smithi123:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(OSD_SLOW_PING_TIME' | egrep -v 'but it is still running' | egrep -v 'objects unfound and apparently lost' | egrep -v 'overall HEALTH_' | egrep -v '\(OSDMAP_FLAGS\)' | egrep -v '\(OSD_' | egrep -v '\(PG_' | egrep -v '\(POOL_' | egrep -v '\(CACHE_POOL_' | egrep -v '\(SMALLER_PGP_NUM\)' | egrep -v '\(OBJECT_' | egrep -v '\(SLOW_OPS\)' | egrep -v '\(REQUEST_SLOW\)' | egrep -v '\(TOO_FEW_PGS\)' | egrep -v 'slow request' | egrep -v 'must scrub before tier agent can activate' | head -n 1
2022-03-19T19:53:55.488 INFO:teuthology.orchestra.run.smithi123.stdout:2022-03-19T19:40:18.710528+0000 osd.5 (osd.5) 3 : cluster [ERR] Error -2 reading object 3:fbf04b8a:::smithi12323554-4688:head


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #64017: osd/scrub: 'max_osd_scrub' default of '1' is too low (Resolved, Ronen Friedman)

Actions #1

Updated by Laura Flores about 2 years ago

  • Related to Bug #53342: Exiting scrub checking -- not all pgs scrubbed added
Actions #2

Updated by Radoslaw Zarzynski about 2 years ago

  • Related to deleted (Bug #53342: Exiting scrub checking -- not all pgs scrubbed)
Actions #3

Updated by Radoslaw Zarzynski about 2 years ago

I think we should check the OSD's log to verify the reason behind the ENOENT.
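
For reference, a minimal way to do that (a sketch, assuming the default Ceph log layout and osd.5 as the OSD that reported the error in the cluster log) would be:

grep -n -B5 -A5 'Error -2 reading object' /var/log/ceph/ceph-osd.5.log   # show the ENOENT with the surrounding scrub context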

Actions #4

Updated by Mark Holliman 12 months ago

I'm seeing what looks at least similar to this bug on a cluster running ceph version 16.2.10.

About a week ago we started getting a warning that pgs weren't being deep scrubbed in time. The number of impacted pgs continued to grow for a while, so we tried restarting all the OSD daemons. That caused the number to drop for a short time, but then it began to rise again and the deep scrubs continued to fail. I then issued the 'pg repair' command against a long list of the impacted pgs, which cleared some of them, but the rest still never finish their deep scrubs.
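
For context, the steps described above amount to commands along these lines (the pg id 6.d26 is taken from the log excerpt below and is only illustrative):

ceph health detail      # lists pgs flagged as not deep-scrubbed in time
ceph pg repair 6.d26    # request a repair scrub of one affected pg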

The OSD logs show:

May 03 14:27:27 sv-hdd-12-3 bash574619: debug 2023-05-03T14:27:27.280+0000 7fd801486700 -1 osd.156 pg_epoch: 222474 pg[6.d26( v 222472'6770711 (219524'6768728,222472'6770711] local-lis/les=220894/220895 n=10765 ec=91922/63312 lis/c=220894/220894 les/c/f=220895/220895/71442 sis=220894) [156,139,161] r=0 lpr=220894 crt=222472'6770711 lcod 222472'6770710 mlcod 222472'6770710 active+clean MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber <NotActive/>: handle_scrub_reserve_grant: received unsolicited reservation grant from osd 139 (0x55baacfbfce0)
May 03 14:27:37 sv-hdd-12-3 bash574619: debug 2023-05-03T14:27:37.228+0000 7fd801486700 -1 osd.156 pg_epoch: 222474 pg[6.d26( v 222472'6770711 (219524'6768728,222472'6770711] local-lis/les=220894/220895 n=10765 ec=91922/63312 lis/c=220894/220894 les/c/f=220895/220895/71442 sis=220894) [156,139,161] r=0 lpr=220894 crt=222472'6770711 lcod 222472'6770710 mlcod 222472'6770710 active+clean MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] scrubber <NotActive/>: handle_scrub_reserve_grant: received unsolicited reservation grant from osd 139 (0x55baa7bd6f20)

These messages run on indefinitely. I've tried restarting all the RGW and OSD daemons on the cluster, but the errors start up again as soon as deep scrub ops begin. I've also tried turning up the logging on the OSDs, but I can't see anything obvious showing why the deep scrubs are failing.
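
A note on "turning up the logging": that is typically done with something like

ceph config set osd debug_osd 10             # cluster-wide default
ceph tell osd.156 config set debug_osd 10    # or just one running daemon (osd id illustrative)

The excerpts in the next comment were captured at debug level 10.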

Any thoughts on what to check next?

Actions #5

Updated by Laura Flores 12 months ago

  • Assignee set to Ronen Friedman

Ronen, assigning to you in case you have any ideas.

Actions #6

Updated by Piotr Parczewski 12 months ago

In case it's useful, here is a more detailed log (captured with debug level 10) from the same environment Mark mentioned:

May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3f014de700 10 osd.102 229871 sched_scrub scheduled a scrub! (~6.1730~)
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3f014de700 10 osd.102 229871 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; target 25 obj/sec or 5 MiB/sec
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3f014de700 10 osd.102 229871 promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] scrubber <NotActive/>: scrubber event -->> StartScrub epoch: 229871
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] scrubber <NotActive/>: reset_epoch state deep? 0
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] scrubber <NotActive/>: update_op_mode_text: repair: visible: false, internal: false. Displayed: scrub
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] scrubber <>: FSM: -- state -->> ReservingReplicas
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] scrubber <>: reserve_replicas
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3ee8a90700 10 osd.102 ep: 229871 scrubber::ReplicaReservations pg[6.1730]: ReplicaReservations: reserve 50
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3ee8a90700 10 osd.102 ep: 229871 scrubber::ReplicaReservations pg[6.1730]: ReplicaReservations: reserve 166
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.977+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] scrubber <ReservingReplicas/>: scrubber event --<< StartScrub
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 229871 dequeue_op 0x5646ca241ce0 prio 127 cost 0 latency 0.000021 MOSDScrubReserve(6.1730 REJECT e229871) v1 pg pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ]
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] _handle_message: 0x5646ca241ce0
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] scrubber <ReservingReplicas/>: handle_scrub_reserve_reject MOSDScrubReserve(6.1730 REJECT e229871) v1
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 ep: 229871 scrubber::ReplicaReservations pg[6.1730]: handle_reserve_reject: rejected by 166
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 ep: 229871 scrubber::ReplicaReservations pg[6.1730]: handle_reserve_reject: osd.166 scrub reserve = fail
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 229871 dequeue_op 0x5646ca241ce0 finish
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] scrubber <ReservingReplicas/>: scrubber event -->> send_reservation_failure epoch: 229871
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] scrubber <ReservingReplicas/>: FSM: ReservingReplicas::react(const ReservationFailure&)
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean+scrubbing [ 6.1730:  ] ] scrubber <ReservingReplicas/>: clear_pgscrub_state
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean] scrubber <ReservingReplicas/>: clear_scrub_reservations
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean] scrubber <ReservingReplicas/>: reset_internal_state
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean] scrubber <>: FSM: -- state -->> NotActive
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean] scrubber <NotActive/>: scrubber event --<< send_reservation_failure
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 229871 dequeue_op 0x5646961c8420 prio 127 cost 0 latency 0.000014 MOSDScrubReserve(6.1730 REJECT e229871) v1 pg pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean]
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean] _handle_message: 0x5646961c8420
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 pg_epoch: 229871 pg[6.1730( v 229863'6802493 (223616'6800408,229863'6802493] local-lis/les=220878/220879 n=11251 ec=91939/63312 lis/c=220878/220878 les/c/f=220879/220879/71442 sis=220878) [102,50,166] r=0 lpr=220878 crt=229863'6802493 lcod 229863'6802492 mlcod 229863'6802492 active+clean] scrubber <NotActive/>: handle_scrub_reserve_reject MOSDScrubReserve(6.1730 REJECT e229871) v1
May 08 06:51:32 sv-hdd-12-1 bash[432401]: debug 2023-05-08T06:51:32.978+0000 7f3ee8a90700 10 osd.102 229871 dequeue_op 0x5646961c8420 finish

Actions #7

Updated by Radoslaw Zarzynski 12 months ago

  • Priority changed from Normal to High

Sounds like scrubbing could get blocked entirely.

Actions #8

Updated by Ronen Friedman 12 months ago

Mark Holliman:
As a temporary step, I'd suggest increasing the osd_max_scrubs configuration parameter.
For a fuller solution: if possible, can you share debug logs with me?
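
For reference, applying that suggestion cluster-wide is a one-liner (the value 3 is only an example):

ceph config set osd osd_max_scrubs 3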

email:

Actions #9

Updated by Radoslaw Zarzynski 11 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 51669

Ronen, is it fine to treat 51669 as a fix for this?

Actions #10

Updated by Ronen Friedman 3 months ago

Radoslaw Zarzynski wrote:

Ronen, is it fine to treat 51669 as a fix for this?

It would surely help in mitigating the effect of whatever caused the problem.

Actions #11

Updated by Laura Flores 2 months ago

  • Status changed from Fix Under Review to Resolved
Actions #12

Updated by Laura Flores 2 months ago

  • Status changed from Resolved to Pending Backport
Actions #13

Updated by Radoslaw Zarzynski 2 months ago

  • Status changed from Pending Backport to Fix Under Review
Actions #14

Updated by Radoslaw Zarzynski 2 months ago

  • Related to Bug #64017: osd/scrub: 'max_osd_scrub' default of '1' is too low added
Actions #15

Updated by Radoslaw Zarzynski 2 months ago

The mentioned PR solves https://tracker.ceph.com/issues/64017, so it's at least related to this bug. Maybe we should mark this one as a duplicate of it.

Actions #16

Updated by Radoslaw Zarzynski 2 months ago

Scrub note: talked with Ronen; might be a duplicate, but we cannot be sure. Let's hunt for a recurrence!
