Bug #48029

open

Exiting scrub checking -- not all pgs scrubbed.

Added by Neha Ojha over 3 years ago. Updated almost 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-10-28T09:09:10.147 INFO:tasks.ceph:pgid 1.7 last_scrub_stamp 2020-10-28T08:43:48.710032+0000 time.struct_time(tm_year=2020, tm_mon=10, tm_mday=28, tm_hour=8, tm_min=43, tm_sec=48, tm_wday=2, tm_yday=302, tm_isdst=-1) <= time.struct_time(tm_year=2020, tm_mon=10, tm_mday=28, tm_hour=8, tm_min=54, tm_sec=24, tm_wday=2, tm_yday=302, tm_isdst=0)
2020-10-28T09:09:10.147 INFO:tasks.ceph:pgid 2.3 last_scrub_stamp 2020-10-28T08:43:55.800877+0000 time.struct_time(tm_year=2020, tm_mon=10, tm_mday=28, tm_hour=8, tm_min=43, tm_sec=55, tm_wday=2, tm_yday=302, tm_isdst=-1) <= time.struct_time(tm_year=2020, tm_mon=10, tm_mday=28, tm_hour=8, tm_min=54, tm_sec=24, tm_wday=2, tm_yday=302, tm_isdst=0)
2020-10-28T09:09:10.148 INFO:tasks.ceph:pgid 1.3 last_scrub_stamp 2020-10-28T08:43:48.710032+0000 time.struct_time(tm_year=2020, tm_mon=10, tm_mday=28, tm_hour=8, tm_min=43, tm_sec=48, tm_wday=2, tm_yday=302, tm_isdst=-1) <= time.struct_time(tm_year=2020, tm_mon=10, tm_mday=28, tm_hour=8, tm_min=54, tm_sec=24, tm_wday=2, tm_yday=302, tm_isdst=0)
2020-10-28T09:09:10.149 INFO:tasks.ceph:pgid 2.d last_scrub_stamp 2020-10-28T08:43:55.800877+0000 time.struct_time(tm_year=2020, tm_mon=10, tm_mday=28, tm_hour=8, tm_min=43, tm_sec=55, tm_wday=2, tm_yday=302, tm_isdst=-1) <= time.struct_time(tm_year=2020, tm_mon=10, tm_mday=28, tm_hour=8, tm_min=54, tm_sec=24, tm_wday=2, tm_yday=302, tm_isdst=0)
2020-10-28T09:09:10.150 INFO:tasks.ceph:pgid 2.17 last_scrub_stamp 2020-10-28T08:43:55.800877+0000 time.struct_time(tm_year=2020, tm_mon=10, tm_mday=28, tm_hour=8, tm_min=43, tm_sec=55, tm_wday=2, tm_yday=302, tm_isdst=-1) <= time.struct_time(tm_year=2020, tm_mon=10, tm_mday=28, tm_hour=8, tm_min=54, tm_sec=24, tm_wday=2, tm_yday=302, tm_isdst=0)
2020-10-28T09:09:10.150 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/git.ceph.com_ceph_master/qa/tasks/ceph.py", line 1875, in task
    osd_scrub_pgs(ctx, config)
  File "/home/teuthworker/src/git.ceph.com_ceph_master/qa/tasks/ceph.py", line 1277, in osd_scrub_pgs
    raise RuntimeError('Exiting scrub checking -- not all pgs scrubbed.')

/a/teuthology-2020-10-28_07:01:02-rados-master-distro-basic-smithi/5567239

Usually we see this failure when some PGs are not active+clean, but here all PGs are active+clean.
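
For reference, the check that emits the log lines above parses each PG's last_scrub_stamp and compares it, as a time.struct_time, against the time captured when the wait began; a stamp that is not strictly newer means the PG has not been scrubbed since then. Below is a minimal, self-contained sketch of that comparison. The stamp and check time are illustrative values reconstructed from the log, not code taken from the task:

import time

# Stamp format as reported in pg stats (illustrative value copied
# from the log above).
last_scrub_stamp = '2020-10-28T08:43:48.710032+0000'

# In the task, check_time_now is captured once before scrubs are
# requested; here it is reconstructed from the log for illustration.
check_time_now = time.strptime('2020-10-28T08:54:24', '%Y-%m-%dT%H:%M:%S')

# Drop fractional seconds and timezone, as osd_scrub_pgs() does.
t = last_scrub_stamp[:last_scrub_stamp.find('.')].replace(' ', 'T')
pgtm = time.strptime(t, '%Y-%m-%dT%H:%M:%S')

# struct_time compares like a tuple (year, mon, mday, hour, ...),
# so this is a chronological comparison at one-second resolution.
if not pgtm > check_time_now:
    print('pg not scrubbed since the check began')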

Actions #1

Updated by Neha Ojha over 3 years ago

rados/singleton-nomsgr/{all/osd_stale_reads mon_election/connectivity rados supported-random-distro$/{ubuntu_latest}} - same test as before

/a/teuthology-2020-11-04_07:01:02-rados-master-distro-basic-smithi/5590002

Actions #2

Updated by Laura Flores almost 2 years ago

  • Backport set to pacific

/a/yuriw-2022-06-22_22:13:20-rados-wip-yuri3-testing-2022-06-22-1121-pacific-distro-default-smithi/6892691

Description: rados/singleton-nomsgr/{all/osd_stale_reads mon_election/classic rados supported-random-distro$/{centos_8}}

Actions #3

Updated by Radoslaw Zarzynski almost 2 years ago

The code that generated the exception is (from the main branch):

def osd_scrub_pgs(ctx, config):
    # ...
    while loop:
        stats = manager.get_pg_stats()
        timez = [(stat['pgid'],stat['last_scrub_stamp']) for stat in stats]
        loop = False
        thiscnt = 0
        re_scrub = []
        for (pgid, tmval) in timez:
            t = tmval[0:tmval.find('.')].replace(' ', 'T')
            pgtm = time.strptime(t, '%Y-%m-%dT%H:%M:%S')
            if pgtm > check_time_now:
                thiscnt += 1
            else:
                log.info('pgid %s last_scrub_stamp %s %s <= %s', pgid, tmval, pgtm, check_time_now)
                loop = True
                re_scrub.append(pgid)
        if thiscnt > prev_good:
            prev_good = thiscnt
            gap_cnt = 0
        else:
            gap_cnt += 1
            if gap_cnt % 6 == 0:
                for pgid in re_scrub:
                    # re-request scrub every so often in case the earlier
                    # request was missed.  do not do it every time because
                    # the scrub may be in progress or not reported yet and
                    # we will starve progress.
                    manager.raw_cluster_cmd('pg', 'deep-scrub', pgid)
            if gap_cnt > retries:
                raise RuntimeError('Exiting scrub checking -- not all pgs scrubbed.')
        if loop:
            log.info('Still waiting for all pgs to be scrubbed.')
            time.sleep(delays)

So the request to schedule a deep scrub was somehow ignored.
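
For anyone trying to reproduce this outside teuthology, the same re-request cycle can be sketched against a live cluster with the ceph CLI. This is only an illustrative sketch, not the task's code: the helper name is made up, and the exact JSON layout of 'ceph pg dump' varies across releases.

import json
import subprocess
import time

def rescrub_stale_pgs(since):
    # Hypothetical helper: re-request a deep scrub for every PG whose
    # last_scrub_stamp is not newer than 'since' (a time.struct_time),
    # mirroring what the task does on every 6th retry. Assumes a
    # 'ceph' CLI on PATH and a JSON layout with a top-level
    # 'pg_stats' array.
    dump = json.loads(subprocess.check_output(
        ['ceph', 'pg', 'dump', 'pgs', '--format=json']))
    for stat in dump['pg_stats']:
        stamp = stat['last_scrub_stamp']
        pgtm = time.strptime(stamp[:stamp.find('.')].replace(' ', 'T'),
                             '%Y-%m-%dT%H:%M:%S')
        if not pgtm > since:
            # Same command the task issues via raw_cluster_cmd().
            subprocess.check_call(['ceph', 'pg', 'deep-scrub', stat['pgid']])

If the deep-scrub request is indeed being dropped, running something like this in a loop while watching the scrub stamps would show whether the OSD ever schedules the scrub at all.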

