Bug #13453

backfill pushes do not appear to increment the recovery rate counters

Added by Loïc Dachary over 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer,firefly(?)
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When ceph_manager.wait_for_recovery runs against

- thrashosds:
    timeout: 1200
    chance_pgnum_grow: 2
    chance_pgpnum_fix: 1

recovery stops making progress at some point. The problem does not show if the machines are fast enough to complete the job before the timeout.
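
For context, here is a minimal, hypothetical sketch of the kind of progress-watch loop described above (Python, like teuthology); it is not the actual ceph_manager code, and sample_recovered / is_recovered are illustrative placeholders for however the cluster state is sampled:

import time

def wait_for_recovery_sketch(sample_recovered, is_recovered, timeout=1200, poll=5):
    # sample_recovered(): cumulative recovery statistic expected to grow as recovery proceeds
    # is_recovered(): whether the cluster reports recovery as complete
    deadline = time.time() + timeout
    last = sample_recovered()
    while not is_recovered():
        if time.time() > deadline:
            raise RuntimeError('no recovery progress before the timeout')
        time.sleep(poll)
        current = sample_recovered()
        if current > last:
            print('making progress, resetting timeout')
            deadline = time.time() + timeout
        else:
            print('no progress seen, keeping timeout for now')
        last = current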
For instance in rados/thrash/{0-size-min-size-overrides/2-size-1-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/pggrow.yaml workloads/cache-agent-big.yaml} the workload is:
  - rados:
      clients:
      - client.0
      max_seconds: 1200
      objects: 6600
      op_weights:
        copy_from: 50
        delete: 50
        read: 100
        write: 100
      ops: 10000
      pools:
      - base
      size: 1024

and lasts 1200 seconds, while the timeout waiting for recovery to make progress is also 1200 seconds. In most cases some progress is observed and the recovery timeout is not triggered before the workload completes. But sometimes it is triggered, and that reveals the problem. The relevant log entries are:
2015-10-11T02:11:04.821 INFO:tasks.ceph.ceph_manager:waiting for recovery to complete
...
  36380:2015-10-11T02:24:54.897 INFO:tasks.ceph.ceph_manager:making progress, resetting timeout
  36567:2015-10-11T02:25:01.512 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now
...
2015-10-11T02:25:07.778 INFO:tasks.ceph.ceph_manager ... recovery 184/5876 objects degraded (3.131%)...
...
2015-10-11T02:45:02.844 INFO:tasks.rados.rados.0.target240173.stdout:6593: done (0 left)
2015-10-11T02:45:02.929 INFO:tasks.rados.rados.0.target240173.stderr:0 errors.
...
2015-10-11T02:45:12.012 INFO:tasks.ceph.ceph_manager ... recovery 2216/23749 objects degraded (9.331%)
...
2015-10-11T02:45:12.836 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now
2015-10-11T02:45:12.837 ERROR:teuthology.run_tasks:Manager failed: thrashosds

At 02:11 wait_for_recovery starts and sees progress until 02:24. After that point it no longer sees progress, and the timeout expires 20 minutes (1200 seconds) later at 02:45. During that interval, the number of degraded objects grows from 184 at 02:25 to 2216 at 02:45. Despite that, the rados workload completed successfully at 02:45. Given a little more time, wait_for_clean would probably have observed progress again and completed the job successfully.
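
The title points at the suspected root cause: if the progress statistic is derived from the recovery rate counters and backfill pushes never increment them, then periods where all remaining work is backfill look like no progress at all. A hypothetical illustration (the counter names below are made up for this sketch, not the real OSD perf counters):

def apparent_progress(prev, cur):
    # The watcher only looks at the recovery counter, so objects moved by
    # backfill pushes (which, per this bug, do not bump it) stay invisible.
    return cur['recovery_objects'] - prev['recovery_objects']

# Backfill moves 500 objects, but only the (hypothetical) backfill counter
# grows, so the computed progress is zero and the timeout keeps running.
prev = {'recovery_objects': 184, 'backfill_objects': 0}
cur = {'recovery_objects': 184, 'backfill_objects': 500}
assert apparent_progress(prev, cur) == 0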


Files

ceph-osd.4.log.gz (54.9 KB) ceph-osd.4.log.gz Loïc Dachary, 10/13/2015 11:12 PM