Bug #13453


backfill pushes do not appear to increment the recovery rate counters

Added by Loïc Dachary over 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
hammer,firefly(?)
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When ceph_manager.wait_for_recovery runs against

- thrashosds:
    timeout: 1200
    chance_pgnum_grow: 2
    chance_pgpnum_fix: 1

recovery stops making progress at some point. The problem does not show if the machine is fast enough to complete the job before the timeout.
For instance in rados/thrash/{0-size-min-size-overrides/2-size-1-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/pggrow.yaml workloads/cache-agent-big.yaml} the workload is:
  - rados:
      clients:
      - client.0
      max_seconds: 1200
      objects: 6600
      op_weights:
        copy_from: 50
        delete: 50
        read: 100
        write: 100
      ops: 10000
      pools:
      - base
      size: 1024

and lasts 1200 seconds, while the timeout waiting for recovery to make progress is also 1200 seconds. In most cases some progress is made and the recovery timeout is not triggered before the workload completes. But sometimes it is, and that reveals the problem. The relevant log entries are:
2015-10-11T02:11:04.821 INFO:tasks.ceph.ceph_manager:waiting for recovery to complete
...
  36380:2015-10-11T02:24:54.897 INFO:tasks.ceph.ceph_manager:making progress, resetting timeout
  36567:2015-10-11T02:25:01.512 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now
...
2015-10-11T02:25:07.778 INFO:tasks.ceph.ceph_manager ... recovery 184/5876 objects degraded (3.131%)...
...
2015-10-11T02:45:02.844 INFO:tasks.rados.rados.0.target240173.stdout:6593: done (0 left)
2015-10-11T02:45:02.929 INFO:tasks.rados.rados.0.target240173.stderr:0 errors.
...
2015-10-11T02:45:12.012 INFO:tasks.ceph.ceph_manager ... recovery 2216/23749 objects degraded (9.331%)
...
2015-10-11T02:45:12.836 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now
2015-10-11T02:45:12.837 ERROR:teuthology.run_tasks:Manager failed: thrashosds

At 02:11 wait_for_recovery starts and sees progress until 02:24. After that it no longer sees progress, until the timeout expires 20 minutes later (1200 seconds) at 02:45. During that interval the number of degraded objects grows from 184 at 02:25 to 2216 at 02:45. Despite that, the rados workload completed successfully at 02:45. Given a little more time, wait_for_clean would probably have observed progress again and the job would have completed successfully.
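
For reference, the progress check in wait_for_recovery works roughly like the sketch below (simplified Python with made-up helper names, not the actual teuthology code): the timeout is only reset when the recovery counters move, so if backfill pushes do not increment them, a long backfill looks like a stall.

import time

def wait_for_recovery(manager, timeout=1200):
    # simplified sketch; get_num_active_recovered/is_recovered are
    # hypothetical stand-ins for the real ceph_manager helpers
    last_progress = time.time()
    recovered = manager.get_num_active_recovered()
    while not manager.is_recovered():
        now = time.time()
        current = manager.get_num_active_recovered()
        if current > recovered:
            # counters advanced: making progress, reset the timeout
            last_progress = now
            recovered = current
        elif now - last_progress > timeout:
            # counters flat for `timeout` seconds: give up
            raise AssertionError('failed to recover before timeout expired')
        time.sleep(3)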


Files

ceph-osd.4.log.gz (54.9 KB), Loïc Dachary, 10/13/2015 11:12 PM
Actions #1

Updated by Loïc Dachary over 8 years ago

  • Description updated (diff)
Actions #2

Updated by Loïc Dachary over 8 years ago

@Samuel Hassine: I'm not sure which way to go with this bug (assuming my reading is correct). It could be considered a bug that recovery stops making progress while the workload is running. In that case the right fix would be to make the failure show every time instead of only sometimes, by raising the workload execution time from 1200 to 3600 seconds so that it always outlasts the recovery timeout. If, however, it is acceptable for recovery to make no progress while the workload runs, the fix would be to make the wait_for_recovery timeout much larger than the workload execution time, so that recovery has the time it needs to complete once the workload finishes.

Actions #3

Updated by Nathan Cutler over 8 years ago

@Loïc Dachary: Could this be showing up in the firefly upgrade suite, too? This test has very similar log messages to the ones in this issue: http://pulpito.ceph.com/loic-2015-09-27_01:05:46-upgrade:firefly-firefly-backports---basic-vps/1071164/

@Nathan Weinberg It does look similar indeed. The operations involved are even more complicated, and the reasons why recovery would stall are even harder to figure out in your case.

2015-09-28T13:21:19.917 INFO:tasks.thrashosds.ceph_manager ... u'recovering_bytes_per_sec': 27793865 ...
...
2015-09-28T13:41:40.653 ERROR:teuthology.run_tasks:Saw exception from tasks.
AssertionError: failed to recover before timeout expired
...

Actions #5

Updated by Samuel Just over 8 years ago

  • Project changed from 23 to Ceph
  • Subject changed from thrashosd: ceph_manager.wait_for_recovery races against thrashosds pggrow to backfill pushes do not appear to increment the recovery rate counters
  • Priority changed from Normal to Urgent

It seems that neither ReplicatedBackend nor ECBackend increments the num_recovered counter and friends when not doing a local recovery. Currently on_local_recover gets passed a stat delta used to update the stats. That delta should probably be passed to on_global_recover instead, so that the counters are updated even if the local copy did not need to be recovered.

This tends to cause very slow backfills to erroneously trip the recovery timeout on thrasher tests.
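
For illustration, the bookkeeping change amounts to something like the sketch below (abstract Python, not the actual C++ in ReplicatedBackend/ECBackend): the per-object stat delta is applied when the object is globally recovered, so a backfill push that never triggers a local recovery still moves the counters that wait_for_recovery watches.

class RecoveryStats(object):
    # abstract model of the recovery counters ("num_recovered and friends")
    def __init__(self):
        self.objects_recovered = 0
        self.bytes_recovered = 0
        self.keys_recovered = 0

    def on_global_recover(self, stat_delta):
        # called once per object, when every replica/shard has it;
        # previously the delta was only applied on *local* recovery
        self.objects_recovered += stat_delta['objects_recovered']
        self.bytes_recovered += stat_delta['bytes_recovered']
        self.keys_recovered += stat_delta['keys_recovered']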

Actions #6

Updated by Samuel Just over 8 years ago

  • Backport set to hammer,firefly(?)

Not sure about backport appropriateness.

Actions #7

Updated by Samuel Just over 8 years ago

I think that on_global_recover should get a stat_delta with

objects_recovered = 1
bytes_recovered = object_size * num_replicas_recovered * (1/K for ec?)
keys_recovered = keys_in_object * num_replicas_recovered
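
As a rough illustration of that delta (Python, with the same open question about the EC factor as the "?" above):

def recovery_stat_delta(object_size, keys_in_object, num_replicas_recovered, ec_k=None):
    # per-object delta as proposed above; for EC each recovered shard
    # holds roughly object_size / K bytes, hence the 1/K factor
    bytes_per_copy = object_size if ec_k is None else object_size / ec_k
    return {
        'objects_recovered': 1,
        'bytes_recovered': bytes_per_copy * num_replicas_recovered,
        'keys_recovered': keys_in_object * num_replicas_recovered,
    }

# a 1 MiB object pushed to 2 replicas: 2 MiB counted as recovered
# the same object in a k=2 EC pool, 2 shards pushed: 1 MiB counted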

Actions #8

Updated by Loïc Dachary over 8 years ago

  • Status changed from New to 12
  • Assignee set to Loïc Dachary
Actions #10

Updated by Loïc Dachary over 8 years ago

Attaching part of the osd.4 log, right in the middle of what was mistaken for a lack of recovery progress. The recovery-related messages in it help navigate the code.

Actions #11

Updated by Loïc Dachary over 8 years ago

  • Status changed from 12 to Fix Under Review
Actions #12

Updated by Sage Weil over 8 years ago

  • Status changed from Fix Under Review to In Progress
Actions #13

Updated by Loïc Dachary over 8 years ago

Sam included it yesterday in his testing branch.

Actions #14

Updated by Loïc Dachary over 8 years ago

  • Assignee changed from Loïc Dachary to Samuel Just
Actions #15

Updated by Samuel Just about 8 years ago

  • Status changed from In Progress to Resolved

b04e17687fcb4609a8c6c3623bd0786116733465
