Bug #13453


backfill pushes do not appear to increment the recovery rate counters

Added by Loïc Dachary over 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
hammer,firefly(?)
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When ceph_manager.wait_for_recovery runs against

- thrashosds:
    timeout: 1200
    chance_pgnum_grow: 2
    chance_pgpnum_fix: 1

recovery stops making progress at some point. The problem does not show if the machine is fast enough to complete the job before the timeout.
For instance in rados/thrash/{0-size-min-size-overrides/2-size-1-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/pggrow.yaml workloads/cache-agent-big.yaml} the workload is:
  - rados:
      clients:
      - client.0
      max_seconds: 1200
      objects: 6600
      op_weights:
        copy_from: 50
        delete: 50
        read: 100
        write: 100
      ops: 10000
      pools:
      - base
      size: 1024

and lasts 1200 seconds, while the timeout waiting for recovery to make progress is also 1200 seconds. In most cases some progress is made and the recovery timeout is not triggered before the workload completes. But sometimes it is, and that reveals the problem. The relevant log entries are:
2015-10-11T02:11:04.821 INFO:tasks.ceph.ceph_manager:waiting for recovery to complete
...
  36380:2015-10-11T02:24:54.897 INFO:tasks.ceph.ceph_manager:making progress, resetting timeout
  36567:2015-10-11T02:25:01.512 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now
...
2015-10-11T02:25:07.778 INFO:tasks.ceph.ceph_manager ... recovery 184/5876 objects degraded (3.131%)...
...
2015-10-11T02:45:02.844 INFO:tasks.rados.rados.0.target240173.stdout:6593: done (0 left)
2015-10-11T02:45:02.929 INFO:tasks.rados.rados.0.target240173.stderr:0 errors.
...
2015-10-11T02:45:12.012 INFO:tasks.ceph.ceph_manager ... recovery 2216/23749 objects degraded (9.331%)
...
2015-10-11T02:45:12.836 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now
2015-10-11T02:45:12.837 ERROR:teuthology.run_tasks:Manager failed: thrashosds

At 02:11 wait_for_recovery starts and sees progress until 02:24. After that it no longer sees progress, until the timeout expires 20 minutes later (1200 seconds) at 02:45. During that interval the number of degraded objects grows from 184 at 02:25 to 2216 at 02:45. Despite that, the rados workload completed successfully at 02:45. Given a little more time, wait_for_clean would probably have observed progress again and the job would have completed successfully.
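
For reference, the progress check in wait_for_recovery works roughly like the sketch below (simplified Python with made-up helper names, not the actual teuthology code): the timeout is only reset when the recovery counters move, so if backfill pushes do not increment them, a long backfill looks like a stall.

import time

def wait_for_recovery(manager, timeout=1200):
    # simplified sketch; get_num_active_recovered/is_recovered are
    # hypothetical stand-ins for the real ceph_manager helpers
    last_progress = time.time()
    recovered = manager.get_num_active_recovered()
    while not manager.is_recovered():
        now = time.time()
        current = manager.get_num_active_recovered()
        if current > recovered:
            # counters advanced: making progress, reset the timeout
            last_progress = now
            recovered = current
        elif now - last_progress > timeout:
            # counters flat for `timeout` seconds: give up
            raise AssertionError('failed to recover before timeout expired')
        time.sleep(3)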


Files

ceph-osd.4.log.gz (54.9 KB), Loïc Dachary, 10/13/2015 11:12 PM
Actions #1

Updated by Loïc Dachary over 8 years ago

  • Description updated (diff)
Actions #2

Updated by Loïc Dachary over 8 years ago

@Samuel Hassine: I'm not sure which way to go with this bug (assuming my reading is correct). It could be considered a bug that recovery stops making progress while the workload is running. In that case the right fix would be to make the failure show every time instead of only sometimes, by raising the workload execution time from 1200 to 3600 seconds so that it always outlasts the recovery timeout. If, however, it is acceptable for recovery to make no progress while the workload runs, the fix would be to make the wait_for_recovery timeout much larger than the workload execution time, so that recovery has the time it needs to complete once the workload finishes.

Actions #3

Updated by Nathan Cutler over 8 years ago

@Loïc Dachary: Could this be showing up in the firefly upgrade suite, too? This test has very similar log messages to the ones in this issue: http://pulpito.ceph.com/loic-2015-09-27_01:05:46-upgrade:firefly-firefly-backports---basic-vps/1071164/

@Nathan Weinberg It does look similar indeed. The operations involved are even more complicated, and the reasons why recovery would stall are even harder to figure out in your case.

2015-09-28T13:21:19.917 INFO:tasks.thrashosds.ceph_manager ... u'recovering_bytes_per_sec': 27793865 ...
...
2015-09-28T13:41:40.653 ERROR:teuthology.run_tasks:Saw exception from tasks.
AssertionError: failed to recover before timeout expired
...

Actions #5

Updated by Samuel Just over 8 years ago

  • Project changed from 23 to Ceph
  • Subject changed from thrashosd: ceph_manager.wait_for_recovery races against thrashosds pggrow to backfill pushes do not appear to increment the recovery rate counters
  • Priority changed from Normal to Urgent

It seems that neither ReplicatedBackend nor ECBackend increments the num_recovered counter and friends when not doing a local recovery. Currently on_local_recover gets passed a stat delta used to update the stats. That delta should probably be passed to on_global_recover instead, so that the counters are updated even if the local copy did not need to be recovered.

This tends to cause very slow backfills to erroneously trip the recovery timeout on thrasher tests.
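
For illustration, the bookkeeping change amounts to something like the sketch below (abstract Python, not the actual C++ in ReplicatedBackend/ECBackend): the per-object stat delta is applied when the object is globally recovered, so a backfill push that never triggers a local recovery still moves the counters that wait_for_recovery watches.

class RecoveryStats(object):
    # abstract model of the recovery counters ("num_recovered and friends")
    def __init__(self):
        self.objects_recovered = 0
        self.bytes_recovered = 0
        self.keys_recovered = 0

    def on_global_recover(self, stat_delta):
        # called once per object, when every replica/shard has it;
        # previously the delta was only applied on *local* recovery
        self.objects_recovered += stat_delta['objects_recovered']
        self.bytes_recovered += stat_delta['bytes_recovered']
        self.keys_recovered += stat_delta['keys_recovered']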

Actions #6

Updated by Samuel Just over 8 years ago

  • Backport set to hammer,firefly(?)

Not sure about backport appropriateness.

Actions #7

Updated by Samuel Just over 8 years ago

I think that on_global_recover should get a stat_delta with

objects_recovered = 1
bytes_recovered = object_size * num_replicas_recovered * (1/K for ec?)
keys_recovered = keys_in_object * num_replicas_recovered
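
As a rough illustration of that delta (Python, with the same open question about the EC factor as the "?" above):

def recovery_stat_delta(object_size, keys_in_object, num_replicas_recovered, ec_k=None):
    # per-object delta as proposed above; for EC each recovered shard
    # holds roughly object_size / K bytes, hence the 1/K factor
    bytes_per_copy = object_size if ec_k is None else object_size / ec_k
    return {
        'objects_recovered': 1,
        'bytes_recovered': bytes_per_copy * num_replicas_recovered,
        'keys_recovered': keys_in_object * num_replicas_recovered,
    }

# a 1 MiB object pushed to 2 replicas: 2 MiB counted as recovered
# the same object in a k=2 EC pool, 2 shards pushed: 1 MiB counted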

Actions #8

Updated by Loïc Dachary over 8 years ago

  • Status changed from New to 12
  • Assignee set to Loïc Dachary
Actions #10

Updated by Loïc Dachary over 8 years ago

Attaching part of the osd.4 log, right in the middle of what was mistaken for a lack of recovery progress. The recovery-related messages in it help navigate the code.

Actions #11

Updated by Loïc Dachary over 8 years ago

  • Status changed from 12 to Fix Under Review
Actions #12

Updated by Sage Weil over 8 years ago

  • Status changed from Fix Under Review to In Progress
Actions #13

Updated by Loïc Dachary over 8 years ago

Sam included it yesterday in his testing branch.

Actions #14

Updated by Loïc Dachary over 8 years ago

  • Assignee changed from Loïc Dachary to Samuel Just
Actions #15

Updated by Samuel Just about 8 years ago

  • Status changed from In Progress to Resolved

b04e17687fcb4609a8c6c3623bd0786116733465
