Bug #13453
Updated by Loïc Dachary over 8 years ago
When "ceph_manager.wait_for_recovery":https://github.com/ceph/ceph-qa-suite/blob/master/tasks/ceph_manager.py#L696 runs against <pre> - thrashosds: timeout: 1200 chance_pgnum_grow: 2 chance_pgpnum_fix: 1 </pre> recovery stop making progress at some point. It will not show if the machine is fast enough to complete the job before the timeout. For instance in *rados/thrash/{0-size-min-size-overrides/2-size-1-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/pggrow.yaml workloads/cache-agent-big.yaml}* the workload is: <pre> - rados: clients: - client.0 max_seconds: 1200 objects: 6600 op_weights: copy_from: 50 delete: 50 read: 100 write: 100 ops: 10000 pools: - base size: 1024 </pre> and lasts *1200* seconds while the timeout waiting for recovery to make progress is also *1200*. In most cases some progress will be made and the recovery timeout will not be triggered before the workload completes. But sometimes it does and that reveals the problem. The relevant log entries are: <pre> 2015-10-11T02:11:04.821 INFO:tasks.ceph.ceph_manager:waiting for recovery to complete ... 36380:2015-10-11T02:24:54.897 INFO:tasks.ceph.ceph_manager:making progress, resetting timeout 36567:2015-10-11T02:25:01.512 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now ... 2015-10-11T02:25:07.778 INFO:tasks.ceph.ceph_manager ... recovery 184/5876 objects degraded (3.131%)... ... 2015-10-11T02:45:02.844 INFO:tasks.rados.rados.0.target240173.stdout:6593: done (0 left) 2015-10-11T02:45:02.929 INFO:tasks.rados.rados.0.target240173.stderr:0 errors. ... 2015-10-11T02:45:12.012 INFO:tasks.ceph.ceph_manager ... recovery 2216/23749 objects degraded (9.331%) ... 2015-10-11T02:45:12.836 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now 2015-10-11T02:45:12.837 ERROR:teuthology.run_tasks:Manager failed: thrashosds </pre> At 02:11 wait_for_recovery starts and sees progress until 02:24. After that time and until the timeout expires, it will no longer see progress until the timeout expires 20 minutes later (1200 seconds) at 02:45. During that interval, the number of degraded object grows from 184 at 02:25 to 2216 at 02:45. Despite of that, the rados workload completed successfully at 02:45. If given a little more time, wait_for_clean would probably have observed progress again and completed the job successfully.