
Bug #13453

Updated by Loïc Dachary over 8 years ago

When "ceph_manager.wait_for_recovery":https://github.com/ceph/ceph-qa-suite/blob/master/tasks/ceph_manager.py#L696 runs against 
 <pre> 
 - thrashosds: 
     timeout: 1200 
     chance_pgnum_grow: 2 
     chance_pgpnum_fix: 1 
 </pre> 
 recovery stops making progress at some point. The problem will not show if the machine is fast enough to complete the job before the timeout. 
 For instance in *rados/thrash/{0-size-min-size-overrides/2-size-1-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/osd-delay.yaml thrashers/pggrow.yaml workloads/cache-agent-big.yaml}* the workload is: 
 <pre> 
   - rados: 
       clients: 
       - client.0 
       max_seconds: 1200 
       objects: 6600 
       op_weights: 
         copy_from: 50 
         delete: 50 
         read: 100 
         write: 100 
       ops: 10000 
       pools: 
       - base 
       size: 1024 
 </pre> 
 and lasts *1200* seconds, while the timeout waiting for recovery to make progress is also *1200* seconds. In most cases some progress will be made and the recovery timeout will not be triggered before the workload completes. But sometimes it is triggered, and that reveals the problem. The relevant log entries are: 
 <pre> 
 2015-10-11T02:11:04.821 INFO:tasks.ceph.ceph_manager:waiting for recovery to complete 
 ... 
   36380:2015-10-11T02:24:54.897 INFO:tasks.ceph.ceph_manager:making progress, resetting timeout 
   36567:2015-10-11T02:25:01.512 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now 
 ... 
 2015-10-11T02:25:07.778 INFO:tasks.ceph.ceph_manager ... recovery 184/5876 objects degraded (3.131%)... 
 ... 
 2015-10-11T02:45:02.844 INFO:tasks.rados.rados.0.target240173.stdout:6593: done (0 left) 
 2015-10-11T02:45:02.929 INFO:tasks.rados.rados.0.target240173.stderr:0 errors. 
 ... 
 2015-10-11T02:45:12.012 INFO:tasks.ceph.ceph_manager ... recovery 2216/23749 objects degraded (9.331%) 
 ... 
 2015-10-11T02:45:12.836 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now 
 2015-10-11T02:45:12.837 ERROR:teuthology.run_tasks:Manager failed: thrashosds 
 </pre> 
 At 02:11 wait_for_recovery starts and sees progress until 02:24. After that it no longer sees progress, until the timeout expires 20 minutes (1200 seconds) later at 02:45. During that interval, the number of degraded objects grows from 184 at 02:25 to 2216 at 02:45. Despite that, the rados workload completed successfully at 02:45. If given a little more time, wait_for_recovery would probably have observed progress again and the job would have completed successfully. 
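 The behaviour described above can be illustrated with a minimal sketch of such a progress-watching loop (this is an illustration, not the actual code from tasks/ceph_manager.py; the @get_num_degraded@ helper is a hypothetical stand-in for querying the cluster status): 
 <pre> 
 import time 
 
 
 def wait_for_recovery(get_num_degraded, timeout=1200, poll_interval=5): 
     """Wait until no objects are degraded, resetting the timeout 
     whenever the degraded count drops (i.e. progress is seen). 
 
     get_num_degraded: callable returning the current number of 
     degraded objects (hypothetical stand-in for parsing `ceph -s`). 
     """ 
     deadline = time.time() + timeout 
     last_degraded = get_num_degraded() 
     while last_degraded > 0: 
         if time.time() > deadline: 
             # No progress seen for `timeout` seconds: fail the run, 
             # even though recovery might have resumed soon after. 
             raise RuntimeError("no progress for %d seconds" % timeout) 
         time.sleep(poll_interval) 
         degraded = get_num_degraded() 
         if degraded < last_degraded: 
             # making progress, resetting timeout 
             deadline = time.time() + timeout 
         # If the count grows instead (184 -> 2216 in the log above), 
         # the deadline is kept, so a workload that keeps producing 
         # degraded objects for `timeout` seconds trips the error. 
         last_degraded = degraded 
 </pre> 
 With a 1200-second workload and a 1200-second timeout, the loop only has to go one full timeout interval without seeing the degraded count drop for the run to fail, even if the workload itself finishes cleanly. 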
