Bug #6781
closed
timed out waiting for recovery - probably ceph command hang
Added by Samuel Just over 10 years ago.
Updated over 10 years ago.
Description
ubuntu@teuthology:/a/samuelj-2013-11-14_15:22:25-rados-wip-6761-emperor-testing-basic-plana/99764
ceph.log indicates that the osds actually recovered just fine. In all likelihood the ceph pg dump command hung.
The ceph pg dump command hung. The test machines that hung on the nightlies [plana33, plana18] are available if someone wants to take a look.
logs: ubuntu@teuthology:/a/teuthology-2013-11-21_19:40:02-upgrade-parallel-master-testing-basic-plana/112925
please ignore the previous comment, looks like the test is still running...
17:28:17.774153: ceph.log shows pgs bumped to 82 (from 72)
17:28:56.294849: last osd map change in ceph.log
17:58:37.860580: ceph.log shows the pgs being clean
17:59:02.081: teuthology.log reports assertion failed
thrash timeout was 1200 seconds (20 min)
So it seems to me that the monitoring was working properly, but recovery really did take longer than 20 minutes. I don't know why. Sam, do you agree that the analysis above makes sense?
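To make the arithmetic behind that reading explicit, here is a minimal Python sketch that compares the intervals in the timeline against the 1200-second thrash timeout. It assumes all four timestamps fall on the same day; the helper name seconds_between and the variable names are illustrative only and do not come from teuthology or Ceph.

```python
from datetime import datetime

# Thrash timeout quoted in the comment above, in seconds.
TIMEOUT_S = 1200

def seconds_between(start, end):
    """Elapsed seconds between two HH:MM:SS.ffffff stamps (same day assumed)."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()

# Timestamps copied from the timeline in this comment.
pg_bump = "17:28:17.774153"    # ceph.log: pgs bumped to 82
last_map = "17:28:56.294849"   # ceph.log: last osd map change
clean = "17:58:37.860580"      # ceph.log: pgs clean
assertion = "17:59:02.081"     # teuthology.log: assertion failed

since_bump_s = seconds_between(pg_bump, clean)
recovery_s = seconds_between(last_map, clean)
clean_to_assert_s = seconds_between(clean, assertion)

print(f"pg bump to clean:         {since_bump_s:.1f} s")
print(f"last map change to clean: {recovery_s:.1f} s")
print(f"clean to assertion:       {clean_to_assert_s:.1f} s")
print(f"recovery exceeded timeout: {recovery_s > TIMEOUT_S}")
```

Measured from the last osd map change, recovery took a little under 30 minutes, well past the 20-minute limit, which is consistent with a genuinely slow recovery rather than a hung command.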
That sounds like a different bug, then. We'd want logs at higher debug levels (osd 20, filestore 20, ms 1) from a reproduction.
So procedurally, what's the right course here? Would the behavior have had anything to do with the branch you were testing? Should we ... change the nightly job to increase logging for the rados-*-testing-basic suite?
- Status changed from New to Can't reproduce
let's wait for this on a real branch