Bug #6781
closed
timed out waiting for recovery - probably ceph command hang
Description
ubuntu@teuthology:/a/samuelj-2013-11-14_15:22:25-rados-wip-6761-emperor-testing-basic-plana/99764
ceph.log indicates that the osds actually recovered just fine. In all likelihood the ceph pg dump command hung.
Updated by Tamilarasi muthamizhan over 10 years ago
The ceph pg dump command hung. The test machines that hung in the nightlies [plana33, plana18] are still available if someone wants to take a look.
logs: ubuntu@teuthology:/a/teuthology-2013-11-21_19:40:02-upgrade-parallel-master-testing-basic-plana/112925
Updated by Tamilarasi muthamizhan over 10 years ago
Please ignore the previous comment; it looks like the test is still running.
Updated by Dan Mick over 10 years ago
17:28:17.774153: ceph.log shows pgs bumped to 82 (from 72)
17:28:56.294849: last osd map change in ceph.log
17:58:37.860580: ceph.log shows the pgs being clean
17:59:02.081: teuthology.log reports assertion failed
thrash timeout was 1200 seconds (20 min)
So it seems to me that the monitoring was working properly, but recovery really did take longer than 20 minutes: roughly 30 minutes passed between the last osd map change and the pgs being clean. I don't know why it took that long. Sam, do you agree that the analysis above makes sense?
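For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the gaps from the timestamps quoted above. The event names are made up for illustration, and it assumes all four timestamps fall on the same day. Which event the thrasher actually starts its timeout from is not established in this ticket, so both candidate start points are shown.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S.%f"
THRASH_TIMEOUT = timedelta(seconds=1200)  # 20 min, as quoted above

# Timestamps copied from the comment above; the keys are illustrative names.
events = {
    "pg_num_bumped": "17:28:17.774153",
    "last_osdmap_change": "17:28:56.294849",
    "pgs_clean": "17:58:37.860580",
    "assertion_failed": "17:59:02.081",
}
times = {name: datetime.strptime(stamp, FMT) for name, stamp in events.items()}

# Elapsed time until the pgs were clean, from each candidate start point.
since_bump = times["pgs_clean"] - times["pg_num_bumped"]
since_last_map = times["pgs_clean"] - times["last_osdmap_change"]

for label, delta in (("pg bump", since_bump), ("last osd map change", since_last_map)):
    over = delta - THRASH_TIMEOUT
    print(f"{label} -> clean: {delta.total_seconds():.1f}s "
          f"(over timeout by {over.total_seconds():.1f}s)")
```

Either way, recovery took well over the 1200-second limit, which matches the reading that the monitoring itself was not what hung.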
Updated by Samuel Just over 10 years ago
That sounds like a different bug, then. We'd want to reproduce it with logs (osd20, filestore20, ms1).
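For reference, the shorthand above presumably refers to the usual Ceph debug levels. As a ceph.conf fragment it would look roughly like this; placing it in the [osd] section is an assumption, and how the nightly suite would inject it is not covered here.

```
[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1
```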
Updated by Dan Mick over 10 years ago
So procedurally, what's the right course here? Would the behavior have had anything to do with the branch you were testing? Should we change the nightly job to increase logging for the rados-*-testing-basic suite?
Updated by Sage Weil over 10 years ago
- Status changed from New to Can't reproduce
Let's wait for this to show up on a real branch.