Bug #6781 (closed)

timed out waiting for recovery - probably ceph command hang

Added by Samuel Just over 10 years ago. Updated over 10 years ago.

Status: Can't reproduce
Priority: Urgent
Category: -
Target version: -
% Done: 0%
Source: other
Backport: emperor
Severity: 3 - minor

Description

ubuntu@teuthology:/a/samuelj-2013-11-14_15:22:25-rados-wip-6761-emperor-testing-basic-plana/99764

ceph.log indicates that the OSDs actually recovered just fine. In all likelihood, the ceph pg dump command hung.
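
For context, a minimal sketch of what a "wait for recovery" check amounts to, and why a hung ceph pg dump surfaces as a recovery timeout. This is an illustrative approximation, not the teuthology implementation; the function name, timeout values, and JSON key handling are assumptions.

```python
import json
import subprocess
import time

def wait_for_clean(timeout=1200, poll_interval=10, cmd_timeout=120):
    """Poll PG state until every PG reports active+clean, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            out = subprocess.run(
                ['ceph', 'pg', 'dump', '--format=json'],
                capture_output=True, check=True, timeout=cmd_timeout,
            ).stdout
        except subprocess.TimeoutExpired:
            # The suspected failure mode in this ticket: the command itself hangs.
            raise RuntimeError('ceph pg dump hung (> %d s)' % cmd_timeout)
        data = json.loads(out)
        # The JSON layout differs across Ceph releases; pg_stats may be
        # top-level or nested under pg_map.
        pgs = data.get('pg_stats') or data.get('pg_map', {}).get('pg_stats', [])
        if pgs and all(pg['state'] == 'active+clean' for pg in pgs):
            return
        time.sleep(poll_interval)
    raise RuntimeError('timed out waiting for recovery')
```

Without a per-command timeout like the one above, a hung ceph pg dump and a genuinely slow recovery are indistinguishable to the outer watchdog.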


Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #6776: nightly failure: timed out waiting for admin_socket after osd restarted (Duplicate, 11/14/2013)

Actions #1

Updated by Ian Colle over 10 years ago

  • Assignee set to Dan Mick
Actions #2

Updated by Tamilarasi muthamizhan over 10 years ago

The ceph pg dump command hung. The test machines [plana33, plana18] that hung on the nightlies are available if someone wants to take a look.

logs: ubuntu@teuthology:/a/teuthology-2013-11-21_19:40:02-upgrade-parallel-master-testing-basic-plana/112925

Actions #3

Updated by Tamilarasi muthamizhan over 10 years ago

Please ignore the previous comment; it looks like the test is still running...

Actions #4

Updated by Dan Mick over 10 years ago

17:28:17.774153: ceph.log shows pgs bumped to 82 (from 72)
17:28:56.294849: last osd map change from ceph.log

17:58:37.860580: ceph.log shows the pgs being clean.

17:59:02.081: teuthology.log reports assertion failed

thrash timeout was 1200 seconds (20 min)

So it seems to me that the monitoring was working properly, but recovery really did take longer than 20 minutes. As to why, I don't know, but Sam, do you agree that the analysis above makes sense?
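
For reference, the gap between the last osd map change and the PGs reporting clean in the timeline above can be checked directly; it comfortably exceeds the 1200 s thrash timeout. A small illustrative calculation, not part of the test suite:

```python
from datetime import datetime, timedelta

# Timestamps taken from the ceph.log entries quoted above.
last_map_change = datetime.strptime('17:28:56.294849', '%H:%M:%S.%f')
pgs_clean = datetime.strptime('17:58:37.860580', '%H:%M:%S.%f')

elapsed = pgs_clean - last_map_change
print(elapsed)                              # 0:29:41.565731, i.e. ~30 minutes
assert elapsed > timedelta(seconds=1200)    # exceeds the 20-minute thrash timeout
```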

Actions #5

Updated by Samuel Just over 10 years ago

That sounds like a different bug, then. We'd want logs (osd20, filestore20, ms1) to reproduce.
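
The shorthand "osd20, filestore20, ms1" refers to Ceph debug log levels. A hedged illustration of the corresponding teuthology-style ceph.conf overrides, expressed here as a Python dict; the exact fragment and where it would live in the suite are assumptions:

```python
# Hypothetical overrides fragment raising debug verbosity on the OSDs.
debug_overrides = {
    'overrides': {
        'ceph': {
            'conf': {
                'osd': {
                    'debug osd': 20,        # verbose OSD peering/recovery logging
                    'debug filestore': 20,  # FileStore transaction logging
                    'debug ms': 1,          # messenger send/receive summaries
                },
            },
        },
    },
}
```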

Actions #6

Updated by Dan Mick over 10 years ago

So, procedurally, what's the right course here? Would the behavior have had anything to do with the branch you were testing? Should we change the nightly job to increase logging for the rados-*-testing-basic suite?

Actions #7

Updated by Sage Weil over 10 years ago

  • Status changed from New to Can't reproduce

Let's wait for this on a real branch.
