Bug #6781 (closed)

timed out waiting for recovery - probably ceph command hang

Added by Samuel Just over 10 years ago. Updated over 10 years ago.

Status: Can't reproduce
Priority: Urgent
Category: -
Target version: -
% Done: 0%
Source: other
Backport: emperor
Severity: 3 - minor

Description

ubuntu@teuthology:/a/samuelj-2013-11-14_15:22:25-rados-wip-6761-emperor-testing-basic-plana/99764

ceph.log indicates that the OSDs actually recovered just fine. In all likelihood, the ceph pg dump command hung.
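
For context, a minimal sketch of what a "wait for recovery" check amounts to, and why a hung ceph pg dump surfaces as a recovery timeout. This is an illustrative approximation, not the teuthology implementation; the function name, timeout values, and JSON key handling are assumptions.

```python
import json
import subprocess
import time

def wait_for_clean(timeout=1200, poll_interval=10, cmd_timeout=120):
    """Poll PG state until every PG reports active+clean, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            out = subprocess.run(
                ['ceph', 'pg', 'dump', '--format=json'],
                capture_output=True, check=True, timeout=cmd_timeout,
            ).stdout
        except subprocess.TimeoutExpired:
            # The suspected failure mode in this ticket: the command itself hangs.
            raise RuntimeError('ceph pg dump hung (> %d s)' % cmd_timeout)
        data = json.loads(out)
        # The JSON layout differs across Ceph releases; pg_stats may be
        # top-level or nested under pg_map.
        pgs = data.get('pg_stats') or data.get('pg_map', {}).get('pg_stats', [])
        if pgs and all(pg['state'] == 'active+clean' for pg in pgs):
            return
        time.sleep(poll_interval)
    raise RuntimeError('timed out waiting for recovery')
```

Without a per-command timeout like the one above, a hung ceph pg dump and a genuinely slow recovery are indistinguishable to the outer watchdog.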


Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #6776: nightly failure: timed out waiting for admin_socket after osd restarted (Duplicate, 11/14/2013)

Actions #1

Updated by Ian Colle over 10 years ago

  • Assignee set to Dan Mick
Actions #2

Updated by Tamilarasi muthamizhan over 10 years ago

The ceph pg dump command hung. The test machines [plana33, plana18] that hung on the nightlies are available if someone wants to take a look.

logs: ubuntu@teuthology:/a/teuthology-2013-11-21_19:40:02-upgrade-parallel-master-testing-basic-plana/112925

Actions #3

Updated by Tamilarasi muthamizhan over 10 years ago

Please ignore the previous comment; it looks like the test is still running...

Actions #4

Updated by Dan Mick over 10 years ago

17:28:17.774153: ceph.log shows pgs bumped to 82 (from 72)
17:28:56.294849: last osd map change from ceph.log

17:58:37.860580: ceph.log shows the pgs being clean.

17:59:02.081: teuthology.log reports assertion failed

thrash timeout was 1200 seconds (20 min)

So it seems to me that the monitoring was working properly, but recovery really did take longer than 20 minutes. As to why, I don't know, but Sam, do you agree that the analysis above makes sense?
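
For reference, the gap between the last osd map change and the PGs reporting clean in the timeline above can be checked directly; it comfortably exceeds the 1200 s thrash timeout. A small illustrative calculation, not part of the test suite:

```python
from datetime import datetime, timedelta

# Timestamps taken from the ceph.log entries quoted above.
last_map_change = datetime.strptime('17:28:56.294849', '%H:%M:%S.%f')
pgs_clean = datetime.strptime('17:58:37.860580', '%H:%M:%S.%f')

elapsed = pgs_clean - last_map_change
print(elapsed)                              # 0:29:41.565731, i.e. ~30 minutes
assert elapsed > timedelta(seconds=1200)    # exceeds the 20-minute thrash timeout
```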

Actions #5

Updated by Samuel Just over 10 years ago

That sounds like a different bug, then. We'd want logs (osd20, filestore20, ms1) to reproduce.
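
The shorthand "osd20, filestore20, ms1" refers to Ceph debug log levels. A hedged illustration of the corresponding teuthology-style ceph.conf overrides, expressed here as a Python dict; the exact fragment and where it would live in the suite are assumptions:

```python
# Hypothetical overrides fragment raising debug verbosity on the OSDs.
debug_overrides = {
    'overrides': {
        'ceph': {
            'conf': {
                'osd': {
                    'debug osd': 20,        # verbose OSD peering/recovery logging
                    'debug filestore': 20,  # FileStore transaction logging
                    'debug ms': 1,          # messenger send/receive summaries
                },
            },
        },
    },
}
```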

Actions #6

Updated by Dan Mick over 10 years ago

So, procedurally, what's the right course here? Would the behavior have had anything to do with the branch you were testing? Should we change the nightly job to increase logging for the rados-*-testing-basic suite?

Actions #7

Updated by Sage Weil over 10 years ago

  • Status changed from New to Can't reproduce

Let's wait for this on a real branch.
