Bug #6226
after editing crushmap and adding new hosts, injecting it, several existing OSDs crashed
Status: Closed
Description
I edited the crushmap on our 70-OSD, 10-server Ceph 0.61.5 cluster (see attached file) and injected it.
I had trouble with the new OSDs (36, 65, 66, 67, 68, 69) ending up under the wrong hosts, so I repositioned them with:
ceph osd crush set 68 0.5 root=ssd host=h4ssd
This added them to the correct spot in the hierarchy.
However, around 10 of my existing OSD processes crashed with the attached logs (search for "assert" to find the crash).
I'm not sure whether the two events are related.
Updated by Samuel Just over 10 years ago
- Assignee changed from Samuel Just to David Zafman
Updated by David Zafman over 10 years ago
The bug description says the cluster is running v0.61.5, but the attached log says v0.61.7. Could there be a mix of versions across nodes?
I haven't yet been able to reproduce with all machines running v0.61.7.
Updated by Jens-Christian Fischer over 10 years ago
I was wrong; we are indeed on 0.61.7:
root@ineri ~$ ndo all_nodes ceph --version
h0 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
h1 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
h2 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
h3 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
h4 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
h5 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
s0 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
s1 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
s2 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
s4 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
This bug has not resurfaced (luckily), and I'm not particularly keen on trying to reproduce it :)
Not sure what the best way forward is at this point.
/jc
Updated by David Zafman over 10 years ago
There is a race, already fixed in a later release by 01d3e094, which could allow start_recovery_ops() to be called with a negative value for the "max" argument. The way to hit this assert is for recovery to be required on one or more replicas but not on the primary. In that case no replica operations would be started, and the code would attempt to transition to the Recovered state.
I think we should add a call to needs_recovery() before assuming that recovery must be done. There are other possible obscure error paths that could also lead to this assert. That change could be backported too.
Updated by David Zafman over 10 years ago
- Status changed from New to Fix Under Review
Updated by David Zafman over 10 years ago
- Status changed from Fix Under Review to Resolved
This was already fixed by backport commit 1ea6b561 in the v0.61.8 release. See the previous comment.