Bug #6226 (closed): after editing the crushmap to add new hosts and injecting it, several existing OSDs crashed

Added by Jens-Christian Fischer over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Normal
Assignee: David Zafman
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor

Description

I edited the crushmap on our 70-OSD, 10-server, 0.61.5 Ceph cluster (see attached file) and injected it.
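For reference, the usual crushmap edit/inject cycle looks roughly like this (file paths are only illustrative):

ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
# edit /tmp/crushmap.txt to add the new hosts and OSDs
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
ceph osd setcrushmap -i /tmp/crushmap.new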

I had trouble with the new OSDs (36, 65, 66, 67, 68, 69) ending up under the wrong hosts, so I set them using:

ceph osd crush set 68 0.5 root=ssd host=h4ssd

This added them to the correct spot in the hierarchy.
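The resulting placement can be double-checked with the tree view, for example:

ceph osd tree

which prints the CRUSH hierarchy with roots, hosts and per-OSD weights.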

However, around 10 of my existing OSD processes crashed with the logs in the attachment; search for "assert" to find the crash.

I'm not sure whether this is related to the crushmap change.


Files

issue-crush-map.txt (6.3 KB), Jens-Christian Fischer, 09/04/2013 07:39 AM
ceph-osd.6.log.gz (4.44 MB), Jens-Christian Fischer, 09/04/2013 07:39 AM
#1 - Updated by Ian Colle over 10 years ago

  • Assignee set to Samuel Just
#2 - Updated by Samuel Just over 10 years ago

  • Assignee changed from Samuel Just to David Zafman
#3 - Updated by David Zafman over 10 years ago

The bug description says the cluster is running v0.61.5, but the attached log says v0.61.7. Could there be a mix of versions across nodes?

I haven't yet been able to reproduce with all machines running v0.61.7.

#4 - Updated by Jens-Christian Fischer over 10 years ago

I was wrong - we are indeed on 0.61.7:

root@ineri ~$ ndo all_nodes ceph --version
h0 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
h1 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
h2 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
h3 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
h4 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
h5 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
s0 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
s1 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
s2 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
s4 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)

This bug has not resurfaced (luckily) and I'm not particularly keen on trying to reproduce it :)

Not sure what the best way forward is at this point.

/jc

#5 - Updated by David Zafman over 10 years ago

There is a race, already fixed in a later release by commit 01d3e094, which could allow start_recovery_ops() to be called with a negative value for the "max" argument. The way to hit this assert is for recovery to be required on one or more replicas but not on the primary: in that case no replica operations would be started, yet the code would attempt to transition to the Recovered state.

I think we should add a call to needs_recovery() before assuming that recovery must be done; there are other obscure error paths that could result in this assert. That change could be backported too.

#6 - Updated by David Zafman over 10 years ago

  • Status changed from New to Fix Under Review
#7 - Updated by David Zafman over 10 years ago

  • Status changed from Fix Under Review to Resolved

This was already fixed by backport commit 1ea6b561 in the v0.61.8 release. See the previous comment.
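For anyone checking a particular build, whether a release tag contains the backport can be confirmed from a ceph.git checkout, for example:

git tag --contains 1ea6b561
# v0.61.8 should be listed if the backport is present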
