Bug #5518

closed

osd: marking single osd down makes others go down (cuttlefish)

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Support
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Settings:
- paxos propose interval = 1
- debug ms = 1
- debug osd = 20

Log:
07:57: Cluster health OK.
07:58: set logging levels.
07:59: marked osd.0 down.
08:00: many (20+) osds going down.
08:01: all client I/O stops.
08:02: set "nodown" to prevent the whole cluster from going down.
08:03: reset debug levels to defaults to prevent running out of space.
08:05: cluster slowly recovering - still many stuck and slow requests.
08:20: many stuck peerings observed.
08:20: two osd segfaults observed - coredumps attached to case.
08:32: cluster back to normal.
08:35: unset "nodown".

Observations:
marking one osd down took 20+ others down.
many slow requests observed
client I/O stalled completely
several pg's stuck inactive
stuck peerings observed, had to administratively set osd's down to recover.

Notes:
1) We had to turn logging levels down during the event because disk space was filling up very fast; this turned out to be more violent than anticipated. I thought we had enough space for one event, but apparently not, and some log filesystems filled up. However, all osd "down" events were captured in their entirety; only some of the subsequent peering is missing. Let me know if you need us to run another test. We now have osd logs on their data volumes.

2) Note that this happened with paxos propose interval = 1.

3) One interesting note: the first OSDs seen down (many others went down later, though):
- osd.156
- osd.177
- osd.188
- osd.215
- osd.226
- osd.227
- osd.248
- osd.251
- osd.260
- osd.235
- osd.241
All of these are in the cph1c16 rack, whereas osd.0, which started the party, is located in the cph1f11 rack. Those two racks share all the primary osd's, so osd.0 does not peer with any other osd's in cph1f11, but only with osd's in cph1c16 and cph2i11. cph2i11 only holds non-primaries.

4) osd.24 and osd.36 both segfaulted during peering. Coredumps are uploaded in the .tar.gz.

5) It looks as if the cascading down-marking is more dramatic if the cluster has run for a long time without any osd's being down. This test was carried out after 48 hours of stable osd's.

6) It looks as if osd segfaults are related to these peering events somehow. It has been more than 48 hours since we had a segfault, but during the test we observed two. (And during the three-day backfilling, we saw many segfaults happen.)

7) All osd/mon logs and coredumps are cephdropped as */*_peering-kick-test_1.tar.gz (42GB unpacked)

Actions #1

Updated by Ian Colle almost 11 years ago

  • Assignee set to Samuel Just
Actions #2

Updated by Sage Weil almost 11 years ago

2013-07-08 11:45:44.098012 7f1b830c9700 10 osd.91 157333 send_incremental_map 125829 -> 157333 to 0x5adc7ce0 10.81.144.109:6821/4751

in osd.91's log

this loops for ages, blocks op_tp, and the heartbeats stop
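
For illustration only, here is a minimal, self-contained C++ sketch of the failure mode described in this comment: a worker thread walks a ~30,000-epoch gap sending one incremental map at a time while holding the lock that the heartbeat path also needs, so heartbeat replies stop and peers give up long before the loop finishes. This is not the actual OSD code; the names (send_incremental_map_range, osd_lock, the 2-second grace) are stand-ins chosen for the sketch.

// Illustrative simulation only -- not Ceph code.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

std::mutex osd_lock;                    // stand-in for the lock the op worker holds
std::atomic<bool> marked_down{false};

// Pretend to send incremental maps for epochs [since, to], one at a time,
// without releasing the lock in between -- the behaviour described above.
void send_incremental_map_range(unsigned since, unsigned to) {
    std::lock_guard<std::mutex> lock(osd_lock);
    for (unsigned e = since; e <= to && !marked_down; ++e)
        std::this_thread::sleep_for(1ms);   // cost of encoding/sending one map
}

// Heartbeat responder: it can only answer pings when it can grab the lock.
void heartbeat_loop() {
    const auto grace = 2s;                  // illustrative grace period
    auto last_reply = std::chrono::steady_clock::now();
    while (!marked_down) {
        std::this_thread::sleep_for(100ms);
        if (osd_lock.try_lock()) {
            last_reply = std::chrono::steady_clock::now();
            osd_lock.unlock();
        } else if (std::chrono::steady_clock::now() - last_reply > grace) {
            std::puts("no heartbeat reply within grace -> peers mark the osd down");
            marked_down = true;
        }
    }
}

int main() {
    std::thread hb(heartbeat_loop);
    // A ~30,000-epoch gap, similar in spirit to 125829 -> 157333 in the log,
    // takes far longer than the grace period when sent one map per iteration.
    send_incremental_map_range(1, 30000);
    hb.join();
    return 0;
}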

Actions #3

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to Fix Under Review
Actions #4

Updated by Samuel Just almost 11 years ago

We will usually have more than the most recent 25 in cache, maybe a larger default?

Actions #5

Updated by Sage Weil almost 11 years ago

hmm, yeah.. how about 100? that's also the max # of maps we will shove in a single MOSDMap message. otherwise ok?
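
A rough C++ sketch of the bound being discussed, for illustration only: instead of streaming every epoch from the peer's map up to the current one in a single pass, cap how many incrementals go into one MOSDMap-style message and let the peer come back for the next chunk. The constant and helper names below are assumptions for the sketch, not the actual option names or Ceph code.

// Illustrative only -- not the actual Ceph implementation.
#include <algorithm>
#include <cstdio>

using epoch_t = unsigned;

// Illustrative cap, matching the "100 maps per message" figure above.
constexpr epoch_t MAX_MAPS_PER_MESSAGE = 100;

// Last epoch to include when sharing maps after 'since', so that a single
// message never carries more than MAX_MAPS_PER_MESSAGE incrementals.
epoch_t clamp_share_range(epoch_t since, epoch_t to) {
    return std::min(to, since + MAX_MAPS_PER_MESSAGE);
}

int main() {
    const epoch_t peer_epoch = 125829, our_epoch = 157333;  // the gap from the log above
    epoch_t since = peer_epoch;
    unsigned messages = 0;
    while (since < our_epoch) {
        since = clamp_share_range(since, our_epoch);
        ++messages;
    }
    std::printf("gap of %u epochs split into %u bounded messages\n",
                our_epoch - peer_epoch, messages);
    return 0;
}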

Actions #6

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Resolved
Actions #7

Updated by Josh West almost 11 years ago

Hi Sage,

Was this resolved with a patch to Cuttlefish that should make it into the next minor release?

Thanks.

--Josh West

Actions #8

Updated by Sage Weil almost 11 years ago

Hi Josh-

yes, this is fixed in the latest cuttlefish (0.61.5; about to send out the announcement now). The fix is commit 78f226634bd80f6678b1f74ccf785bc52fcd6b62.
