Bug #5518
osd: marking single osd down makes others go down (cuttlefish)
Status: Closed
Description
Settings:
- paxos propose interval = 1
- debug ms = 1
- debug osd = 20
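Expressed as a ceph.conf fragment (section placement is our assumption; these can also be injected at runtime):

```ini
[global]
    ; wire-level message logging
    debug ms = 1

[mon]
    ; proposal interval used during the test
    paxos propose interval = 1

[osd]
    ; verbose OSD logging
    debug osd = 20
```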
Log:
07:57: Cluster health OK.
07:58: set logging levels.
07:59: marked osd.0 down.
08:00: many (20+) osds going down.
08:01: all client I/O stops.
08:02: set "nodown" to prevent the whole cluster from going down.
08:03: reset debug levels to defaults to prevent running out of log space.
08:05: cluster slowly recovering - still many stuck and slow requests.
08:20: many stuck peerings observed.
08:20: two osd segfaults observed - coredumps attached to case.
08:32: cluster back to normal.
08:35: unset "nodown".
Observations:
marking one osd down took 20+ others down.
many slow requests observed
client I/O stalled completely
several pg's stuck inactive
stuck peerings observed, had to administratively set osd's down to recover.
Notes:
1) We had to turn logging levels down during the event because disk space was filling up very fast, and this turned out to be more violent than anticipated. I thought we had enough space for one event, but apparently not. Some log fs's filled up. - However, all osd "down" events were captured in their entirety; only some of the subsequent peering is missing. - Let me know if you need us to make another test. - We have osd logs on their data volumes now.
2) Note that this happened with paxos propose interval = 1.
3) One interesting note: the first OSD's seen down (many others went down later, though):
- osd.156
- osd.177
- osd.188
- osd.215
- osd.226
- osd.227
- osd.248
- osd.251
- osd.260
- osd.235
- osd.241
All of these are in the cph1c16 rack, whereas osd.0, which started the party, is located in the cph1f11 rack. Those two racks share all the primary osd's, so osd.0 does not peer with any other osd's in cph1f11, only with osd's in cph1c16 and cph2i11. cph2i11 only holds non-primaries.
4) osd.24 and osd.36 both segfaulted during peering. coredumps are uploaded in the .tar.gz
5) It looks as if the cascading down-marking is more dramatic if the cluster has run for a long time without any osd's being down. - This test was carried out after 48 hours of stable osd's.
6) It looks as if osd segfaults are related to these peering events somehow. It has been more than 48 hours since we had a segfault, but during the test we observed two. (And during the three day backfilling, we saw many segfaults happen).
7) All osd/mon logs and coredumps are cephdropped as */*_peering-kick-test_1.tar.gz (42GB unpacked)
Updated by Sage Weil almost 11 years ago
2013-07-08 11:45:44.098012 7f1b830c9700 10 osd.91 157333 send_incremental_map 125829 -> 157333 to 0x5adc7ce0 10.81.144.109:6821/4751
in osd.91's log
this loops for ages, blocks op_tp, and the heartbeats stop
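The failure mode can be modelled roughly like this (illustrative sketch, not Ceph code; the per-map cost and grace period are assumptions): the log shows osd.91 pushing every incremental map from epoch 125829 to 157333 in one synchronous loop, so the thread pool that would otherwise service heartbeats stays busy long past any plausible grace period, and peers report the OSD down.

```python
# Illustrative model (not Ceph code): why a long synchronous
# send_incremental_map loop starves heartbeats long enough for
# peers to mark the OSD down.

HEARTBEAT_GRACE = 20.0   # assumed: seconds peers wait before reporting an OSD down
SEND_COST = 0.001        # assumed: seconds to encode/send one incremental map

def time_to_send(first_epoch, last_epoch, send_cost=SEND_COST):
    """Time the op thread is busy sending every incremental in the range."""
    return (last_epoch - first_epoch) * send_cost

def marked_down(busy_seconds, grace=HEARTBEAT_GRACE):
    """An OSD whose heartbeats stall longer than the grace period
    gets reported down by its peers."""
    return busy_seconds > grace

# The range from osd.91's log: 125829 -> 157333 (~31,500 epochs).
busy = time_to_send(125829, 157333)
print(f"op thread blocked for ~{busy:.1f}s")      # -> op thread blocked for ~31.5s
print("peers mark it down:", marked_down(busy))   # -> peers mark it down: True
```

Even with a generous per-map cost estimate, a range of tens of thousands of epochs dwarfs any heartbeat grace period, which matches the cascade seen in the report.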
Updated by Sage Weil almost 11 years ago
- Status changed from 12 to Fix Under Review
Updated by Samuel Just almost 11 years ago
We will usually have more than the most recent 25 in cache, maybe a larger default?
Updated by Sage Weil almost 11 years ago
hmm, yeah.. how about 100? that's also the max # of maps we will shove in a single MOSDMap message. otherwise ok?
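The change being discussed would amount to something like the following ceph.conf fragment (option names taken from the Ceph OSD config reference; treating 100 as both the cache size and the per-message cap is our reading of the comment above, not the committed patch):

```ini
[osd]
    ; keep enough recent maps in memory that incremental
    ; sends are served from cache rather than disk
    osd map cache size = 100
    ; max number of maps shoved into a single MOSDMap message
    osd map message max = 100
```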
Updated by Sage Weil almost 11 years ago
- Status changed from Fix Under Review to Resolved
Updated by Josh West almost 11 years ago
Hi Sage,
Was this resolved with a patch to Cuttlefish, that should make it in the next minor release?
Thanks.
--Josh West
Updated by Sage Weil almost 11 years ago
Hi Josh-
yes, this is fixed in the latest cuttlefish (0.61.5, about to send out the announcement now). 78f226634bd80f6678b1f74ccf785bc52fcd6b62