Bug #4812

closed

mon/Monitor.cc: 1107: FAILED assert(0 == "Unable to find a new monitor to connect to. Not cool.")

Added by Samuel Just almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Monitor
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

0> 2013-04-25 05:52:00.052720 7f126a7fc700 -1 mon/Monitor.cc: In function 'void Monitor::sync_timeout(entity_inst_t&)' thread 7f126a7fc700 time 2013-04-25 05:52:00.051740
mon/Monitor.cc: 1107: FAILED assert(0 == "Unable to find a new monitor to connect to. Not cool.")

ceph version 0.60-653-gf480484 (f4804849b7644f2c1dfd92404682f510a88e9a23)
1: (Monitor::sync_timeout(entity_inst_t&)+0x4f7) [0x4bf5e7]
2: (Context::complete(int)+0xa) [0x4c70ba]
3: (SafeTimer::timer_thread()+0x425) [0x643425]
4: (SafeTimerThread::entry()+0xd) [0x64405d]
5: (()+0x7e9a) [0x7f1272d07e9a]
6: (clone()+0x6d) [0x7f1271524cbd]

Actions #1

Updated by Samuel Just almost 11 years ago

ubuntu@teuthology:/a/teuthology-2013-04-25_01:00:08-rados-next-testing-basic/587/

Actions #2

Updated by Greg Farnum almost 11 years ago

  • Status changed from 12 to In Progress
  • Assignee set to Greg Farnum
Actions #3

Updated by Greg Farnum almost 11 years ago

Okay, this is sort of what was supposed to happen, I think. mon c stopped responding to mon b's sync queries, and b couldn't find anybody else to sync from, so b asserted out.

However, I don't think mon c should have stopped responding — there was an election and general reset, though, so I bet that's the issue. Not certain who or what is supposed to handle this yet.

Actions #4

Updated by Greg Farnum almost 11 years ago

Yep, bootstrap() calls reset_sync(). So c dropped b's sync on the floor, and then b timed out of course. Was it supposed to restart on its own somehow?

Nor can I figure out why b didn't try to connect to a.

Actions #5

Updated by Greg Farnum almost 11 years ago

  • Status changed from In Progress to 4
  • Priority changed from Urgent to High

The only way I can see this assert happening is if b randomly selected either the previously chosen monitor (c) or itself 6 times in a row, which I guess isn't that improbable.
If it hadn't done that, then b would have tried to continue its sync from a.
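
To put a rough number on "6 times in a row", here is a minimal Python sketch. It assumes a three-monitor cluster (a, b, c) and an independent, uniform pick over all three on each attempt; that selection model is an assumption for illustration, not taken from Monitor.cc. The function names `prob_no_usable_pick` and `pick_excluding` are hypothetical, and the second one only illustrates the general idea of filtering out unusable candidates before choosing; it is not the change in wip-4812.

```python
from fractions import Fraction
import random


def prob_no_usable_pick(num_mons, num_unusable, attempts):
    """Probability that every one of `attempts` uniform random picks
    lands on an unusable monitor (assumed model, not Ceph's code)."""
    return Fraction(num_unusable, num_mons) ** attempts


def pick_excluding(mons, excluded, rng=random):
    """Pick uniformly among monitors not in `excluded`.
    Returns None when no candidate is left."""
    candidates = [m for m in mons if m not in excluded]
    return rng.choice(candidates) if candidates else None


# b can pick a, b, or c; b (itself) and c (the previous provider) are unusable.
p = prob_no_usable_pick(num_mons=3, num_unusable=2, attempts=6)
print(p, float(p))  # 64/729, roughly 0.088

# Excluding self and the previous provider leaves only a.
print(pick_excluding(["a", "b", "c"], excluded={"b", "c"}))  # a
```

Under those assumptions the chance of six unusable picks in a row is (2/3)^6 = 64/729, a bit under 9%, which fits with seeing this occasionally in automated test runs.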

I'm not sure this is a release blocker, though; just turning the monitor back on should be fine. I'm similarly not sure why this is an assert instead of a graceful shutdown or something. I want to discuss the intended design before making a permanent solution.

However, there's a band-aid fix (untested) in wip-4812 that would have prevented it in this case.

Actions #6

Updated by Greg Farnum almost 11 years ago

  • Status changed from 4 to Resolved

Merged into next in 5fa3cbf520f5aeb9e0101c1263f681542d3069a5
Created #4835 to track the other issues I raised.
