Bug #4812
closed
mon/Monitor.cc: 1107: FAILED assert(0 == "Unable to find a new monitor to connect to. Not cool.")
Description
0> 2013-04-25 05:52:00.052720 7f126a7fc700 -1 mon/Monitor.cc: In function 'void Monitor::sync_timeout(entity_inst_t&)' thread 7f126a7fc700 time 2013-04-25 05:52:00.051740
mon/Monitor.cc: 1107: FAILED assert(0 == "Unable to find a new monitor to connect to. Not cool.")
ceph version 0.60-653-gf480484 (f4804849b7644f2c1dfd92404682f510a88e9a23)
1: (Monitor::sync_timeout(entity_inst_t&)+0x4f7) [0x4bf5e7]
2: (Context::complete(int)+0xa) [0x4c70ba]
3: (SafeTimer::timer_thread()+0x425) [0x643425]
4: (SafeTimerThread::entry()+0xd) [0x64405d]
5: (()+0x7e9a) [0x7f1272d07e9a]
6: (clone()+0x6d) [0x7f1271524cbd]
Updated by Samuel Just about 11 years ago
ubuntu@teuthology:/a/teuthology-2013-04-25_01:00:08-rados-next-testing-basic/587/
Updated by Greg Farnum about 11 years ago
- Status changed from 12 to In Progress
- Assignee set to Greg Farnum
Updated by Greg Farnum about 11 years ago
Okay, I think this is roughly what was supposed to happen. mon c stopped responding to mon b's sync queries, and b couldn't find another monitor to sync from, so b asserted out.
However, I don't think mon c should have stopped responding — there was an election and general reset, though, so I bet that's the issue. Not certain who or what is supposed to handle this yet.
Updated by Greg Farnum about 11 years ago
Yep, bootstrap() calls reset_sync(). So c dropped b's sync on the floor, and then b timed out of course. Was it supposed to restart on its own somehow?
Nor can I figure out why b didn't try to connect to a.
Updated by Greg Farnum about 11 years ago
- Status changed from In Progress to 4
- Priority changed from Urgent to High
The only way I can see this assert happening is if b randomly selected either the previously chosen monitor (c) or itself 6 times in a row, which I guess isn't impossible.
If it hadn't done that, then b would have tried to continue its sync from a.
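As a rough back-of-the-envelope check on how unlikely that is, the sketch below assumes three monitors (a, b, c), an independent uniform random pick on each attempt, and six attempts. None of these assumptions is confirmed by the log, and the helper name and parameters are hypothetical, not Ceph code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of Ceph: probability that every one of
// `attempts` independent uniform picks among `num_mons` monitors lands on
// one of the `unusable` monitors (here: the previous sync provider or self).
double prob_all_picks_unusable(int num_mons, int unusable, int attempts) {
  assert(num_mons > 0 && unusable >= 0 && unusable <= num_mons && attempts >= 0);
  const double p_bad = static_cast<double>(unusable) / num_mons;
  return std::pow(p_bad, attempts);
}
```

Under those assumptions, `prob_all_picks_unusable(3, 2, 6)` evaluates (2/3)^6 = 64/729, or a little under 9%, which is rare but well within reach of a test suite that runs nightly.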
I'm not sure this is a release blocker, though — just turning the MDS back on should be fine. I'm similarly not sure why this is an assert instead of a graceful shutdown or something. I want to discuss the intended design before making a permanent solution.
However, there's a band-aid fix (untested) in wip-4812 that would have prevented it in this case.
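One way to rule out the repeated bad picks described above is to filter the candidate list before choosing, so that neither the local monitor nor the provider that just timed out can be selected. The sketch below illustrates that idea only; it is not taken from wip-4812, and the function name, parameters, and string-based monitor names are all hypothetical.

```cpp
#include <cstddef>
#include <optional>
#include <random>
#include <string>
#include <vector>

// Hypothetical sketch, not Ceph code: choose a sync provider uniformly from
// the monitors that are neither ourselves nor the provider that just timed out.
std::optional<std::string> pick_sync_provider(
    const std::vector<std::string>& mons,
    const std::string& self,
    const std::string& timed_out,
    std::mt19937& rng) {
  std::vector<std::string> candidates;
  for (const auto& m : mons) {
    if (m != self && m != timed_out) candidates.push_back(m);
  }
  // No usable monitor left: report that to the caller, who decides how to fail.
  if (candidates.empty()) return std::nullopt;
  std::uniform_int_distribution<std::size_t> dist(0, candidates.size() - 1);
  return candidates[dist(rng)];
}
```

With monitors a, b, and c, where b is the local monitor and c has just timed out, the only remaining candidate is a, so a single call always returns it. Returning an empty optional instead of asserting also leaves room for the graceful-shutdown behaviour raised above.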
Updated by Greg Farnum about 11 years ago
- Status changed from 4 to Resolved
Merged into next in 5fa3cbf520f5aeb9e0101c1263f681542d3069a5
Created #4835 to track the other issues I raised.