Bug #3495

ceph-mon crash

Added by Artem Grinblat almost 2 years ago. Updated over 1 year ago.

Status:ResolvedStart date:11/15/2012
Priority:UrgentDue date:
Assignee:Joao Luis% Done:

0%

Category:MonitorSpent time:-
Target version:-
Source:Community (user) Severity:
Backport: Reviewed:
Tags: Suite:

Description

Have found a core of ceph-mon in /.

Ceph is from "deb http://ceph.com/debian-testing/ wheezy main":
  1. ceph -v
    ceph version 0.54 (60b84b095b1009a305d4d6a5b16f88571cbd3150)

crash.7z - core, binary, logs, config (9.1 MB) Artem Grinblat, 11/15/2012 07:00 am

monCrash.tgz - file created with `ceph-debugpack` (1.29 MB) Artem Grinblat, 11/15/2012 07:00 am

failed_in_logs1.png - Searching for "FAILED assert(have_pending)" (78.9 KB) Artem Grinblat, 11/23/2012 11:30 am

failed_in_logs2.png - Found a lot of occurences in the logs (43.3 KB) Artem Grinblat, 11/23/2012 11:30 am

ceph-mon.3495.7z - logs (debug mon = 20; debug ms = 10), core, binary (25.7 MB) Artem Grinblat, 12/05/2012 12:56 am

ceph.3495.rados20.7z - paxos=10 log (26.1 MB) Artem Grinblat, 12/05/2012 11:07 am


Related issues

Duplicated by Bug #3618: 0.55.1 ceph-mon crashes under heavy load Duplicate 12/13/2012
Duplicated by Bug #4737: ceph 0.60 mon crashed: 'void PaxosService::propose_pending() Duplicate 04/17/2013

Associated revisions

Revision 96b82ebf
Added by Joao Luis almost 2 years ago

mon: Monitor: wake up contexts based on paxos machine's state

When recovering the leader, only wake up a paxos machine's contexts if
the paxos machine is in a state that can handle said contexts.

Fixes: #3495

Signed-off-by: Joao Eduardo Luis <>

Revision 96b82ebf
Added by Joao Luis almost 2 years ago

mon: Monitor: wake up contexts based on paxos machine's state

When recovering the leader, only wake up a paxos machine's contexts if
the paxos machine is in a state that can handle said contexts.

Fixes: #3495

Signed-off-by: Joao Eduardo Luis <>

Revision f726d8e9
Added by Joao Luis almost 2 years ago

mon: Monitor: wake up contexts based on paxos machine's state

When recovering the leader, only wake up a paxos machine's contexts if
the paxos machine is in a state that can handle said contexts.

Fixes: #3495

Signed-off-by: Joao Eduardo Luis <>

Revision 4012514a
Added by Joao Luis almost 2 years ago

mon: Monitor: rework 'paxos' to a list instead of a vector

After adding the gv patches, during Monitor::recovered_leader() we started
waking up contexts following the order of the 'paxos' vector. However,
given that the mdsmon has a forgotten dependency on the osdmon paxos
machine, we were incurring in a situation in which we proposed a value
through the osdmon before creating a new pending value (but by being
active, the mdsmon would go through with it nonetheless).

This is easily fixed by making sure that the mdsmon callbacks are only
awaken after the osdmon has been taken care of.

Fixes: #3495

Signed-off-by: Joao Eduardo Luis <>

Revision e1c27fe1
Added by Joao Luis almost 2 years ago

mon: Monitor: rework 'paxos' to a list instead of a vector

After adding the gv patches, during Monitor::recovered_leader() we started
waking up contexts following the order of the 'paxos' vector. However,
given that the mdsmon has a forgotten dependency on the osdmon paxos
machine, we were incurring in a situation in which we proposed a value
through the osdmon before creating a new pending value (but by being
active, the mdsmon would go through with it nonetheless).

This is easily fixed by making sure that the mdsmon callbacks are only
awaken after the osdmon has been taken care of.

Fixes: #3495

Signed-off-by: Joao Eduardo Luis <>

Revision e1c27fe1
Added by Joao Luis almost 2 years ago

mon: Monitor: rework 'paxos' to a list instead of a vector

After adding the gv patches, during Monitor::recovered_leader() we started
waking up contexts following the order of the 'paxos' vector. However,
given that the mdsmon has a forgotten dependency on the osdmon paxos
machine, we were incurring in a situation in which we proposed a value
through the osdmon before creating a new pending value (but by being
active, the mdsmon would go through with it nonetheless).

This is easily fixed by making sure that the mdsmon callbacks are only
awaken after the osdmon has been taken care of.

Fixes: #3495

Signed-off-by: Joao Eduardo Luis <>

Revision 0b30cfa6
Added by Joao Luis almost 2 years ago

mon: Monitor: rework 'paxos' to a list instead of a vector

After adding the gv patches, during Monitor::recovered_leader() we started
waking up contexts following the order of the 'paxos' vector. However,
given that the mdsmon has a forgotten dependency on the osdmon paxos
machine, we were incurring in a situation in which we proposed a value
through the osdmon before creating a new pending value (but by being
active, the mdsmon would go through with it nonetheless).

This is easily fixed by making sure that the mdsmon callbacks are only
awaken after the osdmon has been taken care of.

Fixes: #3495

Signed-off-by: Joao Eduardo Luis <>

Revision cae2c660
Added by Joao Luis over 1 year ago

mon: MDSMonitor: tighter leash on cross-proposals to the osdmon

Fixes: #3495

Signed-off-by: Joao Eduardo Luis <>

Revision f42fc0e4
Added by Joao Luis over 1 year ago

mon: MDSMonitor: tighter leash on cross-proposals to the osdmon

Fixes: #3495

Signed-off-by: Joao Eduardo Luis <>

History

#1 Updated by Joao Luis almost 2 years ago

  • Category set to Monitor
  • Status changed from New to In Progress
  • Assignee set to Joao Luis
  • Priority changed from Normal to High

Thanks for reporting it. Will look into it.

By the way, this is what the crash looks like:

  -171> 2012-11-15 18:32:33.937581 7f438b594700 -1 mon/PaxosService.cc: In function 'void PaxosService::propose_pending()' thread 7f438b594700 time 2012-11-15 18:32:33.936425
mon/PaxosService.cc: 110: FAILED assert(have_pending)

 ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)
 1: (PaxosService::propose_pending()+0x29b) [0x4bad8b]
 2: (MDSMonitor::tick()+0xda3) [0x4f08e3]
 3: (MDSMonitor::on_active()+0x11) [0x4e8a51]
 4: (PaxosService::_active()+0x22a) [0x4bb02a]
 5: (Context::complete(int)+0xa) [0x49d46a]
 6: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xbc) [0x4a20ac]
 7: (Monitor::recovered_leader(int)+0x36d) [0x49135d]
 8: (Paxos::handle_accept(MMonPaxos*)+0x511) [0x4b5691]
 9: (Paxos::dispatch(PaxosServiceMessage*)+0x20b) [0x4b8d7b]
 10: (Monitor::_ms_dispatch(Message*)+0xb7a) [0x49ca3a]
 11: (Monitor::ms_dispatch(Message*)+0x32) [0x4ae942]
 12: (DispatchQueue::entry()+0x68a) [0x648e1a]
 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x5e505d]
 14: (()+0x6b50) [0x7f4391974b50]
 15: (clone()+0x6d) [0x7f439032570d]

#2 Updated by Joao Luis almost 2 years ago

It appears that this crash is somewhat related with an election, and then something made the 'have_pending' variable to become false after 'PaxosService::_active()'.

Given the log level on the monitor, there is no way to assess if a new proposal was created, but here's the thing: if 'have_pending' was false from the get go, then a new pending would be created in PaxosService::_active(); if it was true, then a new proposal would have to be proposed mid-'::_active()' and make us incur in some sort of weird race -- but there is no evidence that such thing happened.

Looking at the log we notice that, in the minutes before the failed assertion, there was a couple of successive elections:

2012-11-15 11:42:09.520194 7f76b45db700 -1 mon.a@1(peon) e1 *** Got Signal Terminated ***
2012-11-15 11:42:09.520217 7f76b45db700  1 mon.a@1(peon) e1 shutdown

 [monitor starts]

2012-11-15 11:42:22.200331 7f82b63ee780  1 mon.a@-1(probing) e1 init fsid 24f77b4c-ce12-4071-9746-affad8c0c3ba
2012-11-15 11:42:22.297173 7f82b63ee780  0 mon.a@-1(probing) e1  my rank is now 1 (was -1)
2012-11-15 11:42:22.298234 7f82ae9e2700  1 mon.a@1(probing) e1 discarding message auth(proto 0 30 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
2012-11-15 11:42:22.298269 7f82ae9e2700  1 mon.a@1(probing) e1 discarding message auth(proto 0 30 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
 [snip duplicate messages]

2012-11-15 11:42:26.299568 7f82ae9e2700  0 log [INF] : mon.a calling new monitor election
 [snip more discarded auth messages]

 [election finished, mon.a is a peon]
2012-11-15 11:42:36.879704 7f82ae9e2700 -1 mon.a@1(peon).paxos(logm updating c 2693275..2693776) lease_expire from mon.0 85.25.26.225:6789/0 is 0.765689 seconds in the past; mons are laggy or clocks are too skewed
2012-11-15 11:42:38.163912 7f82ae9e2700 -1 mon.a@1(peon).paxos(auth updating c 5710..5731) lease_expire from mon.0 85.25.26.225:6789/0 is 0.607050 seconds in the past; mons are laggy or clocks are too skewed
2012-11-15 11:42:38.215428 7f82ae9e2700 -1 mon.a@1(peon).paxos(pgmap active c 2639017..2639518) lease_expire from mon.0 85.25.26.225:6789/0 is 0.486926 seconds in the past; mons are laggy or clocks are too skewed
2012-11-15 11:42:38.215598 7f82ae9e2700 -1 mon.a@1(peon).paxos(mdsmap active c 1..5608) lease_expire from mon.0 85.25.26.225:6789/0 is 0.487026 seconds in the past; mons are laggy or clocks are too skewed
2012-11-15 11:42:41.804620 7f82ae9e2700  1 mon.a@1(peon).osd e11052 e11052: 4 osds: 3 up, 4 in
2012-11-15 11:42:45.802772 7f82ae9e2700  1 mon.a@1(peon).osd e11053 e11053: 4 osds: 2 up, 4 in
2012-11-15 11:42:48.219469 7f82ae9e2700  1 mon.a@1(peon).osd e11054 e11054: 4 osds: 4 up, 4 in
2012-11-15 11:42:50.223239 7f82ae9e2700  1 mon.a@1(peon).osd e11055 e11055: 4 osds: 3 up, 4 in
2012-11-15 11:43:10.140877 7f82ae9e2700  0 log [INF] : mon.a calling new monitor election
2012-11-15 11:43:16.255093 7f82afcf5700  0 -- 188.138.84.236:6789/0 >> 85.25.26.225:6789/0 pipe(0x1d3d8c0 sd=24 :6789 pgs=12 cs=1 l=0).fault with nothing to send, going to standby
2012-11-15 11:43:18.312369 7f82ae9e2700  1 mon.a@1(peon).osd e11056 e11056: 4 osds: 4 up, 4 in
2012-11-15 11:43:19.007349 7f82ae9e2700  1 mon.a@1(peon).osd e11057 e11057: 4 osds: 4 up, 4 in
2012-11-15 11:43:26.848048 7f82ae9e2700  0 log [INF] : mon.a calling new monitor election

 [new election underway]

2012-11-15 11:43:42.271375 7f82ae9e2700  1 mon.a@1(electing) e1 discarding message mon_subscribe({monmap=2+,osdmap=11058}) v2 and sending client elsewhere; we are not in quorum
2012-11-15 11:43:47.373292 7f82ae1e1700  0 log [INF] : mon.a@1 won leader election with quorum 1,2

Soon after, the assertion failed.

Given the code path, there's a chance that a Paxos proposal was underway too. The only potential reference to this comes in the form of a 'OSDMonitor::prepare_failure()' moments before the crash:

2012-11-15 11:43:56.324468 7f82ae9e2700  1 mon.a@1(leader).osd e11058 prepare_failure osd.0 188.138.84.236:6801/12408 from osd.3 85.25.99.55:6800/29613 is reporting failure:1
2012-11-15 11:43:56.324486 7f82ae9e2700  0 log [DBG] : osd.0 188.138.84.236:6801/12408 reported failed by osd.3 85.25.99.55:6800/29613

But given that we have no more informations than this, the only thing that makes sense is that there was an on-going MDSMonitor proposal when we woke everyone up in 'Monitor::recovered_leader()'. I'm working on a patch that will make sure that we only wake up a paxos machine's callbacks when the paxos machine is active, and this should not happen again, but will be tough to confirm this as the root cause.

#3 Updated by Joao Luis almost 2 years ago

  • Status changed from In Progress to Need Review

wip-mon-misc-fixes contains commit 96b82ebf870bda5af33624f698785bac6ae78234 that should fix this bug.

The approach is basically to close any window that may lead to awakening contexts waiting on any one of the queues (such as waiting_for_active) while a paxos machine is not in a state that could cope with the contexts being woken up.

This appears to be the reason for this bug. For instance, say that we have a 'C_Active' context waiting on the MDSMonitor::waiting_for_active. Now say that the MDSMonitor is about to start a proposal. Now imagine that another paxos machine (say, the OSDMonitor) finishes its proposal, ends up in Monitor::recovered_leader() and starts waking up all the contexts in all the Paxos machines. If it awakes that 'C_Active' context on the MDSMonitor at the wrong time, nasty things may happen.

This patch may or may not fix this issue (as we have no way to confirm using the logs), but it certainly fixes some issue that may or may not have happened yet.

#4 Updated by Artem Grinblat almost 2 years ago

Joao, what kind of ceph configuration would make the logs more useful in the future?
Ceph crashes fairly often here, maybe I would be able to reproduce the issue with a more detailed logs. See screenshots.

#5 Updated by Joao Luis almost 2 years ago

  • Status changed from Need Review to Resolved

#6 Updated by Joao Luis almost 2 years ago

Sorry, marked this as resolved and only noticed the last comment afterwards (silly me).

For the monitor, a good log level of 20 for the monitor and a 10 for ms are probably the most useful in terms of the information gathered, but they will also grow your logs considerably. If you are not able to cope with that amount of logging, then we can obtain some fair amount of infos from a debug level of 10 on the monitor and ms 1.

In terms of configuration, that would be something like

[mon]
  debug mon = 20 (or 10)
  debug ms = 10 (or 1)

I'm pretty confident that the patch will fix this issue, but if you happen to stumble across this bug again, please feel free to poke me ;-)

#7 Updated by Artem Grinblat almost 2 years ago

"FAILED assert(have_pending)" still happens with Ceph 0.55.
I have attached the logs.

#8 Updated by Joao Luis almost 2 years ago

  • Status changed from Resolved to In Progress

Thanks for following up on this. And thanks for cranking up the debug levels! :-)

#9 Updated by Joao Luis almost 2 years ago

Artem,

The good news is that, given the bug being triggered yet again on the same spot, it would appear to be localized to the MDSMonitor; the not-so-good news is that I'm still looking for the root cause.

Are you doing anything in particular to trigger this bug? I would love to be able to reproduce it. Also, if you could set 'debug paxos = 10' on the monitor it would be great, as it would allow me to pinpoint what is causing the 'have_pending' to turn false -- debugging paxos usually isn't relevant enough to be worth to spam your logs, thus why I didn't suggested it earlier, but in this case it would be nice to have it.

#10 Updated by Artem Grinblat almost 2 years ago

I'm not doing anything special, the usual librados uploads and removes. There is also a recovery in background: "recovery 178902/6622808 degraded (2.701%)".
I'll upload the 'debug paxos = 10' logs if (when) the mon crashes again.

#11 Updated by Artem Grinblat almost 2 years ago

Are you doing anything in particular to trigger this bug?

With Ceph 0.54 I've noticed that

      shared_ptr<std::forward_list<gstring>> ids (std::make_shared<std::forward_list<gstring>>());
      for (auto it = rados.second->objects_begin(), end = rados.second->objects_end(); it != end; ++it) {
        ids->emplace_front (it->first);
        if (++*cephTotal % 100 == 0) std::this_thread::sleep_for (std::chrono::milliseconds (valgrind ? 200 : 20)); // Be nice to Ceph.
      }

always fails sooner or later, even with the loop running against a thread-local Ceph connections, whereas "rados -p {pool} ls" (which does exactly the same loop) - works. So I have a suspicion that Ceph isn't tested that well when the same process uses librados from multiple threads. Don't know if it is related to this bug or not.

#12 Updated by Artem Grinblat almost 2 years ago

(As a workaround I'm now doing "nice rados -p {pool} ls > {pool}.ids" and reading ids from the file).

#13 Updated by Josh Durgin almost 2 years ago

Could you open a separate bug with the crash you see with the multithreaded object listing? There is some shared state for IoCtxs that shouldn't be relied on from multiple threads, but listing objects shouldn't hit that.

#14 Updated by Artem Grinblat almost 2 years ago

Could you open a separate bug with the crash you see with the multithreaded object listing?

I might be able later.
What would be the best logging settings for that report? Thanks.

#15 Updated by Josh Durgin almost 2 years ago

debug ms = 1
debug objecter = 20
debug rados = 20

#16 Updated by Artem Grinblat almost 2 years ago

Attached is the file with a paxos=10 log.

#17 Updated by Joao Luis almost 2 years ago

Thanks Artem. Your log confirmed my suspicions regarding to what the bug was.

If you have the chance, could you try commit 46c55a19e72ae1f16037807fda8abe59b671f7d3 on wip-3495? It should fix the issue, but I haven't been able to reproduce the bug in order to be 100% sure.

#18 Updated by Artem Grinblat almost 2 years ago

If you have the chance, could you try commit 46c55a19e72ae1f16037807fda8abe59b671f7d3 on wip-3495? It should fix the issue, but I haven't been able to reproduce the bug in order to be 100% sure.

Joao, thanks for the fix!
I'd prefer to wait for 0.56.

#19 Updated by Joao Luis almost 2 years ago

Sure. I'll try to reproduce this bug again before bobtail is released. From the looks of it, messing with the MDS enough while bringing the cluster up ought to trigger it, so I might be able to do it with.

Thanks for reporting it and all your help tracking it!

#20 Updated by Joao Luis almost 2 years ago

Managed to trigger this bug using

$ while [ 1 ]; do ./init-ceph restart mon.a ; sleep 30 ; done
$ while [ 1 ]; do ./init-ceph restart mds.a ; sleep 2 ; done
$ while [ 1 ]; do ./init-ceph restart osd.1 ; sleep 2 ; done

The bug was triggered on mon.b, everything running v0.55.

#21 Updated by Matthew Via almost 2 years ago

I was encountering this particular bug, causing all my monitors to crash shortly after start. I patched my 0.55 packages with the patch from wip-3495, and have not encountered a single crash under heavy load for about 15 hours.

#22 Updated by Joao Luis almost 2 years ago

Good to know Matthew. I haven't been able to reproduce it either since using a patched version.

And the patch has been merged onto next, so I'm marking the bug as resolved.

#23 Updated by Joao Luis almost 2 years ago

  • Status changed from In Progress to Resolved

#24 Updated by Joao Luis almost 2 years ago

  • Status changed from Resolved to In Progress

This bug occurred again on 0.55.1, as described on bug #3618. Re-opening as it appears that there's still something else going on.

#25 Updated by Joao Luis almost 2 years ago

  • Status changed from In Progress to Resolved

Sorry, jumped the gun here. Thought the fix was on 0.55.1 but it is not. It's on next though.

#26 Updated by Joao Luis over 1 year ago

  • Status changed from Resolved to In Progress

Matthew Roy, imjustmattew on IRC, stumbled upon this bug yet again, this time on 0.60.

This was reintroduced with the monitor rework which basically undid the patch introduced with e1c27fe178bb76b2f3127876b868968730bdb074 and I suspect the solution is to redo the patch.

#27 Updated by Joao Luis over 1 year ago

Well, my above comment is not entirely true. The patch itself was not undone, but rather the patch's focus was on the multiple Paxos instances that were used back then. Currently, we only have one Paxos instance, so we should focus on the services.

While the obvious workaround is to force a given order on how we poke the services, in order to avoid the MDSMonitor invoking the OSDMonitor's propose_pending() before the OSDMonitor has been poked. This however seems wrong in the sense that we should instead force the MDSMonitor to wait for the OSDMonitor to become active prior to performing write operations.

#28 Updated by Joao Luis over 1 year ago

  • Status changed from In Progress to Need Review

Pushed wip-3495 to github. So far, the fix appears to be holding up without breaking anything else.

I'm leaving the test running for the night and should be more certain by the morning.

In the meantime, reviews are welcome.

#29 Updated by Joao Luis over 1 year ago

Fix has been holding for the last 12 hours or so.

#30 Updated by Matthew Roy over 1 year ago

Is the gitbuilder build from that WIP branch fairly safe to test until 0.61 (or will using it horrifically break the upgrade path to 0.61?)

Obviously there are no warranties or guarantees express or implied :)

#31 Updated by Sage Weil over 1 year ago

should be safe!

#32 Updated by Matthew Roy over 1 year ago

After running for a few hours the monitors are not dying! I think you might have got it. I'll update this after it runs for a day or so.

#33 Updated by Joao Luis over 1 year ago

Thanks for the update!

#34 Updated by Matthew Roy over 1 year ago

This has been stable for me for >24-hours. I think you've got it. Thanks for all your help!

#35 Updated by Denis kaganovich over 1 year ago

Fixme if I use wrong branch, but:

0> 2013-04-19 19:06:29.120708 7fb556116700 -1 mon/PaxosService.cc: In function 'void PaxosService::propose_pending()' thread 7fb556116700 time 2013-04-19 19:06:29.119272
mon/PaxosService.cc: 127: FAILED assert(have_pending)
ceph version 0.60-524-g46d8b9f (46d8b9f2c6e36d988e39b116785c7be76af69f10)
1: (PaxosService::propose_pending()+0x35b) [0x513f4b]
2: (MDSMonitor::tick()+0x94c) [0x552e0c]
3: (MDSMonitor::on_active()+0x11) [0x54e9a1]
4: (PaxosService::_active()+0x250) [0x514230]
5: (Context::complete(int)+0xa) [0x4dc63a]
6: (finish_contexts(CephContext*, std::list&lt;Context*, std::allocator&lt;Context*&gt; >&, int)+0xa6) [0x4e3c06]
7: (Paxos::handle_last(MMonPaxos*)+0x93e) [0x50d90e]
8: (Paxos::dispatch(PaxosServiceMessage*)+0x1bb) [0x50ea4b]
9: (Monitor::_ms_dispatch(Message*)+0xde4) [0x4dbf14]
10: (Monitor::ms_dispatch(Message*)+0x32) [0x4ff2a2]
11: (DispatchQueue::entry()+0x33a) [0x70efda]
12: (DispatchQueue::DispatchThread::entry()+0xd) [0x6892ad]
13: (()+0x84bd) [0x7fb55be594bd]
14: (clone()+0x6d) [0x7fb55a0b6f7d]

#36 Updated by Joao Luis over 1 year ago

Denis, you appear to be using master [1]; the fix is only available on wip-3495.

[1]:

joao@tardis:~/inktank/src/ceph/src$ git branch -a --contains 46d8b9f2c6e36d988e39b116785c7be76af69f10
  remotes/gh/HEAD -> gh/master
  remotes/gh/master

#37 Updated by Joao Luis over 1 year ago

  • Status changed from Need Review to Testing

#38 Updated by Sage Weil over 1 year ago

  • Priority changed from High to Urgent

#39 Updated by Greg Farnum over 1 year ago

  • Status changed from Testing to Resolved

Also available in: Atom PDF