Bug #3495: ceph-mon crash - Ceph - Ceph

Actions

Copy link

Bug #3495

closed

ceph-mon crash

Added by Artem Grinblat over 11 years ago. Updated almost 11 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Joao Eduardo Luis

Category:

Monitor

Target version:

% Done:

Source:

Community (user)

Tags:

Backport:

Regression:

Severity:

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Have found a core of ceph-mon in /.

Ceph is from "deb http://ceph.com/debian-testing/ wheezy main":

ceph -v
ceph version 0.54 (60b84b095b1009a305d4d6a5b16f88571cbd3150)

Files

Download all files

crash.7z (9.1 MB) crash.7z	core, binary, logs, config	Artem Grinblat, 11/15/2012 07:00 AM
monCrash.tgz (1.29 MB) monCrash.tgz	file created with `ceph-debugpack`	Artem Grinblat, 11/15/2012 07:00 AM
failed_in_logs1.png (78.9 KB) failed_in_logs1.png	Searching for "FAILED assert(have_pending)"	Artem Grinblat, 11/23/2012 11:30 AM
failed_in_logs2.png (43.3 KB) failed_in_logs2.png	Found a lot of occurences in the logs	Artem Grinblat, 11/23/2012 11:30 AM
ceph-mon.3495.7z (25.7 MB) ceph-mon.3495.7z	logs (debug mon = 20; debug ms = 10), core, binary	Artem Grinblat, 12/05/2012 12:56 AM
ceph.3495.rados20.7z (26.1 MB) ceph.3495.rados20.7z	paxos=10 log	Artem Grinblat, 12/05/2012 11:07 AM

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Joao Eduardo Luis over 11 years ago

Category set to Monitor
Status changed from New to In Progress
Assignee set to Joao Eduardo Luis
Priority changed from Normal to High

Thanks for reporting it. Will look into it.

By the way, this is what the crash looks like:

  -171> 2012-11-15 18:32:33.937581 7f438b594700 -1 mon/PaxosService.cc: In function 'void PaxosService::propose_pending()' thread 7f438b594700 time 2012-11-15 18:32:33.936425
mon/PaxosService.cc: 110: FAILED assert(have_pending)

 ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)
 1: (PaxosService::propose_pending()+0x29b) [0x4bad8b]
 2: (MDSMonitor::tick()+0xda3) [0x4f08e3]
 3: (MDSMonitor::on_active()+0x11) [0x4e8a51]
 4: (PaxosService::_active()+0x22a) [0x4bb02a]
 5: (Context::complete(int)+0xa) [0x49d46a]
 6: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xbc) [0x4a20ac]
 7: (Monitor::recovered_leader(int)+0x36d) [0x49135d]
 8: (Paxos::handle_accept(MMonPaxos*)+0x511) [0x4b5691]
 9: (Paxos::dispatch(PaxosServiceMessage*)+0x20b) [0x4b8d7b]
 10: (Monitor::_ms_dispatch(Message*)+0xb7a) [0x49ca3a]
 11: (Monitor::ms_dispatch(Message*)+0x32) [0x4ae942]
 12: (DispatchQueue::entry()+0x68a) [0x648e1a]
 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x5e505d]
 14: (()+0x6b50) [0x7f4391974b50]
 15: (clone()+0x6d) [0x7f439032570d]

Actions

Copy link

Updated by Joao Eduardo Luis over 11 years ago

It appears that this crash is somewhat related with an election, and then something made the 'have_pending' variable to become false after 'PaxosService::_active()'.

Given the log level on the monitor, there is no way to assess if a new proposal was created, but here's the thing: if 'have_pending' was false from the get go, then a new pending would be created in PaxosService::_active(); if it was true, then a new proposal would have to be proposed mid-'::_active()' and make us incur in some sort of weird race -- but there is no evidence that such thing happened.

Looking at the log we notice that, in the minutes before the failed assertion, there was a couple of successive elections:

2012-11-15 11:42:09.520194 7f76b45db700 -1 mon.a@1(peon) e1 *** Got Signal Terminated ***
2012-11-15 11:42:09.520217 7f76b45db700  1 mon.a@1(peon) e1 shutdown

 [monitor starts]

2012-11-15 11:42:22.200331 7f82b63ee780  1 mon.a@-1(probing) e1 init fsid 24f77b4c-ce12-4071-9746-affad8c0c3ba
2012-11-15 11:42:22.297173 7f82b63ee780  0 mon.a@-1(probing) e1  my rank is now 1 (was -1)
2012-11-15 11:42:22.298234 7f82ae9e2700  1 mon.a@1(probing) e1 discarding message auth(proto 0 30 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
2012-11-15 11:42:22.298269 7f82ae9e2700  1 mon.a@1(probing) e1 discarding message auth(proto 0 30 bytes epoch 1) v1 and sending client elsewhere; we are not in quorum
 [snip duplicate messages]

2012-11-15 11:42:26.299568 7f82ae9e2700  0 log [INF] : mon.a calling new monitor election
 [snip more discarded auth messages]

 [election finished, mon.a is a peon]
2012-11-15 11:42:36.879704 7f82ae9e2700 -1 mon.a@1(peon).paxos(logm updating c 2693275..2693776) lease_expire from mon.0 85.25.26.225:6789/0 is 0.765689 seconds in the past; mons are laggy or clocks are too skewed
2012-11-15 11:42:38.163912 7f82ae9e2700 -1 mon.a@1(peon).paxos(auth updating c 5710..5731) lease_expire from mon.0 85.25.26.225:6789/0 is 0.607050 seconds in the past; mons are laggy or clocks are too skewed
2012-11-15 11:42:38.215428 7f82ae9e2700 -1 mon.a@1(peon).paxos(pgmap active c 2639017..2639518) lease_expire from mon.0 85.25.26.225:6789/0 is 0.486926 seconds in the past; mons are laggy or clocks are too skewed
2012-11-15 11:42:38.215598 7f82ae9e2700 -1 mon.a@1(peon).paxos(mdsmap active c 1..5608) lease_expire from mon.0 85.25.26.225:6789/0 is 0.487026 seconds in the past; mons are laggy or clocks are too skewed
2012-11-15 11:42:41.804620 7f82ae9e2700  1 mon.a@1(peon).osd e11052 e11052: 4 osds: 3 up, 4 in
2012-11-15 11:42:45.802772 7f82ae9e2700  1 mon.a@1(peon).osd e11053 e11053: 4 osds: 2 up, 4 in
2012-11-15 11:42:48.219469 7f82ae9e2700  1 mon.a@1(peon).osd e11054 e11054: 4 osds: 4 up, 4 in
2012-11-15 11:42:50.223239 7f82ae9e2700  1 mon.a@1(peon).osd e11055 e11055: 4 osds: 3 up, 4 in
2012-11-15 11:43:10.140877 7f82ae9e2700  0 log [INF] : mon.a calling new monitor election
2012-11-15 11:43:16.255093 7f82afcf5700  0 -- 188.138.84.236:6789/0 >> 85.25.26.225:6789/0 pipe(0x1d3d8c0 sd=24 :6789 pgs=12 cs=1 l=0).fault with nothing to send, going to standby
2012-11-15 11:43:18.312369 7f82ae9e2700  1 mon.a@1(peon).osd e11056 e11056: 4 osds: 4 up, 4 in
2012-11-15 11:43:19.007349 7f82ae9e2700  1 mon.a@1(peon).osd e11057 e11057: 4 osds: 4 up, 4 in
2012-11-15 11:43:26.848048 7f82ae9e2700  0 log [INF] : mon.a calling new monitor election

 [new election underway]

2012-11-15 11:43:42.271375 7f82ae9e2700  1 mon.a@1(electing) e1 discarding message mon_subscribe({monmap=2+,osdmap=11058}) v2 and sending client elsewhere; we are not in quorum
2012-11-15 11:43:47.373292 7f82ae1e1700  0 log [INF] : mon.a@1 won leader election with quorum 1,2

Soon after, the assertion failed.

Given the code path, there's a chance that a Paxos proposal was underway too. The only potential reference to this comes in the form of a 'OSDMonitor::prepare_failure()' moments before the crash:

2012-11-15 11:43:56.324468 7f82ae9e2700  1 mon.a@1(leader).osd e11058 prepare_failure osd.0 188.138.84.236:6801/12408 from osd.3 85.25.99.55:6800/29613 is reporting failure:1
2012-11-15 11:43:56.324486 7f82ae9e2700  0 log [DBG] : osd.0 188.138.84.236:6801/12408 reported failed by osd.3 85.25.99.55:6800/29613

But given that we have no more informations than this, the only thing that makes sense is that there was an on-going MDSMonitor proposal when we woke everyone up in 'Monitor::recovered_leader()'. I'm working on a patch that will make sure that we only wake up a paxos machine's callbacks when the paxos machine is active, and this should not happen again, but will be tough to confirm this as the root cause.

Actions

Copy link

Updated by Joao Eduardo Luis over 11 years ago

Status changed from In Progress to Fix Under Review

wip-mon-misc-fixes contains commit 96b82ebf870bda5af33624f698785bac6ae78234 that should fix this bug.

The approach is basically to close any window that may lead to awakening contexts waiting on any one of the queues (such as waiting_for_active) while a paxos machine is not in a state that could cope with the contexts being woken up.

This appears to be the reason for this bug. For instance, say that we have a 'C_Active' context waiting on the MDSMonitor::waiting_for_active. Now say that the MDSMonitor is about to start a proposal. Now imagine that another paxos machine (say, the OSDMonitor) finishes its proposal, ends up in Monitor::recovered_leader() and starts waking up all the contexts in all the Paxos machines. If it awakes that 'C_Active' context on the MDSMonitor at the wrong time, nasty things may happen.

This patch may or may not fix this issue (as we have no way to confirm using the logs), but it certainly fixes some issue that may or may not have happened yet.

Actions

Copy link Download all files

Updated by Artem Grinblat over 11 years ago

File failed_in_logs1.png failed_in_logs1.png added
File failed_in_logs2.png failed_in_logs2.png added

Joao, what kind of ceph configuration would make the logs more useful in the future?
Ceph crashes fairly often here, maybe I would be able to reproduce the issue with a more detailed logs. See screenshots.

Actions

Copy link

Updated by Joao Eduardo Luis over 11 years ago

Status changed from Fix Under Review to Resolved

Actions

Copy link

Updated by Joao Eduardo Luis over 11 years ago

Sorry, marked this as resolved and only noticed the last comment afterwards (silly me).

For the monitor, a good log level of 20 for the monitor and a 10 for ms are probably the most useful in terms of the information gathered, but they will also grow your logs considerably. If you are not able to cope with that amount of logging, then we can obtain some fair amount of infos from a debug level of 10 on the monitor and ms 1.

In terms of configuration, that would be something like

[mon]
  debug mon = 20 (or 10)
  debug ms = 10 (or 1)

I'm pretty confident that the patch will fix this issue, but if you happen to stumble across this bug again, please feel free to poke me ;-)

Actions

Copy link

Updated by Artem Grinblat over 11 years ago

File ceph-mon.3495.7z ceph-mon.3495.7z added

"FAILED assert(have_pending)" still happens with Ceph 0.55.
I have attached the logs.

Actions

Copy link

Updated by Joao Eduardo Luis over 11 years ago

Status changed from Resolved to In Progress

Thanks for following up on this. And thanks for cranking up the debug levels! :-)

Actions

Copy link

Updated by Joao Eduardo Luis over 11 years ago

Artem,

The good news is that, given the bug being triggered yet again on the same spot, it would appear to be localized to the MDSMonitor; the not-so-good news is that I'm still looking for the root cause.

Are you doing anything in particular to trigger this bug? I would love to be able to reproduce it. Also, if you could set 'debug paxos = 10' on the monitor it would be great, as it would allow me to pinpoint what is causing the 'have_pending' to turn false -- debugging paxos usually isn't relevant enough to be worth to spam your logs, thus why I didn't suggested it earlier, but in this case it would be nice to have it.

Actions

Copy link

#10

Updated by Artem Grinblat over 11 years ago

I'm not doing anything special, the usual librados uploads and removes. There is also a recovery in background: "recovery 178902/6622808 degraded (2.701%)".
I'll upload the 'debug paxos = 10' logs if (when) the mon crashes again.

Actions

Copy link

#11

Updated by Artem Grinblat over 11 years ago

Are you doing anything in particular to trigger this bug?

With Ceph 0.54 I've noticed that

      shared_ptr<std::forward_list<gstring>> ids (std::make_shared<std::forward_list<gstring>>());
      for (auto it = rados.second->objects_begin(), end = rados.second->objects_end(); it != end; ++it) {
        ids->emplace_front (it->first);
        if (++*cephTotal % 100 == 0) std::this_thread::sleep_for (std::chrono::milliseconds (valgrind ? 200 : 20)); // Be nice to Ceph.
      }

always fails sooner or later, even with the loop running against a thread-local Ceph connections, whereas "rados -p {pool} ls" (which does exactly the same loop) - works. So I have a suspicion that Ceph isn't tested that well when the same process uses librados from multiple threads. Don't know if it is related to this bug or not.

Actions

Copy link

#12

Updated by Artem Grinblat over 11 years ago

(As a workaround I'm now doing "nice rados -p {pool} ls > {pool}.ids" and reading ids from the file).

Actions

Copy link

#13

Updated by Josh Durgin over 11 years ago

Could you open a separate bug with the crash you see with the multithreaded object listing? There is some shared state for IoCtxs that shouldn't be relied on from multiple threads, but listing objects shouldn't hit that.

Actions

Copy link

#14

Updated by Artem Grinblat over 11 years ago

Could you open a separate bug with the crash you see with the multithreaded object listing?

I might be able later.
What would be the best logging settings for that report? Thanks.

Actions

Copy link

#15

Updated by Josh Durgin over 11 years ago

debug ms = 1
debug objecter = 20
debug rados = 20

Actions

Copy link

#16

Updated by Artem Grinblat over 11 years ago

File ceph.3495.rados20.7z ceph.3495.rados20.7z added

Attached is the file with a paxos=10 log.

Actions

Copy link

#17

Updated by Joao Eduardo Luis over 11 years ago

Thanks Artem. Your log confirmed my suspicions regarding to what the bug was.

If you have the chance, could you try commit 46c55a19e72ae1f16037807fda8abe59b671f7d3 on wip-3495? It should fix the issue, but I haven't been able to reproduce the bug in order to be 100% sure.

Actions

Copy link

#18

Updated by Artem Grinblat over 11 years ago

If you have the chance, could you try commit 46c55a19e72ae1f16037807fda8abe59b671f7d3 on wip-3495? It should fix the issue, but I haven't been able to reproduce the bug in order to be 100% sure.

Joao, thanks for the fix!
I'd prefer to wait for 0.56.

Actions

Copy link

#19

Updated by Joao Eduardo Luis over 11 years ago

Sure. I'll try to reproduce this bug again before bobtail is released. From the looks of it, messing with the MDS enough while bringing the cluster up ought to trigger it, so I might be able to do it with.

Thanks for reporting it and all your help tracking it!

Actions

Copy link

#20

Updated by Joao Eduardo Luis over 11 years ago

Managed to trigger this bug using

$ while [ 1 ]; do ./init-ceph restart mon.a ; sleep 30 ; done
$ while [ 1 ]; do ./init-ceph restart mds.a ; sleep 2 ; done
$ while [ 1 ]; do ./init-ceph restart osd.1 ; sleep 2 ; done

The bug was triggered on mon.b, everything running v0.55.

Actions

Copy link

#21

Updated by Matthew Via over 11 years ago

I was encountering this particular bug, causing all my monitors to crash shortly after start. I patched my 0.55 packages with the patch from wip-3495, and have not encountered a single crash under heavy load for about 15 hours.

Actions

Copy link

#22

Updated by Joao Eduardo Luis over 11 years ago

Good to know Matthew. I haven't been able to reproduce it either since using a patched version.

And the patch has been merged onto next, so I'm marking the bug as resolved.

Actions

Copy link

#23

Updated by Joao Eduardo Luis over 11 years ago

Status changed from In Progress to Resolved

Actions

Copy link

#24

Updated by Joao Eduardo Luis over 11 years ago

Status changed from Resolved to In Progress

This bug occurred again on 0.55.1, as described on bug #3618. Re-opening as it appears that there's still something else going on.

Actions

Copy link

#25

Updated by Joao Eduardo Luis over 11 years ago

Status changed from In Progress to Resolved

Sorry, jumped the gun here. Thought the fix was on 0.55.1 but it is not. It's on next though.

Actions

Copy link

#26

Updated by Joao Eduardo Luis about 11 years ago

Status changed from Resolved to In Progress

Matthew Roy, imjustmattew on IRC, stumbled upon this bug yet again, this time on 0.60.

This was reintroduced with the monitor rework which basically undid the patch introduced with e1c27fe178bb76b2f3127876b868968730bdb074 and I suspect the solution is to redo the patch.

Actions

Copy link

#27

Updated by Joao Eduardo Luis about 11 years ago

Well, my above comment is not entirely true. The patch itself was not undone, but rather the patch's focus was on the multiple Paxos instances that were used back then. Currently, we only have one Paxos instance, so we should focus on the services.

While the obvious workaround is to force a given order on how we poke the services, in order to avoid the MDSMonitor invoking the OSDMonitor's propose_pending() before the OSDMonitor has been poked. This however seems wrong in the sense that we should instead force the MDSMonitor to wait for the OSDMonitor to become active prior to performing write operations.

Actions

Copy link

#28

Updated by Joao Eduardo Luis about 11 years ago

Status changed from In Progress to Fix Under Review

Pushed wip-3495 to github. So far, the fix appears to be holding up without breaking anything else.

I'm leaving the test running for the night and should be more certain by the morning.

In the meantime, reviews are welcome.

Actions

Copy link

#29

Updated by Joao Eduardo Luis about 11 years ago

Fix has been holding for the last 12 hours or so.

Actions

Copy link

#30

Updated by Matthew Roy about 11 years ago

Is the gitbuilder build from that WIP branch fairly safe to test until 0.61 (or will using it horrifically break the upgrade path to 0.61?)

Obviously there are no warranties or guarantees express or implied :)

Actions

Copy link

#31

Updated by Sage Weil about 11 years ago

should be safe!

Actions

Copy link

#32

Updated by Matthew Roy about 11 years ago

After running for a few hours the monitors are not dying! I think you might have got it. I'll update this after it runs for a day or so.

Actions

Copy link

#33

Updated by Joao Eduardo Luis about 11 years ago

Thanks for the update!

Actions

Copy link

#34

Updated by Matthew Roy about 11 years ago

This has been stable for me for >24-hours. I think you've got it. Thanks for all your help!

Actions

Copy link

#35

Updated by Denis kaganovich about 11 years ago

Fixme if I use wrong branch, but:

0> 2013-04-19 19:06:29.120708 7fb556116700 -1 mon/PaxosService.cc: In function 'void PaxosService::propose_pending()' thread 7fb556116700 time 2013-04-19 19:06:29.119272
mon/PaxosService.cc: 127: FAILED assert(have_pending)

ceph version 0.60-524-g46d8b9f (46d8b9f2c6e36d988e39b116785c7be76af69f10)
 1: (PaxosService::propose_pending()+0x35b) [0x513f4b]
 2: (MDSMonitor::tick()+0x94c) [0x552e0c]
 3: (MDSMonitor::on_active()+0x11) [0x54e9a1]
 4: (PaxosService::_active()+0x250) [0x514230]
 5: (Context::complete(int)+0xa) [0x4dc63a]
 6: (finish_contexts(CephContext*, std::list&lt;Context*, std::allocator&lt;Context*&gt; >&, int)+0xa6) [0x4e3c06]
 7: (Paxos::handle_last(MMonPaxos*)+0x93e) [0x50d90e]
 8: (Paxos::dispatch(PaxosServiceMessage*)+0x1bb) [0x50ea4b]
 9: (Monitor::_ms_dispatch(Message*)+0xde4) [0x4dbf14]
 10: (Monitor::ms_dispatch(Message*)+0x32) [0x4ff2a2]
 11: (DispatchQueue::entry()+0x33a) [0x70efda]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x6892ad]
 13: (()+0x84bd) [0x7fb55be594bd]
 14: (clone()+0x6d) [0x7fb55a0b6f7d]

Actions

Copy link

#36

Updated by Joao Eduardo Luis about 11 years ago

Denis, you appear to be using master [1]; the fix is only available on wip-3495.

[1]:

joao@tardis:~/inktank/src/ceph/src$ git branch -a --contains 46d8b9f2c6e36d988e39b116785c7be76af69f10
  remotes/gh/HEAD -> gh/master
  remotes/gh/master

Actions

Copy link

#37

Updated by Joao Eduardo Luis about 11 years ago

Status changed from Fix Under Review to 7

Actions

Copy link

#38

Updated by Sage Weil almost 11 years ago

Priority changed from High to Urgent

Actions

Copy link

#39

Updated by Greg Farnum almost 11 years ago

Status changed from 7 to Resolved

Merged into next with 426e3be64e851947b288e43bc0ee932ae7f214bb

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph

Custom queries

Bug #3495

ceph-mon crash

Updated by Joao Eduardo Luis over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Artem Grinblat over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Artem Grinblat over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Artem Grinblat over 11 years ago

Updated by Artem Grinblat over 11 years ago

Updated by Artem Grinblat over 11 years ago

Updated by Josh Durgin over 11 years ago

Updated by Artem Grinblat over 11 years ago

Updated by Josh Durgin over 11 years ago

Updated by Artem Grinblat over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Artem Grinblat over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Matthew Via over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Joao Eduardo Luis over 11 years ago

Updated by Joao Eduardo Luis about 11 years ago

Updated by Joao Eduardo Luis about 11 years ago

Updated by Joao Eduardo Luis about 11 years ago

Updated by Joao Eduardo Luis about 11 years ago

Updated by Matthew Roy about 11 years ago

Updated by Sage Weil about 11 years ago

Updated by Matthew Roy about 11 years ago

Updated by Joao Eduardo Luis about 11 years ago

Updated by Matthew Roy about 11 years ago

Updated by Denis kaganovich about 11 years ago

Updated by Joao Eduardo Luis about 11 years ago

Updated by Joao Eduardo Luis about 11 years ago

Updated by Sage Weil almost 11 years ago

Updated by Greg Farnum almost 11 years ago