Bug #1130

monitor crash in PaxosService:propose_pending()

Added by Sam Lang over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Monitor
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

While doing some failure testing, one of the ceph monitors crashed. I have 6 osds, 3 monitors, and 3 mds servers running on my setup. I've been periodically killing osd processes on various nodes and restarting them to see how ceph responds. I'm using the fuse module and the stable branch of git (commit 7330c3c473aa128b1e3ecb8752278f655bc79620). This part of the log includes the stacktrace and what the monitor was doing before the crash:

2011-06-01 15:19:42.242826 7f651498e700 -- 192.168.60.134:6789/0 >> 192.168.60.134:0/1677834944 pipe(0x7f65046ab7b0 sd=7 pgs=0 cs=0 l=0).accept peer addr is really 192.168.60.134:0/1677834944 (socket is 192.168.60.134:59836/0)
2011-06-01 16:01:21.265302 7f651689f700 mon.alpha@0(leader).osd e496 prepare_failure osd0 192.168.60.134:6801/7429 from osd4 192.168.60.132:6800/32447 is reporting failure:1
2011-06-01 16:01:21.274061 7f651689f700 log [INF] : osd0 192.168.60.134:6801/7429 failed (by osd4 192.168.60.132:6800/32447)
2011-06-01 16:01:22.505577 7f651689f700 mon.alpha@0(leader).osd e496 prepare_failure osd0 192.168.60.134:6801/7429 from osd3 192.168.60.129:6801/909 is reporting failure:1
2011-06-01 16:01:22.505672 7f651689f700 log [INF] : osd0 192.168.60.134:6801/7429 failed (by osd3 192.168.60.129:6801/909)
2011-06-01 16:01:23.398247 7f651689f700 mon.alpha@0(leader).osd e496 prepare_failure osd0 192.168.60.134:6801/7429 from osd2 192.168.60.104:6801/2177 is reporting failure:1
2011-06-01 16:01:23.398307 7f651689f700 log [INF] : osd0 192.168.60.134:6801/7429 failed (by osd2 192.168.60.104:6801/2177)
2011-06-01 16:01:23.398356 7f651689f700 mon.alpha@0(leader).osd e496 have enough reports/reporters to mark osd0 as down
2011-06-01 16:01:23.399133 7f651689f700 mon.alpha@0(leader).osd e496 prepare_failure osd0 192.168.60.134:6801/7429 from osd1 192.168.60.109:6801/2487 is reporting failure:1
2011-06-01 16:01:23.399183 7f651689f700 log [INF] : osd0 192.168.60.134:6801/7429 failed (by osd1 192.168.60.109:6801/2487)
2011-06-01 16:01:23.640799 7f651689f700 mon.alpha@0(leader).osd e497 e497: 6 osds: 5 up, 6 in
2011-06-01 16:01:25.673808 7f651689f700 mon.alpha@0(leader).osd e498 e498: 6 osds: 5 up, 6 in
2011-06-01 16:01:27.546309 7f651689f700 mon.alpha@0(leader).osd e499 e499: 6 osds: 5 up, 6 in
2011-06-01 16:06:25.843100 7f651609e700 log [INF] : osd0 out (down for 302.147264)
2011-06-01 16:06:26.046738 7f651689f700 mon.alpha@0(leader).osd e500 e500: 6 osds: 5 up, 5 in
2011-06-01 16:06:27.835389 7f651689f700 mon.alpha@0(leader).osd e501 e501: 6 osds: 5 up, 5 in
2011-06-01 16:06:28.809796 7f651689f700 mon.alpha@0(leader).osd e502 e502: 6 osds: 5 up, 5 in
2011-06-01 16:06:30.016019 7f651689f700 mon.alpha@0(leader).osd e503 e503: 6 osds: 5 up, 5 in
2011-06-01 16:06:31.223714 7f651689f700 mon.alpha@0(leader).osd e504 e504: 6 osds: 5 up, 5 in
2011-06-01 16:06:33.107593 7f651689f700 mon.alpha@0(leader).osd e505 e505: 6 osds: 5 up, 5 in
2011-06-01 16:06:35.815903 7f651689f700 mon.alpha@0(leader).osd e506 e506: 6 osds: 5 up, 5 in
2011-06-01 16:06:37.945246 7f651689f700 mon.alpha@0(leader).osd e507 e507: 6 osds: 5 up, 5 in
2011-06-01 16:06:40.229770 7f651689f700 mon.alpha@0(leader).osd e508 e508: 6 osds: 5 up, 5 in
2011-06-01 16:06:42.648366 7f651689f700 mon.alpha@0(leader).osd e509 e509: 6 osds: 5 up, 5 in
2011-06-01 16:06:44.909051 7f651689f700 mon.alpha@0(leader).osd e510 e510: 6 osds: 5 up, 5 in
2011-06-01 16:06:46.472116 7f651689f700 mon.alpha@0(leader).osd e511 e511: 6 osds: 5 up, 5 in
2011-06-01 16:06:48.333818 7f651689f700 mon.alpha@0(leader).osd e512 e512: 6 osds: 5 up, 5 in
2011-06-01 16:06:50.564448 7f651689f700 mon.alpha@0(leader).osd e513 e513: 6 osds: 5 up, 5 in
2011-06-01 16:06:53.374415 7f651689f700 mon.alpha@0(leader).osd e514 e514: 6 osds: 5 up, 5 in
2011-06-01 16:38:38.473470 7f651689f700 mon.alpha@0(leader).osd e514 prepare_failure osd2 192.168.60.104:6801/2177 from osd1 192.168.60.109:6801/2487 is reporting failure:1
2011-06-01 16:38:38.473565 7f651689f700 log [INF] : osd2 192.168.60.104:6801/2177 failed (by osd1 192.168.60.109:6801/2487)
2011-06-01 16:38:39.782166 7f651689f700 mon.alpha@0(leader).osd e514 prepare_failure osd2 192.168.60.104:6801/2177 from osd3 192.168.60.129:6801/909 is reporting failure:1
2011-06-01 16:38:39.782230 7f651689f700 log [INF] : osd2 192.168.60.104:6801/2177 failed (by osd3 192.168.60.129:6801/909)
2011-06-01 16:38:41.698370 7f651689f700 mon.alpha@0(leader).osd e514 prepare_failure osd2 192.168.60.104:6801/2177 from osd4 192.168.60.132:6800/32447 is reporting failure:1
2011-06-01 16:38:41.698442 7f651689f700 log [INF] : osd2 192.168.60.104:6801/2177 failed (by osd4 192.168.60.132:6800/32447)
2011-06-01 16:38:41.698516 7f651689f700 mon.alpha@0(leader).osd e514 have enough reports/reporters to mark osd2 as down
2011-06-01 16:38:41.947293 7f651689f700 mon.alpha@0(leader).osd e515 e515: 6 osds: 4 up, 5 in
2011-06-01 16:38:43.335429 7f651689f700 mon.alpha@0(leader).osd e516 e516: 6 osds: 4 up, 5 in
2011-06-01 16:38:55.539341 7f651689f700 mon.alpha@0(leader).osd e517 e517: 6 osds: 5 up, 5 in
2011-06-01 16:38:55.618491 7f651689f700 log [INF] : osd2 192.168.60.104:6800/14746 boot
2011-06-01 16:38:57.061877 7f651689f700 mon.alpha@0(leader).osd e518 e518: 6 osds: 5 up, 5 in
2011-06-01 16:38:58.936829 7f651689f700 mon.alpha@0(leader).osd e519 e519: 6 osds: 5 up, 5 in
2011-06-01 16:57:05.928007 7f651689f700 mon.alpha@0(leader).osd e519 prepare_failure osd2 192.168.60.104:6800/14746 from osd4 192.168.60.132:6800/32447 is reporting failure:1
2011-06-01 16:57:05.928102 7f651689f700 log [INF] : osd2 192.168.60.104:6800/14746 failed (by osd4 192.168.60.132:6800/32447)
2011-06-01 16:57:06.297512 7f651689f700 mon.alpha@0(leader).osd e519 prepare_failure osd2 192.168.60.104:6800/14746 from osd5 192.168.60.135:6800/19249 is reporting failure:1
2011-06-01 16:57:06.297541 7f651689f700 log [INF] : osd2 192.168.60.104:6800/14746 failed (by osd5 192.168.60.135:6800/19249)
2011-06-01 16:57:06.955549 7f651689f700 mon.alpha@0(leader).osd e519 prepare_failure osd2 192.168.60.104:6800/14746 from osd3 192.168.60.129:6801/909 is reporting failure:1
2011-06-01 16:57:06.955603 7f651689f700 log [INF] : osd2 192.168.60.104:6800/14746 failed (by osd3 192.168.60.129:6801/909)
2011-06-01 16:57:06.955652 7f651689f700 mon.alpha@0(leader).osd e519 have enough reports/reporters to mark osd2 as down
2011-06-01 16:57:07.222548 7f651689f700 mon.alpha@0(leader).osd e520 e520: 6 osds: 4 up, 5 in
2011-06-01 16:57:08.555269 7f651689f700 mon.alpha@0(leader).osd e521 e521: 6 osds: 4 up, 5 in
2011-06-01 16:58:07.757024 7f651689f700 mon.alpha@0(leader).osd e521 prepare_failure osd3 192.168.60.129:6801/909 from osd1 192.168.60.109:6801/2487 is reporting failure:1
2011-06-01 16:58:07.757134 7f651689f700 log [INF] : osd3 192.168.60.129:6801/909 failed (by osd1 192.168.60.109:6801/2487)
2011-06-01 16:58:07.951974 7f651689f700 mon.alpha@0(leader).osd e521 prepare_failure osd3 192.168.60.129:6801/909 from osd4 192.168.60.132:6800/32447 is reporting failure:1
2011-06-01 16:58:07.952058 7f651689f700 log [INF] : osd3 192.168.60.129:6801/909 failed (by osd4 192.168.60.132:6800/32447)
2011-06-01 16:58:08.176448 7f651689f700 mon.alpha@0(leader).osd e521 prepare_failure osd3 192.168.60.129:6801/909 from osd5 192.168.60.135:6800/19249 is reporting failure:1
2011-06-01 16:58:08.176494 7f651689f700 log [INF] : osd3 192.168.60.129:6801/909 failed (by osd5 192.168.60.135:6800/19249)
2011-06-01 16:58:08.176518 7f651689f700 mon.alpha@0(leader).osd e521 have enough reports/reporters to mark osd3 as down
2011-06-01 16:58:08.400491 7f651689f700 mon.alpha@0(leader).osd e522 e522: 6 osds: 3 up, 5 in
2011-06-01 16:58:09.545010 7f651689f700 mon.alpha@0(leader).osd e523 e523: 6 osds: 3 up, 5 in
2011-06-01 16:58:39.112888 7f651689f700 mon.alpha@0(leader).osd e523 prepare_failure osd4 192.168.60.132:6800/32447 from osd5 192.168.60.135:6800/19249 is reporting failure:1
2011-06-01 16:58:39.112984 7f651689f700 log [INF] : osd4 192.168.60.132:6800/32447 failed (by osd5 192.168.60.135:6800/19249)
2011-06-01 16:58:39.546901 7f651689f700 mon.alpha@0(leader).osd e523 prepare_failure osd4 192.168.60.132:6800/32447 from osd1 192.168.60.109:6801/2487 is reporting failure:1
2011-06-01 16:58:39.546943 7f651689f700 log [INF] : osd4 192.168.60.132:6800/32447 failed (by osd1 192.168.60.109:6801/2487)
2011-06-01 16:58:44.113835 7f651689f700 mon.alpha@0(leader).osd e523 prepare_failure osd4 192.168.60.132:6800/32447 from osd5 192.168.60.135:6800/19249 is reporting failure:1
2011-06-01 16:58:44.113923 7f651689f700 log [INF] : osd4 192.168.60.132:6800/32447 failed (by osd5 192.168.60.135:6800/19249)
2011-06-01 16:58:44.113964 7f651689f700 mon.alpha@0(leader).osd e523 have enough reports/reporters to mark osd4 as down
2011-06-01 16:58:44.371904 7f651689f700 mon.alpha@0(leader).osd e524 e524: 6 osds: 2 up, 5 in
2011-06-01 16:58:45.682595 7f651689f700 mon.alpha@0(leader).osd e525 e525: 6 osds: 2 up, 5 in
2011-06-01 17:00:19.863262 7f651689f700 mon.alpha@0(leader).osd e526 e526: 6 osds: 3 up, 5 in
2011-06-01 17:00:19.953792 7f651689f700 log [INF] : osd4 192.168.60.132:6800/22996 boot
2011-06-01 17:00:21.241563 7f651689f700 mon.alpha@0(leader).osd e527 e527: 6 osds: 3 up, 5 in
2011-06-01 17:00:47.710336 7f651689f700 mon.alpha@0(leader).osd e528 e528: 6 osds: 4 up, 5 in
2011-06-01 17:00:47.800673 7f651689f700 log [INF] : osd3 192.168.60.129:6800/8519 boot
2011-06-01 17:00:49.477564 7f651689f700 mon.alpha@0(leader).osd e529 e529: 6 osds: 4 up, 5 in
2011-06-01 17:00:51.962324 7f651689f700 mon.alpha@0(leader).osd e530 e530: 6 osds: 4 up, 5 in
2011-06-01 17:01:11.383223 7f651689f700 mon.alpha@0(leader).osd e531 e531: 6 osds: 5 up, 5 in
2011-06-01 17:01:11.473654 7f651689f700 log [INF] : osd2 192.168.60.104:6800/15339 boot
2011-06-01 17:01:13.084505 7f651689f700 mon.alpha@0(leader).osd e532 e532: 6 osds: 5 up, 5 in
2011-06-01 17:01:14.359914 7f651689f700 mon.alpha@0(leader).osd e533 e533: 6 osds: 5 up, 5 in
2011-06-01 17:02:28.750063 7f651689f700 mon.alpha@0(leader).osd e534 e534: 6 osds: 4 up, 5 in
2011-06-01 17:02:30.001979 7f651689f700 mon.alpha@0(leader).osd e535 e535: 6 osds: 5 up, 5 in
2011-06-01 17:02:30.081251 7f651689f700 log [INF] : osd5 192.168.60.135:6800/32009 boot
2011-06-01 17:02:31.502846 7f651689f700 mon.alpha@0(leader).osd e536 e536: 6 osds: 5 up, 5 in
2011-06-01 17:02:33.908873 7f651689f700 mon.alpha@0(leader).osd e537 e537: 6 osds: 5 up, 5 in
2011-06-01 17:02:37.108742 7f651689f700 mon.alpha@0(leader).osd e538 e538: 6 osds: 5 up, 5 in
2011-06-01 18:09:19.781399 7f651689f700 mon.alpha@0(leader).osd e539 e539: 6 osds: 6 up, 6 in
2011-06-01 18:09:20.038401 7f651689f700 log [INF] : osd0 192.168.60.134:6800/14892 boot
2011-06-01 18:09:21.929423 7f651609e700 mon.alpha@0(leader).osd e539 OSDMonitor::handle_osd_timeouts: last got MOSDPGStat info from osd 0 at 2011-06-01 15:51:07.773635. It has been 8294.155758, so we're marking it down!
../../src/mon/PaxosService.cc: In function 'void PaxosService::propose_pending()', in thread '0x7f651609e700'
../../src/mon/PaxosService.cc: 126: FAILED assert(have_pending)
ceph version (commit:)
1: (PaxosService::propose_pending()+0x265) [0x498f55]
2: (OSDMonitor::handle_osd_timeouts(utime_t const&, std::map<int, utime_t, std::less<int>, std::allocator<std::pair<int const, utime_t> > > const&)+0x3be) [0x49c9ee]
3: (PGMonitor::handle_osd_timeouts()+0xef) [0x4d54df]
4: (PGMonitor::tick()+0x3e) [0x4e088e]
5: (Monitor::tick()+0x45) [0x478cc5]
6: (SafeTimer::timer_thread()+0x29b) [0x57b6eb]
7: (SafeTimerThread::entry()+0xd) [0x57cc0d]
8: (()+0x6d8c) [0x7f651934fd8c]
9: (clone()+0x6d) [0x7f65188fa04d]
*** Caught signal (Aborted) ***
 in thread 0x7f651609e700
 ceph version (commit:)
 1: /usr/ceph/bin/cmon() [0x5a1d39]
 2: (()+0xfc60) [0x7f6519358c60]
 3: (gsignal()+0x35) [0x7f6518847d05]
 4: (abort()+0x186) [0x7f651884bab6]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f65190fe6dd]
 6: (()+0xb9926) [0x7f65190fc926]
 7: (()+0xb9953) [0x7f65190fc953]
 8: (()+0xb9a5e) [0x7f65190fca5e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x362) [0x57d222]
 10: (PaxosService::propose_pending()+0x265) [0x498f55]
 11: (OSDMonitor::handle_osd_timeouts(utime_t const&, std::map<int, utime_t, std::less<int>, std::allocator<std::pair<int const, utime_t> > > const&)+0x3be) [0x49c9ee]
 12: (PGMonitor::handle_osd_timeouts()+0xef) [0x4d54df]
 13: (PGMonitor::tick()+0x3e) [0x4e088e]
 14: (Monitor::tick()+0x45) [0x478cc5]
 15: (SafeTimer::timer_thread()+0x29b) [0x57b6eb]
 16: (SafeTimerThread::entry()+0xd) [0x57cc0d]
 17: (()+0x6d8c) [0x7f651934fd8c]
 18: (clone()+0x6d) [0x7f65188fa04d]
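The trace shows propose_pending() being reached from the timer path while no pending value was staged, which trips the assert. A minimal sketch of that state machine (the names have_pending, prepare_pending, and propose_pending are borrowed from the trace; everything else is a hypothetical illustration, not Ceph's actual code):

```cpp
#include <cassert>

// Minimal model of the pending-value state machine behind the assert above.
// This is an illustration under assumed semantics, not Ceph's implementation.
struct MiniPaxosService {
  bool have_pending = false;

  // Stage a new pending value (normally done once paxos can accept writes).
  void prepare_pending() { have_pending = true; }

  // Hand the staged value to paxos; a value must actually be staged.
  void propose_pending() {
    assert(have_pending);  // the assertion that fired at PaxosService.cc:126
    have_pending = false;
  }

  // A timeout handler that checks for a staged value first, mirroring the
  // kind of guard the eventual fix introduces.
  void handle_timeout() {
    if (have_pending)
      propose_pending();
  }
};
```

Calling handle_timeout() with nothing staged simply returns; an unguarded propose_pending() in the same situation would abort, which is what the monitor did here.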

Associated revisions

Revision 7e2e4779 (diff)
Added by Sage Weil over 8 years ago

mon: make sure osd paxos is writeable before doing timeouts

The osd paxos machine has to be writeable before we can update it.

Fixes: #1130
Signed-off-by: Sage Weil <>
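The shape of that fix can be sketched as a writeability guard at the top of the timeout path. This is a hedged illustration: is_writeable() and the surrounding structure are assumptions about the API of that era, not an excerpt from commit 7e2e4779.

```cpp
#include <cassert>

// Hypothetical stand-ins for the monitor-side objects; the real change
// lives around OSDMonitor::handle_osd_timeouts in the ceph tree.
struct FakePaxos {
  bool writeable = false;
  bool is_writeable() const { return writeable; }  // assumed API
};

struct FakeOSDMonitor {
  FakePaxos paxos;
  int proposals = 0;

  void handle_osd_timeouts() {
    // The guard: without it, marking a timed-out osd down here would call
    // propose_pending() while have_pending was false and trip the assert.
    if (!paxos.is_writeable())
      return;
    ++proposals;  // stand-in for "mark timed-out osds down and propose"
  }
};
```

With the guard, a timeout that arrives while the osd paxos machine is not yet writeable is skipped and retried on a later tick instead of crashing the monitor.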

History

#1 Updated by Sage Weil over 8 years ago

  • Status changed from New to Resolved
  • Assignee set to Sage Weil
  • Target version set to v0.30

This should be fixed by 7e2e4779e4323429167af36e9a5fb9741c075e96. Thanks for the report!
