Bug #1909

Two mons crash after starting the third one

Added by Maciej Galkiewicz about 12 years ago. Updated about 12 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

I had three mons. One of them was reinstalled without being removed from the cluster. Now, after starting the reinstalled mon, the other two crash with this error:

2012-01-09 16:52:32.251857 7f28a7231700 -- 1.1.1.1:6789/0 >> 2.2.2.2:6800/0 pipe(0x2709780 sd=41 pgs=1 cs=1 l=0).fault with nothing to send, going to standby
2012-01-09 16:52:37.276071 7f28a8839700 log [INF] : mon.n4c1 calling new monitor election
mon/MonMap.h: In function 'entity_inst_t MonMap::get_inst(unsigned int)', in thread '7f28a8839700'
mon/MonMap.h: 162: FAILED assert(m < rank_addr.size())
 ceph version 0.39-195-ge18b1c9 (commit:e18b1c9734e88e3b779ba2d70cdd54f8fb94743d)
 1: (Elector::defer(int)+0x29a) [0x5050aa]
 2: (Elector::handle_propose(MMonElection*)+0x30b) [0x5053eb]
 3: (Elector::dispatch(Message*)+0x7cb) [0x506d8b]
 4: (Monitor::_ms_dispatch(Message*)+0xcf4) [0x47e7f4]
 5: (Monitor::ms_dispatch(Message*)+0x90) [0x48c720]
 6: (SimpleMessenger::dispatch_entry()+0x869) [0x582ef9]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4664bc]
 8: (()+0x68ba) [0x7f28ac2c98ba]
 9: (clone()+0x6d) [0x7f28aab2502d]
 ceph version 0.39-195-ge18b1c9 (commit:e18b1c9734e88e3b779ba2d70cdd54f8fb94743d)
 1: (Elector::defer(int)+0x29a) [0x5050aa]
 2: (Elector::handle_propose(MMonElection*)+0x30b) [0x5053eb]
 3: (Elector::dispatch(Message*)+0x7cb) [0x506d8b]
 4: (Monitor::_ms_dispatch(Message*)+0xcf4) [0x47e7f4]
 5: (Monitor::ms_dispatch(Message*)+0x90) [0x48c720]
 6: (SimpleMessenger::dispatch_entry()+0x869) [0x582ef9]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4664bc]
 8: (()+0x68ba) [0x7f28ac2c98ba]
 9: (clone()+0x6d) [0x7f28aab2502d]
*** Caught signal (Aborted) **
 in thread 7f28a8839700
 ceph version 0.39-195-ge18b1c9 (commit:e18b1c9734e88e3b779ba2d70cdd54f8fb94743d)
 1: /usr/bin/ceph-mon() [0x5cfc89]
 2: (()+0xef60) [0x7f28ac2d1f60]
 3: (gsignal()+0x35) [0x7f28aaa88165]
 4: (abort()+0x180) [0x7f28aaa8af70]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f28ab309c2d]
 6: (()+0xb8dd6) [0x7f28ab307dd6]
 7: (()+0xb8e03) [0x7f28ab307e03]
 8: (()+0xb8efe) [0x7f28ab307efe]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x3a7) [0x5a1e17]
 10: (Elector::defer(int)+0x29a) [0x5050aa]
 11: (Elector::handle_propose(MMonElection*)+0x30b) [0x5053eb]
 12: (Elector::dispatch(Message*)+0x7cb) [0x506d8b]
 13: (Monitor::_ms_dispatch(Message*)+0xcf4) [0x47e7f4]
 14: (Monitor::ms_dispatch(Message*)+0x90) [0x48c720]
 15: (SimpleMessenger::dispatch_entry()+0x869) [0x582ef9]
 16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4664bc]
 17: (()+0x68ba) [0x7f28ac2c98ba]
 18: (clone()+0x6d) [0x7f28aab2502d]

Is it necessary to remove the reinstalled mon from the cluster and add it again?

Associated revisions

Revision 675e4c41 (diff)
Added by Sage Weil about 12 years ago

mon: drop election messages with bad rank

The bad message came from old code pre-bfbeae68c045de76ede86ca4f72d2a760a19c84b.

Fixes: #1909
Signed-off-by: Sage Weil <>

History

#1 Updated by Sage Weil about 12 years ago

  • Status changed from New to Need More Info

Can you generate a log with 'debug mon = 20' and 'debug ms = 1' for the existing monitors leading up to the crash?

#2 Updated by Maciej Galkiewicz about 12 years ago

I reinstalled the ceph mon as I described, but it now has a different IP address. Even though I changed its DNS record, the existing mons still had the old IP. I can no longer reproduce the error (I tried before updating the IPs). Try to reproduce it like this:

  • install 3 mons
  • shut down one of them and point its DNS entry to the IP of the new mon
  • install a new mon with the same id

The two old mons should have the old IP address of the third one in ceph.conf. The new one should have the updated IP. After updating the addresses, everything seems to work fine.

#3 Updated by Sage Weil about 12 years ago

  • Category set to Monitor
  • Status changed from Need More Info to Resolved

This really looks like the bug fixed in bfbeae68c045de76ede86ca4f72d2a760a19c84b: the sender sent a message with a bad rank, and the receiver didn't validate it. Fixed that in 675e4c41bcfd66f6e3061d9cc557555cef971719.
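
For readers unfamiliar with the failure mode, here is a minimal, self-contained sketch of the general idea: check the rank carried by an incoming election message against the monitor map before indexing with it, and drop the message if it is out of range. This is not the code from 675e4c41. The names SketchMonMap, SketchProposal, and handle_propose are hypothetical stand-ins, and the addresses in the usage note are placeholders.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a monitor map: the mon with rank i is at rank_addr[i].
struct SketchMonMap {
    std::vector<std::string> rank_addr;

    // True only when `rank` indexes an existing entry.
    bool contains_rank(int rank) const {
        return rank >= 0 && static_cast<std::size_t>(rank) < rank_addr.size();
    }
};

// Hypothetical election message carrying the rank the sender claims to have.
struct SketchProposal {
    int sender_rank;
};

// Validates the sender's rank before using it as an index.
// Returns false (message dropped, *addr_out untouched) when the rank is not
// in the map; otherwise stores the sender's address and returns true.
// An unchecked lookup here is the kind of access that trips an assert such as
// `m < rank_addr.size()`.
bool handle_propose(const SketchMonMap& map, const SketchProposal& msg,
                    std::string* addr_out) {
    if (!map.contains_rank(msg.sender_rank)) {
        return false;  // bad rank: ignore the message instead of aborting
    }
    *addr_out = map.rank_addr[static_cast<std::size_t>(msg.sender_rank)];
    return true;
}
```

With a two-entry map, a proposal claiming rank 1 resolves to the second address, while one claiming rank 2 or rank -1 is dropped and the receiving process keeps running.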
