Bug #21569

closed

after adding new mon to existing cluster all mons keep dying with FAILED assert(version >= summary.version) or FAILED assert(version >= pg_map.version)

Added by Tobias Fischer over 6 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

We have a Kraken (11.2.0) cluster with 856 OSDs and 5 mons running.
We wanted to replace one mon. Unfortunately, we did not realize that the new mon we added through ceph-deploy was running a newer Kraken version (11.2.1). After removing the old mon, all mons died.
We were able to recover by downgrading the new mon to 11.2.0, and everything seemed fine at first. But now all our mons keep dying after about an hour. Most of them restart, but sometimes one stays down. Here is an extract of a mon log:

-4> 2017-09-27 13:07:08.921705 7f0fe8104700 5 -- 10.27.251.247:6789/0 >> 10.27.251.208:6789/0 conn(0x56483f8ea800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=201328 cs=1 l=0). rx mon.0 seq 46 0x564840bc3980 paxos(lease lc 67376027 fc 67375429 pn 0 opn 0) v3
-3> 2017-09-27 13:07:08.922470 7f0fe8104700 5 -- 10.27.251.247:6789/0 >> 10.27.251.208:6789/0 conn(0x56483f8ea800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=201328 cs=1 l=0). rx mon.0 seq 47 0x56484029aac0 route(pg_stats_ack(0 pgs tid 697488) v1 tid 45918) v3
-2> 2017-09-27 13:07:08.922794 7f0fe8104700 5 -- 10.27.251.247:6789/0 >> 10.27.251.208:6789/0 conn(0x56483f8ea800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=201328 cs=1 l=0). rx mon.0 seq 48 0x564843492580 paxos(begin lc 67376027 fc 0 pn 61600 opn 0) v3
-1> 2017-09-27 13:07:08.922893 7f0fe8905700 5 -- 10.27.251.247:6789/0 >> 10.27.251.247:0/3485083532 conn(0x5648401ad800 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx client.30278781 seq 7 0x564842737600 mon_command({"prefix": "get_command_descriptions"} v 0) v1
0> 2017-09-27 13:07:08.924120 7f0fc9533700 -1 /build/ceph-11.2.0/src/mon/LogMonitor.cc: In function 'virtual void LogMonitor::update_from_paxos(bool*)' thread 7f0fc9533700 time 2017-09-27 13:07:08.922042
/build/ceph-11.2.0/src/mon/LogMonitor.cc: 83: FAILED assert(version >= summary.version)
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x564833811a12]
2: (LogMonitor::update_from_paxos(bool*)+0x17a9) [0x564833729729]
3: (PaxosService::refresh(bool*)+0x1ce) [0x564833670d7e]
4: (Monitor::refresh_from_paxos(bool*)+0x19b) [0x564833602c3b]
5: (Paxos::do_refresh()+0x49) [0x56483365b6e9]
6: (Paxos::handle_commit(std::shared_ptr<MonOpRequest>)+0x341) [0x564833663d71]
7: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x32c) [0x56483366ae3c]
8: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xd25) [0x564833639ce5]
9: (Monitor::_ms_dispatch(Message*)+0x53f) [0x56483363a3af]
10: (Monitor::ms_dispatch(Message*)+0x23) [0x56483365a653]
11: (DispatchQueue::entry()+0x7e2) [0x5648339ee3e2]
12: (DispatchQueue::DispatchThread::entry()+0xd) [0x56483389fe1d]
13: (()+0x7424) [0x7f0fef035424]
14: (clone()+0x5f) [0x7f0fed2559bf]

-12> 2017-09-27 14:44:35.822628 7f0b25b89700 1 -- 10.27.251.247:6789/0 >> 10.27.247.191:0/3724464501 conn(0x55dacb917000 :6789 s=STATE_OPEN pgs=68962 cs=1 l=1).read_bulk peer close file descriptor 888
-11> 2017-09-27 14:44:35.822638 7f0b25b89700 1 -- 10.27.251.247:6789/0 >> 10.27.247.191:0/3724464501 conn(0x55dacb917000 :6789 s=STATE_OPEN pgs=68962 cs=1 l=1).read_until read failed
-10> 2017-09-27 14:44:35.822641 7f0b25b89700 1 -- 10.27.251.247:6789/0 >> 10.27.247.191:0/3724464501 conn(0x55dacb917000 :6789 s=STATE_OPEN pgs=68962 cs=1 l=1).process read tag failed
-9> 2017-09-27 14:44:35.822644 7f0b25b89700 1 -- 10.27.251.247:6789/0 >> 10.27.247.191:0/3724464501 conn(0x55dacb917000 :6789 s=STATE_OPEN pgs=68962 cs=1 l=1).fault on lossy channel, failing
-8> 2017-09-27 14:44:35.822648 7f0b25b89700 2 -- 10.27.251.247:6789/0 >> 10.27.247.191:0/3724464501 conn(0x55dacb917000 :6789 s=STATE_OPEN pgs=68962 cs=1 l=1)._stop
-7> 2017-09-27 14:44:35.822653 7f0b25b89700 2 Event(0x55dac6685880 nevent=5000 time_id=2437).wakeup
-6> 2017-09-27 14:44:35.822666 7f0b26b8b700 1 -- 10.27.251.247:6789/0 reap_dead start
-5> 2017-09-27 14:44:35.822673 7f0b26b8b700 5 -- 10.27.251.247:6789/0 reap_dead delete 0x55dac99af800
-4> 2017-09-27 14:44:35.822683 7f0b26b8b700 5 -- 10.27.251.247:6789/0 reap_dead delete 0x55dac9ff8000
-3> 2017-09-27 14:44:35.822689 7f0b26b8b700 5 -- 10.27.251.247:6789/0 reap_dead delete 0x55dacb917000
-2> 2017-09-27 14:44:35.822693 7f0b26b8b700 5 -- 10.27.251.247:6789/0 reap_dead delete 0x55dad4dfb800
-1> 2017-09-27 14:44:35.822697 7f0b26b8b700 5 -- 10.27.251.247:6789/0 reap_dead delete 0x55dad58a7800
0> 2017-09-27 14:44:35.822948 7f0b051fd700 -1 /build/ceph-11.2.0/src/mon/PGMonitor.cc: In function 'virtual void PGMonitor::update_from_paxos(bool*)' thread 7f0b051fd700 time 2017-09-27 14:44:35.820969
/build/ceph-11.2.0/src/mon/PGMonitor.cc: 170: FAILED assert(version >= pg_map.version)
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55dabbe16a12]
2: (PGMonitor::update_from_paxos(bool*)+0x1359) [0x55dabbd6ab39]
3: (PaxosService::refresh(bool*)+0x1ce) [0x55dabbc75d7e]
4: (Monitor::refresh_from_paxos(bool*)+0x19b) [0x55dabbc07c3b]
5: (Paxos::do_refresh()+0x49) [0x55dabbc606e9]
6: (Paxos::handle_commit(std::shared_ptr<MonOpRequest>)+0x341) [0x55dabbc68d71]
7: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x32c) [0x55dabbc6fe3c]
8: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xd25) [0x55dabbc3ece5]
9: (Monitor::_ms_dispatch(Message*)+0x53f) [0x55dabbc3f3af]
10: (Monitor::ms_dispatch(Message*)+0x23) [0x55dabbc5f653]
11: (DispatchQueue::entry()+0x7e2) [0x55dabbff33e2]
12: (DispatchQueue::DispatchThread::entry()+0xd) [0x55dabbea4e1d]
13: (()+0x7424) [0x7f0b2d1ef424]
14: (clone()+0x5f) [0x7f0b2b40f9bf]
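For context on what both asserts are checking: this is not the actual Ceph code, just a simplified sketch of the invariant behind `FAILED assert(version >= summary.version)` / `assert(version >= pg_map.version)`. The names `PaxosServiceSketch`, `summary_version`, and `committed_version` are hypothetical. On refresh, a monitor's PaxosService compares the latest committed version read from the paxos store against the version of its in-memory state, and that committed version is expected never to fall behind what is already loaded. Our guess is that the mixed 11.2.0/11.2.1 episode left some on-disk state in a shape where that monotonicity check trips.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the monotonicity invariant that
// LogMonitor::update_from_paxos() and PGMonitor::update_from_paxos()
// enforce. Not Ceph's real implementation.
struct PaxosServiceSketch {
    uint64_t summary_version = 0;  // version of the in-memory state

    // 'committed_version' stands for the latest version read back from
    // the paxos store on refresh.
    void update_from_paxos(uint64_t committed_version) {
        // The store must never report a version older than what we
        // already hold in memory; this is the check that fails above.
        assert(committed_version >= summary_version);
        summary_version = committed_version;  // catch up to the store
    }
};
```

So a crash here means the monitor believed it had state at some version N while the freshly read committed state claimed a version < N, which should be impossible during normal single-version operation.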

Is this a bug, or did we break something ourselves? Any help is appreciated. Thanks.

#1

Updated by Sage Weil almost 3 years ago

  • Status changed from New to Closed
