Bug #8851
Mon crash after update to 0.80.4 (closed)
Description
After I updated the mon from 0.80.3 to 0.80.4 and restarted it, it crashed:
---------------------------------------------------------------------
root@SH176028:~/php-leveldb# /etc/init.d/ceph start mon
=== mon.a ===
Starting Ceph mon.a on SH176028...
mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f88f4586780 time 2014-07-17 00:58:39.286306
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7f88f258476d]
8: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2014-07-17 00:58:39.287322 7f88f4586780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f88f4586780 time 2014-07-17 00:58:39.286306
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7f88f258476d]
8: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2014-07-17 00:58:39.287322 7f88f4586780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f88f4586780 time 2014-07-17 00:58:39.286306
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7f88f258476d]
8: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
in thread 7f88f4586780
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: /usr/bin/ceph-mon() [0x87251a]
2: (()+0xfcb0) [0x7f88f39b3cb0]
3: (gsignal()+0x35) [0x7f88f2599425]
4: (abort()+0x17b) [0x7f88f259cb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f88f2eec69d]
6: (()+0xb5846) [0x7f88f2eea846]
7: (()+0xb5873) [0x7f88f2eea873]
8: (()+0xb596e) [0x7f88f2eea96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x77252f]
10: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
11: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
13: (Monitor::init_paxos()+0xf5) [0x54a515]
14: (Monitor::preinit()+0x69f) [0x56291f]
15: (main()+0x2665) [0x534df5]
16: (__libc_start_main()+0xed) [0x7f88f258476d]
17: /usr/bin/ceph-mon() [0x537bf9]
2014-07-17 00:58:39.321298 7f88f4586780 -1 *** Caught signal (Aborted) **
in thread 7f88f4586780
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: /usr/bin/ceph-mon() [0x87251a]
2: (()+0xfcb0) [0x7f88f39b3cb0]
3: (gsignal()+0x35) [0x7f88f2599425]
4: (abort()+0x17b) [0x7f88f259cb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f88f2eec69d]
6: (()+0xb5846) [0x7f88f2eea846]
7: (()+0xb5873) [0x7f88f2eea873]
8: (()+0xb596e) [0x7f88f2eea96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x77252f]
10: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
11: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
13: (Monitor::init_paxos()+0xf5) [0x54a515]
14: (Monitor::preinit()+0x69f) [0x56291f]
15: (main()+0x2665) [0x534df5]
16: (__libc_start_main()+0xed) [0x7f88f258476d]
17: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2014-07-17 00:58:39.321298 7f88f4586780 -1 *** Caught signal (Aborted) **
in thread 7f88f4586780
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: /usr/bin/ceph-mon() [0x87251a]
2: (()+0xfcb0) [0x7f88f39b3cb0]
3: (gsignal()+0x35) [0x7f88f2599425]
4: (abort()+0x17b) [0x7f88f259cb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f88f2eec69d]
6: (()+0xb5846) [0x7f88f2eea846]
7: (()+0xb5873) [0x7f88f2eea873]
8: (()+0xb596e) [0x7f88f2eea96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x77252f]
10: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
11: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
13: (Monitor::init_paxos()+0xf5) [0x54a515]
14: (Monitor::preinit()+0x69f) [0x56291f]
15: (main()+0x2665) [0x534df5]
16: (__libc_start_main()+0xed) [0x7f88f258476d]
17: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
[20660]: (33) Numerical argument out of domain
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf --cluster ceph '
Updated by shaojun ruan almost 10 years ago
It can be temporarily resolved with this command:
-------------------------------------------------
ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0
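Since the command above rewrites a key in the monitor store, it may be worth copying the store directory first. The following is only a sketch under stated assumptions: `backup_mon_store` is a hypothetical helper name, not part of Ceph, and it assumes the monitor process is not running while the copy is made.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): copy a monitor store directory
# to a timestamped sibling directory before editing it by hand.
# Assumption: the monitor that owns the store is stopped.
backup_mon_store() {
    store="$1"
    backup="${store}.bak.$(date +%Y%m%d%H%M%S)"

    # Refuse to continue if the store directory does not exist.
    if [ ! -d "$store" ]; then
        echo "no such store directory: $store" >&2
        return 1
    fi

    # -a preserves ownership, permissions and timestamps.
    cp -a "$store" "$backup" || return 1

    # Print the backup location so the caller can record it.
    echo "$backup"
}
```

It could be called as `backup_mon_store /var/lib/ceph/mon/store.db` before running the ceph-kvstore-tool command; the printed path is where the copy was written.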
Updated by Greg Farnum almost 10 years ago
Can you upload the full log of startup with crash?
By "temporarily resolved", do you mean it's working now, or does the issue resurrect itself?
Updated by Joao Eduardo Luis almost 10 years ago
- Status changed from New to 7
- Assignee set to Joao Eduardo Luis
This issue should only affect users that have been running without cephx and have not ever created a key.
It's due to the AuthMonitor not encoding a full version if there are no keys.
Reproducing this is trivial but takes some time. Reproducing it with an unchanged vstart.sh deployment is not straightforward, as vstart will add keys even if cephx is not being used, which prevents the bug from being triggered.
Easy steps to reproduce this:
- manually create 2+ monitors, cephx disabled
- set the following options on ceph.conf:
mon globalid prealloc = 1
paxos service trim min = 10
paxos service trim max = 20
- run 'ceph log foo ; sleep 1' some 50 times (use the loops luke)
- stop mons
- use 'ceph-kvstore-tool path/to/store.db list auth' to see if version 0 still exists; if it does not, continue; otherwise restart the mons and go back to the loop step
- restart mons for crash
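The loop step above could be scripted roughly like this. It is a minimal sketch, not a tested reproducer: `run_log_loop`, `CEPH_CMD` and `SLEEP_SECS` are hypothetical names introduced here; only the 'ceph log foo' command, the one-second sleep and the count of about 50 come from the steps.

```shell
#!/bin/sh
# Hypothetical helper: run 'ceph log foo' a given number of times,
# sleeping between iterations, to generate paxos commits so that old
# versions get trimmed.
# CEPH_CMD and SLEEP_SECS are illustrative overrides, not Ceph options.
run_log_loop() {
    count="$1"
    cmd="${CEPH_CMD:-ceph}"
    pause="${SLEEP_SECS:-1}"

    i=1
    while [ "$i" -le "$count" ]; do
        # Stop early if the command fails (for example, no quorum).
        "$cmd" log foo || return 1
        sleep "$pause"
        i=$((i + 1))
    done
}
```

With the mons running, `run_log_loop 50` corresponds to the "some 50 times" step; afterwards stop the mons and check the store as described.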
Updated by Joao Eduardo Luis almost 10 years ago
- Status changed from 7 to Fix Under Review
Updated by shaojun ruan almost 10 years ago
Greg Farnum wrote:
Can you upload the full log of startup with crash?
By "temporarily resolved", do you mean it's working now, or does the issue resurrect itself?
Yes, it's working and has not crashed again.
Updated by Sage Weil almost 10 years ago
- Status changed from Fix Under Review to Resolved
Updated by wei li over 9 years ago
In our production environment we use 0.83, and we have come across this problem too.
I tried the patch https://github.com/ceph/ceph/pull/2128, then rebuilt and redeployed ceph-mon, but it still has the same problem.
I also tried the workaround "ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0" and restarted ceph-mon, but the result is still the same.
My understanding is that this patch only fixes newly deployed environments; an environment that already has user data seems to need a different workaround.
In my environment there are three mons. mon.0 crashed and cannot start, so now only mon.1 and mon.2 are working. If they crash too, the whole environment will go down.