Bug #8851

closed

Mon crash after update to 0.80.4

Added by shaojun ruan almost 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee: Joao Eduardo Luis
Category: Monitor
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After I updated the mon from 0.80.3 to 0.80.4 and restarted it, it crashed
---------------------------------------------------------------------

root@SH176028:~/php-leveldb# /etc/init.d/ceph start mon
=== mon.a ===
Starting Ceph mon.a on SH176028...
mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f88f4586780 time 2014-07-17 00:58:39.286306
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7f88f258476d]
8: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2014-07-17 00:58:39.287322 7f88f4586780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f88f4586780 time 2014-07-17 00:58:39.286306
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)

ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7f88f258476d]
8: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2014-07-17 00:58:39.287322 7f88f4586780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f88f4586780 time 2014-07-17 00:58:39.286306
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7f88f258476d]
8: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
in thread 7f88f4586780
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: /usr/bin/ceph-mon() [0x87251a]
2: (()+0xfcb0) [0x7f88f39b3cb0]
3: (gsignal()+0x35) [0x7f88f2599425]
4: (abort()+0x17b) [0x7f88f259cb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f88f2eec69d]
6: (()+0xb5846) [0x7f88f2eea846]
7: (()+0xb5873) [0x7f88f2eea873]
8: (()+0xb596e) [0x7f88f2eea96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x77252f]
10: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
11: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
13: (Monitor::init_paxos()+0xf5) [0x54a515]
14: (Monitor::preinit()+0x69f) [0x56291f]
15: (main()+0x2665) [0x534df5]
16: (__libc_start_main()+0xed) [0x7f88f258476d]
17: /usr/bin/ceph-mon() [0x537bf9]
2014-07-17 00:58:39.321298 7f88f4586780 -1 *** Caught signal (Aborted) **
in thread 7f88f4586780
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: /usr/bin/ceph-mon() [0x87251a]
2: (()+0xfcb0) [0x7f88f39b3cb0]
3: (gsignal()+0x35) [0x7f88f2599425]
4: (abort()+0x17b) [0x7f88f259cb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f88f2eec69d]
6: (()+0xb5846) [0x7f88f2eea846]
7: (()+0xb5873) [0x7f88f2eea873]
8: (()+0xb596e) [0x7f88f2eea96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x77252f]
10: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
11: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
13: (Monitor::init_paxos()+0xf5) [0x54a515]
14: (Monitor::preinit()+0x69f) [0x56291f]
15: (main()+0x2665) [0x534df5]
16: (__libc_start_main()+0xed) [0x7f88f258476d]
17: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2014-07-17 00:58:39.321298 7f88f4586780 -1 *** Caught signal (Aborted) **
in thread 7f88f4586780
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: /usr/bin/ceph-mon() [0x87251a]
2: (()+0xfcb0) [0x7f88f39b3cb0]
3: (gsignal()+0x35) [0x7f88f2599425]
4: (abort()+0x17b) [0x7f88f259cb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f88f2eec69d]
6: (()+0xb5846) [0x7f88f2eea846]
7: (()+0xb5873) [0x7f88f2eea873]
8: (()+0xb596e) [0x7f88f2eea96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x77252f]
10: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
11: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
13: (Monitor::init_paxos()+0xf5) [0x54a515]
14: (Monitor::preinit()+0x69f) [0x56291f]
15: (main()+0x2665) [0x534df5]
16: (__libc_start_main()+0xed) [0x7f88f258476d]
17: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

[20660]: (33) Numerical argument out of domain
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf --cluster ceph '


Related issues 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #9535: monitor crashed after restarting (Duplicate, 09/19/2014)

Actions #1

Updated by shaojun ruan almost 10 years ago

it can be temporarily resolved by this command?
-------------------------------------------------
ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0

Actions #2

Updated by Greg Farnum almost 10 years ago

Can you upload the full log of startup with crash?
By "temporarily resolved", do you mean it's working now, or does the issue resurrect itself?

Actions #3

Updated by Joao Eduardo Luis over 9 years ago

  • Status changed from New to 7
  • Assignee set to Joao Eduardo Luis

This issue should only affect users who have been running without cephx and have never created a key.

It's due to the AuthMonitor not encoding a full version if there are no keys.

Reproducing this is trivial but takes some time. Reproducing it with an unchanged vstart.sh deployment is not straightforward, as vstart will add keys even if cephx is not being used, which prevents the conditions needed to trigger the bug.

Easy steps to reproduce this:

- manually create 2+ monitors, cephx disabled
- set the following options on ceph.conf:

 mon globalid prealloc = 1
 paxos service trim min = 10
 paxos service trim max = 20

- run 'ceph log foo ; sleep 1' some 50 times (use the loops luke)
- stop mons
- use 'ceph-kvstore-tool path/to/store.db list auth' to see whether version 0 still exists; if it does not, continue; otherwise restart the mons and go back to the loop step
- restart the mons to trigger the crash
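
The loop step above can be sketched as a small shell helper. This is only an illustration: `repeat_cmd` is a hypothetical name, not part of Ceph, and the count and delay simply mirror the "some 50 times" and "sleep 1" from the steps. It assumes a running test cluster with the `ceph` CLI on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not a Ceph tool): run a command a fixed number of
# times, pausing between runs, and stop early if the command fails.
# Usage: repeat_cmd COUNT DELAY_SECONDS COMMAND [ARGS...]
repeat_cmd() {
    rc_count=$1
    rc_delay=$2
    shift 2
    rc_i=0
    while [ "$rc_i" -lt "$rc_count" ]; do
        "$@" || return 1
        rc_i=$((rc_i + 1))
        sleep "$rc_delay"
    done
}

# For the reproduction steps this would be invoked as follows
# (left commented out because it needs a live test cluster):
# repeat_cmd 50 1 ceph log foo
```

Stopping on the first failure is a design choice for this sketch, so that a mon that is already down is noticed instead of the loop silently running to the end.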

Actions #4

Updated by Joao Eduardo Luis over 9 years ago

  • Status changed from 7 to Fix Under Review
Actions #5

Updated by shaojun ruan over 9 years ago

Greg Farnum wrote:

Can you upload the full log of startup with crash?
By "temporarily resolved", do you mean it's working now, or does the issue resurrect itself?

Yes, it's working and has not crashed again.

Actions #6

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Resolved
Actions #7

Updated by wei li over 9 years ago

In our production environment we use 0.83, and we are coming across this problem too.
I tried the patch https://github.com/ceph/ceph/pull/2128, rebuilt and redeployed ceph-mon, and it still has the same problem.
I also tried the workaround "ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0" and restarted ceph-mon; the result is still the same.

My understanding is that this patch only fixes newly deployed environments; for an environment that already has user data, some other workaround seems to be needed.
In my environment there are three mons. mon.0 crashed and cannot start, so only mon.1 and mon.2 are working now. If they crash too, the whole environment will go down.
