Bug #8851

Mon crash after update to 0.80.4

Added by shaojun ruan over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Urgent
Category: Monitor
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

When I updated the mon from 0.80.3 to 0.80.4 and restarted it, it crashed
---------------------------------------------------------------------

root@SH176028:~/php-leveldb# /etc/init.d/ceph start mon
=== mon.a ===
Starting Ceph mon.a on SH176028...
mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f88f4586780 time 2014-07-17 00:58:39.286306
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7f88f258476d]
8: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2014-07-17 00:58:39.287322 7f88f4586780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f88f4586780 time 2014-07-17 00:58:39.286306
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)

ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7f88f258476d]
8: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2014-07-17 00:58:39.287322 7f88f4586780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f88f4586780 time 2014-07-17 00:58:39.286306
mon/AuthMonitor.cc: 155: FAILED assert(ret == 0)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
2: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
4: (Monitor::init_paxos()+0xf5) [0x54a515]
5: (Monitor::preinit()+0x69f) [0x56291f]
6: (main()+0x2665) [0x534df5]
7: (__libc_start_main()+0xed) [0x7f88f258476d]
8: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7f88f4586780
 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
 1: /usr/bin/ceph-mon() [0x87251a]
 2: (()+0xfcb0) [0x7f88f39b3cb0]
 3: (gsignal()+0x35) [0x7f88f2599425]
 4: (abort()+0x17b) [0x7f88f259cb8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f88f2eec69d]
 6: (()+0xb5846) [0x7f88f2eea846]
 7: (()+0xb5873) [0x7f88f2eea873]
 8: (()+0xb596e) [0x7f88f2eea96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x77252f]
 10: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
 11: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
 12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
 13: (Monitor::init_paxos()+0xf5) [0x54a515]
 14: (Monitor::preinit()+0x69f) [0x56291f]
 15: (main()+0x2665) [0x534df5]
 16: (__libc_start_main()+0xed) [0x7f88f258476d]
 17: /usr/bin/ceph-mon() [0x537bf9]
2014-07-17 00:58:39.321298 7f88f4586780 -1 *** Caught signal (Aborted) **
 in thread 7f88f4586780
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: /usr/bin/ceph-mon() [0x87251a]
2: (()+0xfcb0) [0x7f88f39b3cb0]
3: (gsignal()+0x35) [0x7f88f2599425]
4: (abort()+0x17b) [0x7f88f259cb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f88f2eec69d]
6: (()+0xb5846) [0x7f88f2eea846]
7: (()+0xb5873) [0x7f88f2eea873]
8: (()+0xb596e) [0x7f88f2eea96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x77252f]
10: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
11: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
13: (Monitor::init_paxos()+0xf5) [0x54a515]
14: (Monitor::preinit()+0x69f) [0x56291f]
15: (main()+0x2665) [0x534df5]
16: (__libc_start_main()+0xed) [0x7f88f258476d]
17: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2014-07-17 00:58:39.321298 7f88f4586780 -1 *** Caught signal (Aborted) **
in thread 7f88f4586780
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: /usr/bin/ceph-mon() [0x87251a]
2: (()+0xfcb0) [0x7f88f39b3cb0]
3: (gsignal()+0x35) [0x7f88f2599425]
4: (abort()+0x17b) [0x7f88f259cb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f88f2eec69d]
6: (()+0xb5846) [0x7f88f2eea846]
7: (()+0xb5873) [0x7f88f2eea873]
8: (()+0xb596e) [0x7f88f2eea96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x77252f]
10: (AuthMonitor::update_from_paxos(bool*)+0x21a6) [0x6611d6]
11: (PaxosService::refresh(bool*)+0x445) [0x5b05b5]
12: (Monitor::refresh_from_paxos(bool*)+0x57) [0x54a347]
13: (Monitor::init_paxos()+0xf5) [0x54a515]
14: (Monitor::preinit()+0x69f) [0x56291f]
15: (main()+0x2665) [0x534df5]
16: (__libc_start_main()+0xed) [0x7f88f258476d]
17: /usr/bin/ceph-mon() [0x537bf9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

[20660]: (33) Numerical argument out of domain
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf --cluster ceph '


Related issues

Duplicated by Ceph - Bug #9535: monitor crashed after restarting Duplicate 09/19/2014

Associated revisions

Revision b551ae2b (diff)
Added by Joao Eduardo Luis over 5 years ago

mon: AuthMonitor: always encode full regardless of keyserver having keys

On clusters without cephx, assuming an admin never added a key to the
cluster, the monitors have empty key servers. A previous patch had the
AuthMonitor not encoding an empty keyserver as a full version.

As such, whenever the monitor restarts we will have to read the whole
state from disk in the form of incrementals. This poses a problem upon
trimming, as we do every now and then: whenever we start the monitor, it
will start with an empty keyserver, waiting to be populated from whatever
we have on disk. This is performed in update_from_paxos(), and the
AuthMonitor will rely on the keyserver version to decide which
incrementals we care about -- basically, all versions > keyserver version.

Although we started with an empty keyserver (version 0) and are expecting
to read state from disk, in this case it means we will attempt to read
version 1 first. If the cluster has been running for a while now, and
even if no keys have been added, it's fair to assume that version is
greater than 0 (or even 1), as the AuthMonitor also deals with and keeps track
of auth global ids. As such, we expect to read version 1, then version 2,
and so on. If we trim at some point however this will not be possible,
as version 1 will not exist -- and we will assert because of that.

This is fixed by ensuring the AuthMonitor keeps track of full versions
of the key server, even if it's of an empty key server -- it will still
keep track of the key server's version, which is incremented each time
we update from paxos even if it is empty.

Fixes: #8851
Backport: dumpling, firefly

Signed-off-by: Joao Eduardo Luis <>
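
The failure mode described in the commit message above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the monitor's actual code: `ToyStore`, `replay_on_restart` and `MissingVersion` are hypothetical names, and the store is reduced to two dictionaries. It shows why replaying incrementals from version 1 fails once early versions have been trimmed, and why always writing a full version avoids the problem.

```python
class MissingVersion(Exception):
    """Raised where the real monitor would hit assert(ret == 0)."""


class ToyStore:
    """Hypothetical stand-in for the on-disk auth state of one monitor."""

    def __init__(self):
        self.incrementals = {}   # version -> incremental payload
        self.full = {}           # version -> full keyserver snapshot
        self.last_committed = 0

    def commit(self, payload, encode_full):
        # Every update bumps the version, even when no keys exist
        # (for example when only auth global ids change).
        self.last_committed += 1
        version = self.last_committed
        self.incrementals[version] = payload
        if encode_full:
            self.full[version] = "full@%d" % version

    def trim(self, keep):
        # Drop everything older than the newest `keep` versions.
        first_kept = self.last_committed - keep + 1
        for table in (self.incrementals, self.full):
            for version in [v for v in table if v < first_kept]:
                del table[version]


def replay_on_restart(store):
    """Rebuild the keyserver version the way the commit message describes."""
    keys_ver = 0  # a freshly started monitor has an empty keyserver
    if store.full:
        # Start from the latest full version if one was ever written.
        keys_ver = max(store.full)
    while keys_ver < store.last_committed:
        wanted = keys_ver + 1
        if wanted not in store.incrementals:
            raise MissingVersion("version %d was trimmed" % wanted)
        keys_ver = wanted
    return keys_ver
```

With `encode_full=False` (the behaviour before the fix, for an empty keyserver), 30 commits followed by `trim(keep=10)` make `replay_on_restart` raise, because version 1 no longer exists. With `encode_full=True`, the replay starts from the latest full version and returns 30.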

Revision 5f4ceb20 (diff)
Added by Joao Eduardo Luis over 5 years ago

mon: AuthMonitor: always encode full regardless of keyserver having keys

(cherry picked from commit b551ae2bcea2dd17b37f5f5ab34251cc78de0e26)

History

#1 Updated by shaojun ruan over 5 years ago

Can it be temporarily resolved by this command?
-------------------------------------------------
ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0

#2 Updated by Greg Farnum over 5 years ago

Can you upload the full log of startup with crash?
By "temporarily resolved", do you mean it's working now, or does the issue resurrect itself?

#3 Updated by Joao Eduardo Luis over 5 years ago

  • Status changed from New to 7
  • Assignee set to Joao Eduardo Luis

This issue should only affect users that have been running without cephx and have not ever created a key.

It's due to the AuthMonitor not encoding a full version if there are no keys.

Reproducing this is trivial but takes some time. Reproducing with an unchanged vstart.sh deployment is not straightforward, as vstart will add keys even if cephx is not being used, so the conditions for the bug never arise.

Easy steps to reproduce this:

- manually create 2+ monitors, cephx disabled
- set the following options on ceph.conf:

 mon globalid prealloc = 1
 paxos service trim min = 10
 paxos service trim max = 20

- run 'ceph log foo ; sleep 1' some 50 times (use the loops luke)
- stop mons
- use 'ceph-kvstore-tool path/to/store.db list auth' to see if version 0 still exists; if not continue, else restart mons and go back to the loop part
- restart the mons to trigger the crash
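
The loop step above could be scripted along these lines. This is a minimal sketch, assuming a working `ceph` CLI on the path; the function name `run_log_loop` and its parameters are made up for illustration, and only the `ceph log foo` command, the one-second sleep and the count of 50 come from the steps.

```python
import subprocess
import time


def run_log_loop(iterations=50, command=("ceph", "log", "foo"),
                 delay=1.0, runner=subprocess.run, sleeper=time.sleep):
    """Run `command` `iterations` times, sleeping `delay` seconds after each.

    Returns how many runs exited with status 0. `runner` and `sleeper`
    are injectable so the loop can be exercised without a cluster.
    """
    succeeded = 0
    for _ in range(iterations):
        result = runner(list(command))
        if result.returncode == 0:
            succeeded += 1
        sleeper(delay)
    return succeeded
```

Calling `run_log_loop()` with the defaults issues `ceph log foo` 50 times, one second apart. After that, stop the mons and inspect the store as described in the remaining steps.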

#4 Updated by Joao Eduardo Luis over 5 years ago

  • Status changed from 7 to Fix Under Review

#5 Updated by shaojun ruan over 5 years ago

Greg Farnum wrote:

Can you upload the full log of startup with crash?
By "temporarily resolved", do you mean it's working now, or does the issue resurrect itself?

Yes, it's working and has not crashed again.

#6 Updated by Sage Weil over 5 years ago

  • Status changed from Fix Under Review to Resolved

#7 Updated by wei li over 5 years ago

In our production environment we use 0.83, and we are running into this problem too.
We tried the patch https://github.com/ceph/ceph/pull/2128, rebuilt and redeployed ceph-mon, but it still has the same problem.
I also tried the workaround "ceph-kvstore-tool /var/lib/ceph/mon/store.db set auth last_committed ver 0" and restarted ceph-mon; the result is still the same.

My understanding is that this patch only fixes newly deployed environments; for an environment that already has user data, another workaround seems to be needed.
In my environment there are three mons. mon.0 crashed and cannot start, so only mon.1 and mon.2 are working now. If they crash too, the whole environment will go down.
