Bug #43365 (closed)

Nautilus: Random mon crashes in failed assertion at ceph::time_detail::signedspan

Added by Alex Walender over 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):

6a617f9d477ab8df2d068af0768ff741c68adabcc5c1ecb5dd3e9872d613c943
dacbff55030f3d0837e58d8f4961441b6902d5750b0e1579682df5650c33d44d
ba2236bee8cb7fa239e26b19bf4ec3fcd2245c1a3ea0decd85d6e183c5476f91
a1d05027be7f31340a919c0c79a91d7be71d547503fa54fdf0b74448ccd3b7f7
f7126ff3d59617640b56f9aaaa317a56b8827fc299ba37f1e727b8ac13ce0e1a
cf2864eb1281dffc3340730dc2caae163b4c0170132bcbd3dcbd6147d8f29fa8

Crash signature (v2):

Description

Thanks to the 14.2.5 auto-warning about recent crashes, we are observing frequent (roughly daily) random crashes of our monitors.

```
{
"crash_id": "2019-12-18_10:56:08.666582Z_408b4d27-47e2-43bf-849e-87b62a9c6084",
"timestamp": "2019-12-18 10:56:08.666582Z",
"process_name": "ceph-mon",
"entity_name": "mon.weser",
"ceph_version": "14.2.5",
"utsname_hostname": "weser",
"utsname_sysname": "Linux",
"utsname_release": "5.0.0-27-generic",
"utsname_version": "#28~18.04.1-Ubuntu SMP Thu Aug 22 03:00:32 UTC 2019",
"utsname_machine": "x86_64",
"os_name": "Ubuntu",
"os_id": "ubuntu",
"os_version_id": "18.04",
"os_version": "18.04.3 LTS (Bionic Beaver)",
"assert_condition": "z >= signedspan::zero()",
"assert_func": "ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)",
"assert_file": "/build/ceph-14.2.5/src/common/ceph_time.h",
"assert_line": 485,
"assert_thread_name": "fn_monstore",
"assert_msg": "/build/ceph-14.2.5/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' thread 7fc3198ac700 time 2019-12-18 11:56:08.662665\n/build/ceph-14.2.5/src/common/ceph_time.h: 485: FAILED ceph_assert(z >= signedspan::zero())\n",
"backtrace": [
"(()+0x12890) [0x7fc3284fc890]",
"(gsignal()+0xc7) [0x7fc3275f4e97]",
"(abort()+0x141) [0x7fc3275f6801]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7fc3296a0287]",
"(()+0x28140e) [0x7fc3296a040e]",
"(Paxos::begin(ceph::buffer::v14_2_0::list&)+0xcd2) [0x5571b4b2f412]",
"(Paxos::propose_pending()+0x127) [0x5571b4b305f7]",
"(Paxos::finish_round()+0x50a) [0x5571b4b30e1a]",
"(Paxos::commit_finish()+0x5fc) [0x5571b4b32d6c]",
"(C_Committed::finish(int)+0x34) [0x5571b4b36d54]",
"(Context::complete(int)+0x9) [0x5571b4a6d359]",
"(MonitorDBStore::C_DoTransaction::finish(int)+0x94) [0x5571b4b36ac4]",
"(Context::complete(int)+0x9) [0x5571b4a6d359]",
"(Finisher::finisher_thread_entry()+0x17f) [0x7fc32972b79f]",
"(()+0x76db) [0x7fc3284f16db]",
"(clone()+0x3f) [0x7fc3276d788f]"
]
}
```

Crashes seem to happen at random. I'm not able to provoke one directly.

Since 14.2.5 we have also had issues with the relationship between our mons and mgrs, described here:
https://tracker.ceph.com/issues/43364

I'm not sure whether that issue is related to these monitor crashes.


Related issues 2 (0 open, 2 closed)

Related to Ceph - Bug #44078: centos 8.1: ceph-mon: assertion "z >= signedspan::zero()" failed in ceph::to_timespan( (Duplicate)

Copied to RADOS - Backport #44486: nautilus: Nautilus: Random mon crashes in failed assertion at ceph::time_detail::signedspan (Resolved, Nathan Cutler)
Actions #1

Updated by Greg Farnum over 4 years ago

So it's asserting inside to_timespan, and the Paxos code triggering that assert is:

```
auto start = ceph::coarse_mono_clock::now();
get_store()->apply_transaction(t);
auto end = ceph::coarse_mono_clock::now();

logger->tinc(l_paxos_begin_latency, to_timespan(end - start));
```

It's a bit concerning if that code snippet somehow comes out with "end" being smaller than "start"; preventing exactly that is the purpose of a monotonic clock. Even if the clock is coarse and doesn't tick between the two calls, the values should still be equal...
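
For illustration, here is a minimal standalone sketch (with simplified stand-in types, not the actual definitions from src/common/ceph_time.h) of how converting a negative signed duration trips an assert of exactly this shape:

```
// Minimal sketch; signedspan/timespan are simplified stand-ins for the Ceph
// types, and plain assert() stands in for ceph_assert(). Build without
// -DNDEBUG so the assert fires.
#include <cassert>
#include <chrono>
#include <cstdint>

using signedspan = std::chrono::duration<int64_t, std::nano>;   // signed, like end - start
using timespan   = std::chrono::duration<uint64_t, std::nano>;  // unsigned target type

timespan to_timespan(signedspan z) {
  // Mirrors FAILED ceph_assert(z >= signedspan::zero()) at ceph_time.h:485.
  assert(z >= signedspan::zero());
  return std::chrono::duration_cast<timespan>(z);
}

int main() {
  auto start = std::chrono::steady_clock::now();
  auto end = std::chrono::steady_clock::now();
  to_timespan(end - start);     // fine: a steady clock never runs backwards
  to_timespan(signedspan{-1});  // aborts, like the mon's fn_monstore thread
}
```

Since the latency instrumentation feeds end - start straight into to_timespan(), any apparent backwards step of coarse_mono_clock aborts the monitor.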

Actions #2

Updated by Neha Ojha over 4 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Glen Aidukas over 4 years ago

We are also running into this issue.

Jan 07 19:03:42 pmxc05 ceph-mon3701783: 2020-01-07 19:03:42.625 7fe59c03d700 -1 mon.pmxc05@4(electing) e5 failed to get devid for : fallback method has serial ''but no model
Jan 08 00:00:00 pmxc05 ceph-mon3701783: 2020-01-08 00:00:00.378 7fe5a284a700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw (PID: 4001791) UID: 0
Jan 08 00:00:00 pmxc05 ceph-mon3701783: 2020-01-08 00:00:00.394 7fe5a284a700 -1 received signal: Hangup from (PID: 4001792) UID: 0
Jan 08 04:12:11 pmxc05 ceph-mon3701783: /mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' thread 7fe59c03d700 time 2020-01-08 04:12:11.394002
Jan 08 04:12:11 pmxc05 ceph-mon3701783: /mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h: 485: FAILED ceph_assert(z >= signedspan::zero())
Jan 08 04:12:11 pmxc05 ceph-mon3701783: ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7fe5a60c1e28]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 2: (()+0x282000) [0x7fe5a60c2000]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 3: (Paxos::do_refresh()+0x1a4) [0x55706be4e4b4]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 4: (Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55706be53a62]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 5: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55706be59213]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55706bd8eb1c]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 7: (Monitor::_ms_dispatch(Message*)+0x4aa) [0x55706bd8f10a]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 8: (Monitor::ms_dispatch(Message*)+0x26) [0x55706bdbea36]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 9: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55706bdbaf66]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 10: (DispatchQueue::entry()+0x1a49) [0x7fe5a62fae69]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe5a63a89ed]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 12: (()+0x7fa3) [0x7fe5a4f78fa3]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 13: (clone()+0x3f) [0x7fe5a4b284cf]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: *** Caught signal (Aborted) **
Jan 08 04:12:11 pmxc05 ceph-mon3701783: in thread 7fe59c03d700 thread_name:ms_dispatch
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 2020-01-08 04:12:11.394 7fe59c03d700 -1 /mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' thread 7fe59c03d700 time 2020-01-08 04:12:11.394002
Jan 08 04:12:11 pmxc05 ceph-mon3701783: /mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h: 485: FAILED ceph_assert(z >= signedspan::zero())
Jan 08 04:12:11 pmxc05 ceph-mon3701783: ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7fe5a60c1e28]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 2: (()+0x282000) [0x7fe5a60c2000]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 3: (Paxos::do_refresh()+0x1a4) [0x55706be4e4b4]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 4: (Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55706be53a62]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 5: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55706be59213]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55706bd8eb1c]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 7: (Monitor::_ms_dispatch(Message*)+0x4aa) [0x55706bd8f10a]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 8: (Monitor::ms_dispatch(Message*)+0x26) [0x55706bdbea36]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 9: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55706bdbaf66]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 10: (DispatchQueue::entry()+0x1a49) [0x7fe5a62fae69]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe5a63a89ed]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 12: (()+0x7fa3) [0x7fe5a4f78fa3]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 13: (clone()+0x3f) [0x7fe5a4b284cf]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 1: (()+0x12730) [0x7fe5a4f83730]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 2: (gsignal()+0x10b) [0x7fe5a4a667bb]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 3: (abort()+0x121) [0x7fe5a4a51535]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7fe5a60c1e79]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 5: (()+0x282000) [0x7fe5a60c2000]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 6: (Paxos::do_refresh()+0x1a4) [0x55706be4e4b4]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 7: (Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55706be53a62]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 8: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55706be59213]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55706bd8eb1c]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 10: (Monitor::_ms_dispatch(Message*)+0x4aa) [0x55706bd8f10a]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 11: (Monitor::ms_dispatch(Message*)+0x26) [0x55706bdbea36]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 12: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55706bdbaf66]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 13: (DispatchQueue::entry()+0x1a49) [0x7fe5a62fae69]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe5a63a89ed]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 15: (()+0x7fa3) [0x7fe5a4f78fa3]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 16: (clone()+0x3f) [0x7fe5a4b284cf]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 2020-01-08 04:12:11.394 7fe59c03d700 -1 *** Caught signal (Aborted) **
Jan 08 04:12:11 pmxc05 ceph-mon3701783: in thread 7fe59c03d700 thread_name:ms_dispatch
Jan 08 04:12:11 pmxc05 ceph-mon3701783: ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 1: (()+0x12730) [0x7fe5a4f83730]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 2: (gsignal()+0x10b) [0x7fe5a4a667bb]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 3: (abort()+0x121) [0x7fe5a4a51535]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7fe5a60c1e79]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 5: (()+0x282000) [0x7fe5a60c2000]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 6: (Paxos::do_refresh()+0x1a4) [0x55706be4e4b4]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 7: (Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55706be53a62]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 8: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55706be59213]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55706bd8eb1c]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 10: (Monitor::_ms_dispatch(Message*)+0x4aa) [0x55706bd8f10a]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 11: (Monitor::ms_dispatch(Message*)+0x26) [0x55706bdbea36]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 12: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55706bdbaf66]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 13: (DispatchQueue::entry()+0x1a49) [0x7fe5a62fae69]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe5a63a89ed]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 15: (()+0x7fa3) [0x7fe5a4f78fa3]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 16: (clone()+0x3f) [0x7fe5a4b284cf]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 08 04:12:11 pmxc05 ceph-mon3701783: -9999> 2020-01-08 04:12:11.394 7fe59c03d700 -1 /mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' thread 7fe59c03d700 time 2020-01-08 04:12:1
Jan 08 04:12:11 pmxc05 ceph-mon3701783: /mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h: 485: FAILED ceph_assert(z >= signedspan::zero())
Jan 08 04:12:11 pmxc05 ceph-mon3701783: ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7fe5a60c1e28]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 2: (()+0x282000) [0x7fe5a60c2000]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 3: (Paxos::do_refresh()+0x1a4) [0x55706be4e4b4]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 4: (Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55706be53a62]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 5: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55706be59213]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55706bd8eb1c]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 7: (Monitor::_ms_dispatch(Message*)+0x4aa) [0x55706bd8f10a]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 8: (Monitor::ms_dispatch(Message*)+0x26) [0x55706bdbea36]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 9: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55706bdbaf66]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 10: (DispatchQueue::entry()+0x1a49) [0x7fe5a62fae69]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe5a63a89ed]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 12: (()+0x7fa3) [0x7fe5a4f78fa3]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 13: (clone()+0x3f) [0x7fe5a4b284cf]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: -9998> 2020-01-08 04:12:11.394 7fe59c03d700 -1 *** Caught signal (Aborted) **
Jan 08 04:12:11 pmxc05 ceph-mon3701783: in thread 7fe59c03d700 thread_name:ms_dispatch
Jan 08 04:12:11 pmxc05 ceph-mon3701783: ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 1: (()+0x12730) [0x7fe5a4f83730]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 2: (gsignal()+0x10b) [0x7fe5a4a667bb]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 3: (abort()+0x121) [0x7fe5a4a51535]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7fe5a60c1e79]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 5: (()+0x282000) [0x7fe5a60c2000]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 6: (Paxos::do_refresh()+0x1a4) [0x55706be4e4b4]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 7: (Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55706be53a62]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 8: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55706be59213]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55706bd8eb1c]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 10: (Monitor::_ms_dispatch(Message*)+0x4aa) [0x55706bd8f10a]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 11: (Monitor::ms_dispatch(Message*)+0x26) [0x55706bdbea36]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 12: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55706bdbaf66]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 13: (DispatchQueue::entry()+0x1a49) [0x7fe5a62fae69]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe5a63a89ed]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 15: (()+0x7fa3) [0x7fe5a4f78fa3]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 16: (clone()+0x3f) [0x7fe5a4b284cf]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 08 04:12:11 pmxc05 ceph-mon3701783: -1> 2020-01-08 04:12:11.394 7fe59c03d700 -1 /mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' thread 7fe59c03d700 time 2020-01-08 04:12:1
Jan 08 04:12:11 pmxc05 ceph-mon3701783: /mnt/npool/tlamprecht/pve-ceph/ceph-14.2.5/src/common/ceph_time.h: 485: FAILED ceph_assert(z >= signedspan::zero())
Jan 08 04:12:11 pmxc05 ceph-mon3701783: ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7fe5a60c1e28]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 2: (()+0x282000) [0x7fe5a60c2000]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 3: (Paxos::do_refresh()+0x1a4) [0x55706be4e4b4]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 4: (Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55706be53a62]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 5: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55706be59213]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55706bd8eb1c]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 7: (Monitor::_ms_dispatch(Message*)+0x4aa) [0x55706bd8f10a]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 8: (Monitor::ms_dispatch(Message*)+0x26) [0x55706bdbea36]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 9: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55706bdbaf66]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 10: (DispatchQueue::entry()+0x1a49) [0x7fe5a62fae69]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe5a63a89ed]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 12: (()+0x7fa3) [0x7fe5a4f78fa3]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 13: (clone()+0x3f) [0x7fe5a4b284cf]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 0> 2020-01-08 04:12:11.394 7fe59c03d700 -1 *** Caught signal (Aborted) **
Jan 08 04:12:11 pmxc05 ceph-mon3701783: in thread 7fe59c03d700 thread_name:ms_dispatch
Jan 08 04:12:11 pmxc05 ceph-mon3701783: ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable)
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 1: (()+0x12730) [0x7fe5a4f83730]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 2: (gsignal()+0x10b) [0x7fe5a4a667bb]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 3: (abort()+0x121) [0x7fe5a4a51535]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7fe5a60c1e79]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 5: (()+0x282000) [0x7fe5a60c2000]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 6: (Paxos::do_refresh()+0x1a4) [0x55706be4e4b4]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 7: (Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55706be53a62]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 8: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55706be59213]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55706bd8eb1c]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 10: (Monitor::_ms_dispatch(Message*)+0x4aa) [0x55706bd8f10a]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 11: (Monitor::ms_dispatch(Message*)+0x26) [0x55706bdbea36]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 12: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55706bdbaf66]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 13: (DispatchQueue::entry()+0x1a49) [0x7fe5a62fae69]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe5a63a89ed]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 15: (()+0x7fa3) [0x7fe5a4f78fa3]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: 16: (clone()+0x3f) [0x7fe5a4b284cf]
Jan 08 04:12:11 pmxc05 ceph-mon3701783: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 08 04:12:11 pmxc05 systemd1: : Main process exited, code=killed, status=6/ABRT
Jan 08 04:12:11 pmxc05 systemd1: : Failed with result 'signal'.
Jan 08 04:12:21 pmxc05 systemd1: : Service RestartSec=10s expired, scheduling restart.
Jan 08 04:12:21 pmxc05 systemd1: : Scheduled restart job, restart counter is at 2.
Jan 08 04:12:21 pmxc05 systemd1: Stopped Ceph cluster monitor daemon.
Jan 08 04:12:21 pmxc05 systemd1: Started Ceph cluster monitor daemon.

Actions #4

Updated by Greg Farnum over 4 years ago

Hmm, what hardware are you guys running this on?

We consulted with our C++ time guru and the current theory is a bug somewhere between the C library and the hardware. o_0

Actions #5

Updated by Alex Walender over 4 years ago

Hello Greg,

I have dug a little deeper into the correlation between the appearance of the crashes and the state of our machines.
First of all, a few specs:

Supermicro X11DPi-N(T)/X11DPi-N, BIOS 2.0 11/28/2017
microcode: microcode updated early to revision 0x200005e, date = 2019-04-02
Linux version 5.0.0-27-generic (buildd@lgw01-amd64-031) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #28~18.04.1-Ubuntu SMP Thu Aug 22 03:00:32 UTC 2019 (Ubuntu 5.0.0-27.28~18.04.1-generic 5.0.21)
Ubuntu 18.04
Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz

I suspect the 5.0.0-27 kernel.
In our history, we had massive problems with our Intel XL710 40G cards, which were due to outdated drivers in the Ubuntu 18.04 LTS kernel (4.15.x). As a solution, we moved to the linux-hwe 5.0.0 kernel, which solved our driver problem.

I checked when we upgraded the kernels on our machines, and the upgrade times match the appearance of the random mon crashes. I think the kernel is the culprit, since we didn't have the crashes before the upgrade.

Actions #6

Updated by Andreas Grip about 4 years ago

I also have this problem

I’m running on:

Supermicro X11DPU
Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz
10GBase-T ports via AOC-UR-i4XT (Intel® X540 10GBase-T)
Proxmox Linux 5.3.13-1-pve #1 SMP PVE 5.3.13-1

Actions #7

Updated by Zoltan Fodor about 4 years ago

We are also seeing this problem:

```
{
"os_version_id": "10",
"assert_condition": "z >= signedspan::zero()",
"utsname_release": "5.3.13-1-pve",
"os_name": "Debian GNU/Linux 10 (buster)",
"entity_name": "mon.node2",
"assert_file": "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h",
"timestamp": "2020-01-26 01:37:20.520521Z",
"process_name": "ceph-mon",
"utsname_machine": "x86_64",
"assert_line": 485,
"utsname_sysname": "Linux",
"os_version": "10 (buster)",
"os_id": "10",
"assert_thread_name": "ms_dispatch",
"utsname_version": "#1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100)",
"backtrace": [
"(()+0x12730) [0x7f2d1e8ee730]",
"(gsignal()+0x10b) [0x7f2d1e3d17bb]",
"(abort()+0x121) [0x7f2d1e3bc535]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7f2d1fa2fe79]",
"(()+0x282000) [0x7f2d1fa30000]",
"(Paxos::do_refresh()+0x1a4) [0x55b60af7d4b4]",
"(Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2f2) [0x55b60af82a62]",
"(Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223) [0x55b60af88213]",
"(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c) [0x55b60aebdb1c]",
"(Monitor::_ms_dispatch(Message*)+0x4aa) [0x55b60aebe10a]",
"(Monitor::ms_dispatch(Message*)+0x26) [0x55b60aeeda36]",
"(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x55b60aee9f66]",
"(DispatchQueue::entry()+0x1a49) [0x7f2d1fc68e69]",
"(DispatchQueue::DispatchThread::entry()+0xd) [0x7f2d1fd169ed]",
"(()+0x7fa3) [0x7f2d1e8e3fa3]",
"(clone()+0x3f) [0x7f2d1e4934cf]"
],
"utsname_hostname": "node2",
"assert_msg": "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' thread 7f2d159a5700 time 2020-01-26 02:37:20.516630\n/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h: 485: FAILED ceph_assert(z >= signedspan::zero())\n",
"crash_id": "2020-01-26_01:37:20.520521Z_9fcc4125-6247-4902-93bc-20837c781eb2",
"assert_func": "ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)",
"ceph_version": "14.2.6"
}
```

Actions #8

Updated by Alex Walender about 4 years ago

Just an update on my side:

After upgrading our monitors' Ubuntu 18.04 packages (apt-get upgrade) to the 5.3.0-26-generic kernel, the crashes seem to have stopped (the mons have now run stable for 8 days).
Knock on wood.

Actions #9

Updated by Sage Weil about 4 years ago

  • Priority changed from High to Urgent
  • Crash signature (v1) updated (diff)
Actions #10

Updated by Sage Weil about 4 years ago

  • Crash signature (v1) updated (diff)
Actions #11

Updated by Sage Weil about 4 years ago

  • Status changed from New to Closed

FWIW the two clusters reporting this crash via telemetry are both ubuntu 18.04

closing this as not a ceph issue; let's reopen if it comes up in another context!

Actions #12

Updated by Radoslaw Zarzynski about 4 years ago

Saw something similar, but on CentOS 8: https://tracker.ceph.com/issues/44078.
Marking as related for now; it may turn out to be a duplicate.

Actions #13

Updated by Radoslaw Zarzynski about 4 years ago

  • Related to Bug #44078: centos 8.1: ceph-mon: assertion "z >= signedspan::zero()" failed in ceph::to_timespan( added
Actions #14

Updated by Alex Walender about 4 years ago

I'd like to reopen this, since there are now reports of crashes on CentOS (see the possible duplicate linked to this issue). Also, after two weeks on the newest HWE kernel, we once again saw a crash.

Actions #15

Updated by Sage Weil about 4 years ago

  • Crash signature (v1) updated (diff)
Actions #16

Updated by Sage Weil about 4 years ago

  • Crash signature (v1) updated (diff)
Actions #17

Updated by Sage Weil about 4 years ago

  • Crash signature (v1) updated (diff)
Actions #18

Updated by Brad Hubbard about 4 years ago

  • Crash signature (v1) updated (diff)

Adding crash signature (cf2864eb1281dffc3340730dc2caae163b4c0170132bcbd3dcbd6147d8f29fa8) for the crash described in comment 7

Actions #19

Updated by Brad Hubbard about 4 years ago

Likely related.

assert_msg:  /build/ceph-14.2.7/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)': FAILED ceph_assert(z >= signedspan::zero())
stack:
         ()
        gsignal()
        abort()
        ceph::__ceph_assert_fail(char const*, char const*, int, char const*)
        ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)
        Paxos::do_refresh()
        Paxos::commit_finish()
        C_Committed::finish(int)
        Context::complete(int)
        MonitorDBStore::C_DoTransaction::finish(int)
        Context::complete(int)
        Finisher::finisher_thread_entry()
        ()
        clone()
Actions #20

Updated by Edwin Pers about 4 years ago

We're seeing this a couple of times a day on Debian 10.1, using croit's repo:
kernel 4.19.67-2+deb10u1
ceph version 14.2.4 (ad875bdeab69b822458d91016d5b843be4a33550)

root@hostname.fqdn [/root] $ while true; do ceph health|grep -q HEALTH_OK || (date; ceph health detail); sleep 2; done

Wed 26 Feb 2020 06:41:27 AM UTC
HEALTH_OK
Wed 26 Feb 2020 08:02:24 AM UTC
HEALTH_WARN 1/3 mons down, quorum mon1,mon3
MON_DOWN 1/3 mons down, quorum mon1,mon3
    mon.mon2 (rank 1) addr [v2:10.15.30.6:3300/0,v1:10.15.30.6:6789/0] is down (out of quorum)
Wed 26 Feb 2020 08:48:59 AM UTC
HEALTH_WARN 1/3 mons down, quorum mon2,mon3
MON_DOWN 1/3 mons down, quorum mon2,mon3
    mon.mon1 (rank 0) addr [v2:10.15.30.5:3300/0,v1:10.15.30.5:6789/0] is down (out of quorum)
Wed 26 Feb 2020 08:49:02 AM UTC
HEALTH_WARN 1/3 mons down, quorum mon2,mon3
MON_DOWN 1/3 mons down, quorum mon2,mon3
    mon.mon1 (rank 0) addr [v2:10.15.30.5:3300/0,v1:10.15.30.5:6789/0] is down (out of quorum)
Wed 26 Feb 2020 10:43:55 AM UTC
HEALTH_WARN 1/3 mons down, quorum mon2,mon3
MON_DOWN 1/3 mons down, quorum mon2,mon3
    mon.mon1 (rank 0) addr [v2:10.15.30.5:3300/0,v1:10.15.30.5:6789/0] is down (out of quorum)
Wed 26 Feb 2020 10:43:58 AM UTC
HEALTH_OK
Wed 26 Feb 2020 01:11:34 PM UTC
HEALTH_WARN 1/3 mons down, quorum mon1,mon2
MON_DOWN 1/3 mons down, quorum mon1,mon2
    mon.mon3 (rank 2) addr [v2:10.15.30.7:3300/0,v1:10.15.30.7:6789/0] is down (out of quorum)

The hardware in question is a Dell PowerEdge R210 with an Intel Xeon X3450 and these NICs:

root@hostname.fqdn [/root] $ lspci | grep net
01:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
02:00.0 Ethernet controller: Broadcom Limited NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Limited NetXtreme II BCM5716 Gigabit Ethernet (rev 20)

dmidecode, /proc/cpuinfo, and lspci output:
https://gist.github.com/epers/3524bcd4e9afc0c500b32f5242331e80

I believe we've seen this since the initial deployment with 14.2.2, but we didn't investigate it at the time, so I can't confirm it's the same thing. We did have the same symptoms back then, though: mons randomly dying every now and then.

root@hostname.fqdn [/root] $ journalctl -exu ceph-mon@mon2.service --no-pager | grep -e "Feb 26 08:02" -e "Feb 26 08:03" 
Feb 26 08:02:09 mon2 ceph-mon[3616584]: /tmp/release/Debian/WORKDIR/ceph-14.2.4/src/common/ceph_time.h: 
  In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' 
  thread 7f638275f700 time 2020-02-26 08:02:09.248625
-- snip --

Full stacktrace:
https://gist.github.com/epers/0440e34e9b16ba5a69a1222bd1653a92

We plan to update to 14.2.7 "soon" (possibly this weekend) so we'll see if it persists.

Actions #21

Updated by David DELON about 4 years ago

Hi,

Same behaviour for us: one of the 3 mons crashes randomly, nearly once per day.
We are using Ceph 14.2.6 PVE under Debian 10.3:

Kernel 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux
Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz

Actions #22

Updated by Sage Weil about 4 years ago

  • Status changed from Closed to Fix Under Review
  • Backport set to nautilus
  • Pull request ID set to 33699
Actions #23

Updated by Kefu Chai about 4 years ago

2020-03-04T02:48:43.792 INFO:teuthology.orchestra.run.smithi187.stdout:Package kernel-4.18.0-147.el8.x86_64 is already installed.
....

    -5> 2020-03-04T03:15:00.077+0000 7f7e49693700 10 mon.c@2(peon).paxos(paxos updating c 1507..2185) handle_commit on 2186
    -4> 2020-03-04T03:15:00.077+0000 7f7e49693700 10 mon.c@2(peon).paxos(paxos updating c 1507..2186) store_state [2186..2186]
    -3> 2020-03-04T03:15:00.078+0000 7f7e49693700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1655-g4f16a0b/rpm/el8/BUILD/ceph-15.1.0-1655-g4f16a0b/src/common/ceph_time.h: In function 'ceph::time_detail::timespan ceph::to_timespan(ceph::time_detail::signedspan)' thread 7f7e49693700 time 2020-03-04T03:15:00.077838+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/huge/release/15.1.0-1655-g4f16a0b/rpm/el8/BUILD/ceph-15.1.0-1655-g4f16a0b/src/common/ceph_time.h: 485: FAILED ceph_assert(z >= signedspan::zero())

/a/kchai-2020-03-03_13:02:55-rados-wip-kefu-testing-2020-03-03-1622-distro-basic-smithi/4820605/remote/*/log/ceph-mon.c*

Actions #24

Updated by Kefu Chai about 4 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #25

Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #44486: nautilus: Nautilus: Random mon crashes in failed assertion at ceph::time_detail::signedspan added
Actions #26

Updated by Edwin Pers almost 4 years ago

For what it's worth, we're still seeing it after upgrading Debian to 10.3 and installing kernel "5.4.0-0.bpo.3-amd64 #1 SMP Debian 5.4.13-1~bpo10+1 (2020-02-07)" from buster-backports.

Actions #27

Updated by Nathan Cutler almost 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions #28

Updated by Anonymous almost 4 years ago

We experienced this bug as well, so I investigated it myself, though I'm no timekeeping, Ceph, or C++ expert, just an interested user/systems engineer:

```
auto start = ceph::coarse_mono_clock::now();
get_store()->apply_transaction(t);
auto end = ceph::coarse_mono_clock::now();
```

I tried to find the definition for:

ceph::coarse_mono_clock::now();

and finally landed here (is this the correct code path?):

https://github.com/ceph/ceph/blob/25ac1528419371686740412616145703810a561f/src/common/ceph_time.h#L279

Here the "CLOCK_MONOTONIC_COARSE" gets set:

```
#if defined(CLOCK_MONOTONIC_COARSE)
// Linux systems have _COARSE clocks.
clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);
```

Next, I looked up the documentation for clock_gettime:
https://linux.die.net/man/2/clock_gettime
We find:

CLOCK_MONOTONIC_COARSE (since Linux 2.6.32; Linux-specific)
A faster but less precise version of CLOCK_MONOTONIC. Use when you need very fast, but not fine-grained timestamps.

What is, AFAIK, important is that we also look at the definition of CLOCK_MONOTONIC, because CLOCK_MONOTONIC_COARSE builds on it:
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since some unspecified starting point. This clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP.

As we can see, "CLOCK_MONOTONIC_COARSE" is in fact not, as the name suggests, always strictly monotonic; it can be influenced by adjtime and NTP.
So this bug might arise from NTP or other time-setting side effects.

Maybe this should be used instead (but possible performance impacts should be measured and considered):

CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments or the incremental adjustments performed by adjtime(3).
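
To compare the two clock sources quickly, here is a small test program I'd sketch (illustrative only, not Ceph code) that samples both via clock_gettime(2):

```
// Illustrative only, not Ceph code: sample the coarse and raw monotonic
// clocks side by side. CLOCK_MONOTONIC_COARSE is fast but low-resolution
// (clock_getres reports its granularity, typically a few milliseconds),
// while CLOCK_MONOTONIC_RAW is exempt from NTP/adjtime rate adjustments.
#include <cstdio>
#include <ctime>

int main() {
  timespec coarse{}, raw{}, res{};
  clock_gettime(CLOCK_MONOTONIC_COARSE, &coarse);
  clock_gettime(CLOCK_MONOTONIC_RAW, &raw);
  clock_getres(CLOCK_MONOTONIC_COARSE, &res);

  printf("coarse: %lld.%09ld s\n", (long long)coarse.tv_sec, coarse.tv_nsec);
  printf("raw:    %lld.%09ld s\n", (long long)raw.tv_sec, raw.tv_nsec);
  printf("coarse resolution: %ld ns\n", res.tv_nsec);
  return 0;
}
```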

But I guess (again, not an expert) that modern time-sync daemons like chrony also set the RTC/hwclock by default, at least on Ubuntu 18.04 Server, so this might still not suffice.

Also notice that this specific call gets used in many more places in the Ceph codebase [1]. I don't know what impact it might have there.

[1]: https://github.com/ceph/ceph/search?q=ceph%3A%3Acoarse_mono_clock%3A%3Anow%28%29&unscoped_q=ceph%3A%3Acoarse_mono_clock%3A%3Anow%28%29
