Backport #14236

"OSDMonitor.cc: 2116: FAILED assert(0)" in rados-hammer-distro-basic-openstack

Added by Yuri Weinstein almost 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Urgent
Target version:
Release:
hammer

Associated revisions

Revision 5264bc67 (diff)
Added by Joao Eduardo Luis almost 7 years ago

mon: OSDMonitor: do not assume a session exists in send_incremental()

We may not have an open session for a given osd. If we blindly assume we
do, we may end up trying to send incrementals we do not have to the osd.

And then we will crash.

This fixes a regression introduced by

171fee1b82d2675e364da7f96dfb9dd286d9b6e6

which is meant as a backport of

de43a02e06650a552f048dc8acd17f255126fed9

but happens to introduce a line that wasn't in the original patch. We
imagine it was meant to make the 's->osd_epoch' assignment work without
checking the session, as per the original patch, but the backporter must
have forgotten to also backport the assertion on the not-null session.
The unfortunate introduction of the check for a not-null session
triggered this regression.

The regression itself is due to enforcing that a session exists for the
osd we are sending the incrementals to. However, if we come via the
OSDMonitor::process_failures() path, that may very well not be the case,
as we are handling potentially-old MOSDFailure messages that may no
longer have an associated session. By enforcing the not-null session, we
don't check whether we have the requested versions (i.e., if
our_earliest_version <= requested_version), and thus we end up on the
path that assumes that we DO HAVE all the necessary versions -- when we
may not, thus finally asserting because we are reading blank
incremental versions.

Fixes: #14236

Signed-off-by: Joao Eduardo Luis <>
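
The failure mode described in the commit message above can be illustrated with a toy model. This is only a sketch, not the Ceph code: `ToyMonitor`, `Session`, `send_incremental_regressed()` and `send_incremental_fixed()` are hypothetical names, and `build_incremental()` here returns false where the real monitor would hit assert(0). It assumes the regression amounts to skipping the "do we still have the requested versions" check whenever the session pointer is null.

```cpp
#include <cstdint>

using epoch_t = std::uint32_t;

// Hypothetical stand-in for a monitor session (not Ceph's MonSession).
struct Session {
  epoch_t osd_epoch = 0;
};

// Toy monitor: incrementals exist only for [first_committed, last_committed].
struct ToyMonitor {
  epoch_t first_committed;
  epoch_t last_committed;

  bool has_incremental(epoch_t e) const {
    return e >= first_committed && e <= last_committed;
  }

  // Returns false where the real code would assert on a blank incremental.
  bool build_incremental(epoch_t from, epoch_t to) const {
    for (epoch_t e = from; e <= to; ++e) {
      if (!has_incremental(e)) {
        return false;
      }
    }
    return true;
  }

  // Assumed shape of the regression: the version check is tied to the
  // session being non-null, so a null session skips it entirely.
  bool send_incremental_regressed(Session* s, epoch_t first) const {
    if (s && first < first_committed) {
      first = first_committed;  // start from the oldest map we still have
      s->osd_epoch = first;
    }
    return build_incremental(first, last_committed);
  }

  // Assumed shape of the fix: always clamp to the oldest available version,
  // and touch the session only if one exists.
  bool send_incremental_fixed(Session* s, epoch_t first) const {
    if (first < first_committed) {
      first = first_committed;
      if (s) {
        s->osd_epoch = first;
      }
    }
    return build_incremental(first, last_committed);
  }
};
```

With a monitor holding epochs 100 through 120 and a request for epoch 50 that arrives without a session (as may happen via process_failures()), the regressed variant tries to read trimmed incrementals and fails, while the fixed variant clamps to epoch 100 and succeeds.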

History

#1 Updated by Samuel Just almost 7 years ago

  • Assignee set to Samuel Just
  • Priority changed from Normal to Urgent

#2 Updated by Samuel Just almost 7 years ago

  • Assignee changed from Samuel Just to Joao Eduardo Luis

#3 Updated by Joao Eduardo Luis almost 7 years ago

  • Category set to Monitor
  • Status changed from New to Fix Under Review
  • Target version set to v0.94.6
  • Affected Versions v0.94.6 added

Got a candidate fix in https://github.com/ceph/ceph/pull/7150

It needs review and testing.

#4 Updated by Joao Eduardo Luis almost 7 years ago

  • Regression changed from No to Yes

#5 Updated by Loïc Dachary almost 7 years ago

  • Tracker changed from Bug to Backport
  • Description updated (diff)
  • Target version deleted (v0.94.6)

original description

Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-01-04_21:00:02-rados-hammer-distro-basic-openstack/
Job: 59165
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-01-04_21:00:02-rados-hammer-distro-basic-openstack/59165/teuthology.log

2016-01-05T00:44:04.497 INFO:tasks.ceph.osd.5.target084154.stderr:2016-01-05 00:44:04.397876 7f2fdb652700 -1 osd.5 336 heartbeat_check: no reply from osd.1 since back 2016-01-05 00:43:39.522911 front 2016-01-05 00:43:39.522911 (cutoff 2016-01-05 00:43:44.397873)
2016-01-05T00:44:04.583 INFO:tasks.ceph.mon.b.target084154.stderr:mon/OSDMonitor.cc: In function 'MOSDMap* OSDMonitor::build_incremental(epoch_t, epoch_t)' thread 7f76ae096700 time 2016-01-05 00:44:04.479479
2016-01-05T00:44:04.583 INFO:tasks.ceph.mon.b.target084154.stderr:mon/OSDMonitor.cc: 2116: FAILED assert(0)
2016-01-05T00:44:04.592 INFO:tasks.ceph.mon.b.target084154.stderr: ceph version 0.94.5-178-g9739d4d (9739d4de49f8167866eda556b2f1581c068ec8a7)
2016-01-05T00:44:04.593 INFO:tasks.ceph.mon.b.target084154.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7dffeb]
2016-01-05T00:44:04.593 INFO:tasks.ceph.mon.b.target084154.stderr: 2: (OSDMonitor::build_incremental(unsigned int, unsigned int)+0x97e) [0x61cc4e]
2016-01-05T00:44:04.593 INFO:tasks.ceph.mon.b.target084154.stderr: 3: (OSDMonitor::send_incremental(PaxosServiceMessage*, unsigned int)+0x54c) [0x61d74c]
2016-01-05T00:44:04.593 INFO:tasks.ceph.mon.b.target084154.stderr: 4: (OSDMonitor::send_latest(PaxosServiceMessage*, unsigned int)+0x81) [0x61e951]
2016-01-05T00:44:04.593 INFO:tasks.ceph.mon.b.target084154.stderr: 5: (OSDMonitor::process_failures()+0x1ea) [0x61ecaa]
2016-01-05T00:44:04.594 INFO:tasks.ceph.mon.b.target084154.stderr: 6: (OSDMonitor::update_from_paxos(bool*)+0x12c4) [0x623c74]
2016-01-05T00:44:04.594 INFO:tasks.ceph.mon.b.target084154.stderr: 7: (PaxosService::refresh(bool*)+0x19a) [0x60476a]
2016-01-05T00:44:04.594 INFO:tasks.ceph.mon.b.target084154.stderr: 8: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b079b]
2016-01-05T00:44:04.594 INFO:tasks.ceph.mon.b.target084154.stderr: 9: (Paxos::do_refresh()+0x2e) [0x5eeece]
2016-01-05T00:44:04.594 INFO:tasks.ceph.mon.b.target084154.stderr: 10: (Paxos::commit_finish()+0x569) [0x5fc359]
2016-01-05T00:44:04.594 INFO:tasks.ceph.mon.b.target084154.stderr: 11: (C_Committed::finish(int)+0x2b) [0x6007cb]
2016-01-05T00:44:04.595 INFO:tasks.ceph.mon.b.target084154.stderr: 12: (Context::complete(int)+0x9) [0x5d51b9]
2016-01-05T00:44:04.595 INFO:tasks.ceph.mon.b.target084154.stderr: 13: (MonitorDBStore::C_DoTransaction::finish(int)+0x8c) [0x5ff8fc]
2016-01-05T00:44:04.595 INFO:tasks.ceph.mon.b.target084154.stderr: 14: (Context::complete(int)+0x9) [0x5d51b9]
2016-01-05T00:44:04.595 INFO:tasks.ceph.mon.b.target084154.stderr: 15: (Finisher::finisher_thread_entry()+0x158) [0x7172a8]
2016-01-05T00:44:04.596 INFO:tasks.ceph.mon.b.target084154.stderr: 16: (()+0x8182) [0x7f76b3194182]
2016-01-05T00:44:04.596 INFO:tasks.ceph.mon.b.target084154.stderr: 17: (clone()+0x6d) [0x7f76b16ff47d]
2016-01-05T00:44:04.596 INFO:tasks.ceph.mon.b.target084154.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#6 Updated by Loïc Dachary almost 7 years ago

Note: the rationale for this backport is to fix a regression introduced by a previous backport (details can be found in the commit message).

#7 Updated by Yuri Weinstein almost 7 years ago

  • Related to Bug #14306: "RadosModel.h: 854: FAILED assert(0)" in rados-hammer-distro-basic-openstack added

#8 Updated by Samuel Just almost 7 years ago

  • Related to deleted (Bug #14306: "RadosModel.h: 854: FAILED assert(0)" in rados-hammer-distro-basic-openstack)

#9 Updated by Nathan Cutler almost 7 years ago

So it's not a conventional backport, but rather a hammer-specific fix.

#10 Updated by Loïc Dachary almost 7 years ago

  • Status changed from Fix Under Review to In Progress

#11 Updated by Loïc Dachary almost 7 years ago

  • Status changed from In Progress to Resolved
  • Target version set to v0.94.6
