Bug #16639

closed

flush_pg_stats crashing in hammer integration tests

Added by Nathan Cutler almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During a recent hammer integration run, we saw "ceph tell osd.? flush_pg_stats" crash the OSD with "ceph-osd: /usr/include/boost/smart_ptr/intrusive_ptr.hpp:162: T* boost::intrusive_ptr<T>::operator->() const [with T = Connection]: Assertion `px != 0' failed."

For example, it happened during the osd_backfill task, after the osd is killed and marked down, at this line:

https://github.com/ceph/ceph-qa-suite/blob/master/tasks/osd_backfill.py#L95

Link to the logs: http://pulpito.ceph.com/smithfarm-2016-07-02_01:06:27-rados-hammer-backports---basic-smithi/287997

Actions #1

Updated by Nathan Cutler almost 8 years ago

  • Description updated (diff)
Actions #2

Updated by Nathan Cutler almost 8 years ago

  • Subject changed from flush_pg_stats crashing with ENXIO in hammer integration tests to flush_pg_stats crashing in hammer integration tests
  • Description updated (diff)
Actions #3

Updated by Nathan Cutler almost 8 years ago

Pushed a copy of the hammer-backports branch that exhibits this bug as "hammer-backports-20160708".

Actions #4

Updated by Samuel Just almost 8 years ago

5793b13492feecad399451e3a83836722b6e9abc might be related

Actions #5

Updated by Nathan Cutler almost 8 years ago

Scheduled full rados suite on "hammer" branch as a baseline:
http://pulpito.ceph.com/smithfarm-2016-07-08_14:34:18-rados-hammer---basic-smithi/

Actions #6

Updated by Kefu Chai almost 8 years ago

  • Assignee set to Kefu Chai
Actions #7

Updated by Nathan Cutler almost 8 years ago

The intrusive_ptr failures are reproducible: http://pulpito.ceph.com/smithfarm-2016-07-08_14:05:08-rados-hammer-backports---basic-smithi/ (rescheduled failed and dead jobs from the first run)

Actions #8

Updated by Nathan Cutler almost 8 years ago

The intrusive_ptr assertions do not appear in the hammer baseline run, so it's safe to assume they were introduced somewhere in the backports.

Actions #10

Updated by Kefu Chai almost 8 years ago

We should drop 5793b13492feecad399451e3a83836722b6e9abc; see https://github.com/ceph/ceph/blob/hammer/src/osd/PG.cc#L6797

boost::statechart::result PG::RecoveryState::Stray::react(const MLogRec& logevt)
//...
  if (msg->info.last_backfill == hobject_t()) {
    if (!(msg->get_connection()->get_features() & CEPH_FEATURE_OSD_MIN_SIZE_RECOVERY)) {
      dout(10) << "Got logevt resetting backfill from peer featuring bug" 
           << " 10780, setting msg->info.last_epoch_started to logevt.query_epoch," 
           << " which is the activation epoch." << dendl;
      msg->info.last_epoch_started = msg->get_query_epoch();
    }

Here we dereference msg->connection to work around a known bug, so a message with no connection attached trips the intrusive_ptr assertion.
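
To illustrate the failure mode only, here is a minimal sketch of a feature check that tolerates a missing connection. It is not the fix applied for this issue (that was dropping the commit) and it is not Ceph code: Connection, Message, FEATURE_MIN_SIZE_RECOVERY and peer_lacks_feature are hypothetical stand-ins, std::shared_ptr replaces boost::intrusive_ptr to keep the example self-contained, and treating "no connection" as "do not apply the workaround" is an assumed policy.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a messenger connection (not Ceph's Connection).
struct Connection {
  std::uint64_t features;
  std::uint64_t get_features() const { return features; }
};

// Hypothetical stand-in for a message whose connection may be unset.
struct Message {
  std::shared_ptr<Connection> connection;
  const std::shared_ptr<Connection>& get_connection() const { return connection; }
};

// Hypothetical feature bit, standing in for CEPH_FEATURE_OSD_MIN_SIZE_RECOVERY.
constexpr std::uint64_t FEATURE_MIN_SIZE_RECOVERY = 1ULL << 0;

// Returns true only when a connection is present and its peer does not
// advertise the feature. A null connection is never dereferenced.
bool peer_lacks_feature(const Message& msg, std::uint64_t feature) {
  const auto& con = msg.get_connection();
  if (!con) {
    return false;  // assumed policy: without a connection, skip the workaround
  }
  return (con->get_features() & feature) == 0;
}

// Small self-check showing both paths.
void example() {
  Message from_old_peer{std::make_shared<Connection>(Connection{0})};
  Message no_connection{nullptr};
  assert(peer_lacks_feature(from_old_peer, FEATURE_MIN_SIZE_RECOVERY));
  assert(!peer_lacks_feature(no_connection, FEATURE_MIN_SIZE_RECOVERY));
}
```

In the excerpt above, the equivalent of calling operator-> on the null pointer is what produces the "px != 0" assertion; the sketch simply tests the pointer before using it.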
Actions #11

Updated by Kefu Chai almost 8 years ago

  • Status changed from New to Resolved

I dropped 5793b13492feecad399451e3a83836722b6e9abc in https://github.com/ceph/ceph/pull/9090, so I am closing this issue as "resolved".
