Bug #16639
closed
flush_pg_stats crashing in hammer integration tests
Description
During a recent hammer integration run, we saw ceph tell osd.? flush_pg_stats crash with the following assertion:
"ceph-osd: /usr/include/boost/smart_ptr/intrusive_ptr.hpp:162: T* boost::intrusive_ptr<T>::operator->() const [with T = Connection]: Assertion `px != 0' failed."
For example, it happened during the osd_backfill task, after the osd is killed and marked down, at this line:
https://github.com/ceph/ceph-qa-suite/blob/master/tasks/osd_backfill.py#L95
Link to the logs: http://pulpito.ceph.com/smithfarm-2016-07-02_01:06:27-rados-hammer-backports---basic-smithi/287997
Updated by Nathan Cutler almost 8 years ago
- Subject changed from flush_pg_stats crashing with ENXIO in hammer integration tests to flush_pg_stats crashing in hammer integration tests
- Description updated (diff)
Updated by Nathan Cutler almost 8 years ago
Pushed a copy of the hammer-backports branch that is exhibiting this bug, as "hammer-backports-20160708"
Updated by Samuel Just almost 8 years ago
5793b13492feecad399451e3a83836722b6e9abc might be related
Updated by Nathan Cutler almost 8 years ago
Scheduled full rados suite on "hammer" branch as a baseline:
http://pulpito.ceph.com/smithfarm-2016-07-08_14:34:18-rados-hammer---basic-smithi/
Updated by Nathan Cutler almost 8 years ago
The intrusive_ptr failures are reproducible: http://pulpito.ceph.com/smithfarm-2016-07-08_14:05:08-rados-hammer-backports---basic-smithi/ (rescheduled failed and dead jobs from the first run)
Updated by Nathan Cutler almost 8 years ago
The intrusive_ptr assertions do not appear in the hammer baseline run, so it's safe to assume they were introduced somewhere in the backports.
Updated by Kefu Chai almost 8 years ago
- I reverted 5793b13492feecad399451e3a83836722b6e9abc and reran the test at http://pulpito.ceph.com/kchai-2016-07-13_03:10:12-rados-wip-16639-hammer---basic-mira/: passed.
- I repeated the test of http://pulpito.ceph.com/smithfarm-2016-07-02_01:06:27-rados-hammer-backports---basic-smithi/287997 at http://pulpito.ceph.com/kchai-2016-07-13_04:58:48-rados-hammer-backports---basic-mira/: failed.
Updated by Kefu Chai almost 8 years ago
should drop 5793b13492feecad399451e3a83836722b6e9abc, see https://github.com/ceph/ceph/blob/hammer/src/osd/PG.cc#L6797
boost::statechart::result PG::RecoveryState::Stray::react(const MLogRec& logevt)
  //...
  if (msg->info.last_backfill == hobject_t()) {
    if (!(msg->get_connection()->get_features() & CEPH_FEATURE_OSD_MIN_SIZE_RECOVERY)) {
      dout(10) << "Got logevt resetting backfill from peer featuring bug"
               << " 10780, setting msg->info.last_epoch_started to logevt.query_epoch,"
               << " which is the activation epoch." << dendl;
      msg->info.last_epoch_started = msg->get_query_epoch();
    }
we are dereferencing msg->connection to work around a known bug. If that connection is null, the intrusive_ptr assertion quoted in the description fires.
Updated by Kefu Chai almost 8 years ago
- Status changed from New to Resolved
I dropped 5793b13492feecad399451e3a83836722b6e9abc in https://github.com/ceph/ceph/pull/9090, so closing this issue as "resolved".
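
For readers following the analysis above, here is a minimal sketch of the failure pattern: a feature check that dereferences a message's connection, and a variant that tests the pointer first. This is not the fix taken in this issue (the commit was dropped instead). Connection, Message, FEATURE_MIN_SIZE_RECOVERY and peer_lacks_feature are hypothetical stand-ins, std::shared_ptr replaces boost::intrusive_ptr to keep the example self-contained, and treating a null connection as "workaround not needed" is an assumption made only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for Ceph's Connection; not the real class.
struct Connection {
  uint64_t features;
  uint64_t get_features() const { return features; }
};

// Hypothetical stand-in for a message that may carry no connection.
struct Message {
  std::shared_ptr<Connection> connection;  // may be null
  const std::shared_ptr<Connection>& get_connection() const { return connection; }
};

// Hypothetical feature bit, standing in for CEPH_FEATURE_OSD_MIN_SIZE_RECOVERY.
constexpr uint64_t FEATURE_MIN_SIZE_RECOVERY = 1ULL << 0;

// Returns true when the peer is known to lack the feature.
// A null connection returns false instead of being dereferenced
// (assumption: with no connection, the workaround is skipped).
bool peer_lacks_feature(const Message& msg) {
  const auto& con = msg.get_connection();
  if (!con) {
    return false;
  }
  return !(con->get_features() & FEATURE_MIN_SIZE_RECOVERY);
}
```

Calling peer_lacks_feature on a Message whose connection is null returns false rather than aborting; the unguarded form msg.get_connection()->get_features() is the pattern that trips the px != 0 assertion with boost::intrusive_ptr.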