Bug #15408

closed

"osd/PG.cc: 1892: FAILED assert(waiting_for_peered.empty())" in upgrade:hammer-hammer-distro-basic-openstack

Added by Yuri Weinstein about 8 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: upgrade/hammer
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-04-06_20:05:02-upgrade:hammer-hammer-distro-basic-openstack/
Job: 30258
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-04-06_20:05:02-upgrade:hammer-hammer-distro-basic-openstack/30258/teuthology.log

2016-04-06T22:09:29.988 INFO:teuthology.orchestra.run.target086019.stderr:dumped all in format json
2016-04-06T22:09:30.814 INFO:teuthology.orchestra.run.target086019:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph pg dump --format=json'
2016-04-06T22:09:31.295 INFO:teuthology.orchestra.run.target086019:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 158.69.86.189:6790 mon_status'
2016-04-06T22:09:32.021 INFO:tasks.ceph.osd.5.target086189.stderr:osd/PG.cc: In function 'void PG::replay_queued_ops()' thread 7f2da141d700 time 2016-04-06 22:09:31.915151
2016-04-06T22:09:32.022 INFO:tasks.ceph.osd.5.target086189.stderr:osd/PG.cc: 1892: FAILED assert(waiting_for_peered.empty())
2016-04-06T22:09:32.212 INFO:tasks.ceph.osd.5.target086189.stderr: ceph version 0.94.6-254-ge219e85 (e219e85be00088eecde7b1f29d7699493a79bc4d)
2016-04-06T22:09:32.213 INFO:tasks.ceph.osd.5.target086189.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbb1c6b]
2016-04-06T22:09:32.213 INFO:tasks.ceph.osd.5.target086189.stderr: 2: (PG::replay_queued_ops()+0x432) [0x7cc962]
2016-04-06T22:09:32.214 INFO:tasks.ceph.osd.5.target086189.stderr: 3: (OSD::check_replay_queue()+0x3f1) [0x674c61]
2016-04-06T22:09:32.214 INFO:tasks.ceph.osd.5.target086189.stderr: 4: (OSD::tick()+0x60c) [0x6b3d1c]
2016-04-06T22:09:32.214 INFO:tasks.ceph.osd.5.target086189.stderr: 5: (Context::complete(int)+0x9) [0x6c2a49]
2016-04-06T22:09:32.214 INFO:tasks.ceph.osd.5.target086189.stderr: 6: (SafeTimer::timer_thread()+0xec) [0xb9ad6c]
2016-04-06T22:09:32.215 INFO:tasks.ceph.osd.5.target086189.stderr: 7: (SafeTimerThread::entry()+0xd) [0xb9bd0d]
2016-04-06T22:09:32.215 INFO:tasks.ceph.osd.5.target086189.stderr: 8: (()+0x8182) [0x7f2daa145182]
2016-04-06T22:09:32.215 INFO:tasks.ceph.osd.5.target086189.stderr: 9: (clone()+0x6d) [0x7f2da86b047d]
2016-04-06T22:09:32.216 INFO:tasks.ceph.osd.5.target086189.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-04-06T22:09:32.217 INFO:teuthology.orchestra.run.target086019.stderr:dumped all in format json
Actions #1

Updated by Loïc Dachary about 8 years ago

It fails when upgrading from v0.94.2 to the latest hammer. The code where it fails was modified in v0.94.6 by http://tracker.ceph.com/projects/ceph/repository/revisions/9f3aebee16e256888b149fa770df845787b06b6e/diff

Actions #2

Updated by Sage Weil about 8 years ago

  • Status changed from New to 12

I don't think this is hammer specific:

- boost::statechart::result PG::RecoveryState::Active::react(const AllReplicasActivated &evt) (and elsewhere) guards the requeue of waiting_for_peered:

    if (pg->flushes_in_progress == 0) {
      pg->requeue_ops(pg->waiting_for_peered);
    }

- but the replay queue blindly takes pgs that have expired and tries to do the queued events:

    if ((pg->is_active() || pg->is_activating()) &&
        pg->is_replay() &&
        pg->is_primary() &&
        pg->replay_until == p->second) {
      pg->replay_queued_ops();

(which ignores flushes_in_progress), and replay_queued_ops will

    if (is_active()) {
      requeue_ops(replay);
      requeue_ops(waiting_for_active);
      assert(waiting_for_peered.empty());

I.e., the active state is not linked to whether there are flushes in progress, yet we assert that waiting_for_peered is empty, which the guarded requeue only ensures once there are no flushes. We could delay replay_queued_ops until we are flushed, I suppose?

Honestly, I'd rather rip replay out entirely.
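
The mismatch described above can be illustrated with a small standalone model. This is only a sketch, not Ceph code: ToyPG, make_racy_pg, replay_invariant_holds and can_replay_now are hypothetical names, and on_flushed is an assumed simplification of what happens when a flush completes. Only flushes_in_progress, waiting_for_peered and requeue_ops are taken from the snippets quoted in this comment.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy stand-in for a PG (hypothetical; not the real Ceph class).
struct ToyPG {
    bool active = false;
    int flushes_in_progress = 0;
    std::deque<std::string> waiting_for_peered;
    std::deque<std::string> requeued;  // ops handed back to the work queue

    void requeue_ops(std::deque<std::string>& q) {
        for (const auto& op : q) requeued.push_back(op);
        q.clear();
    }

    // Mirrors the guarded path: waiting_for_peered is drained only
    // when no flush is outstanding.
    void on_all_replicas_activated() {
        active = true;
        if (flushes_in_progress == 0) requeue_ops(waiting_for_peered);
    }

    // The condition the failing assert relies on once the PG is active.
    bool replay_invariant_holds() const {
        return !active || waiting_for_peered.empty();
    }

    // Hypothetical guard for the suggested fix: defer replay while flushing.
    bool can_replay_now() const {
        return active && flushes_in_progress == 0;
    }

    // Assumed behaviour: the last completed flush drains waiting_for_peered.
    void on_flushed() {
        assert(flushes_in_progress > 0);
        if (--flushes_in_progress == 0 && active) requeue_ops(waiting_for_peered);
    }
};

// Builds the suspected state: the PG becomes active while a flush is
// pending and one op is parked in waiting_for_peered.
ToyPG make_racy_pg() {
    ToyPG pg;
    pg.flushes_in_progress = 1;
    pg.waiting_for_peered.push_back("client_op");
    pg.on_all_replicas_activated();
    return pg;
}
```

In this model, a replay attempted right after make_racy_pg() would violate the invariant, while checking can_replay_now() first (the "delay until flushed" idea) would postpone it until on_flushed() has drained the queue.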

Actions #3

Updated by Samuel Just over 7 years ago

  • Status changed from 12 to Can't reproduce