Bug #43150: osd-scrub-snaps.sh fails - RADOS - Ceph

Bug #43150

closed

osd-scrub-snaps.sh fails

Added by Sage Weil over 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: David Zafman
Backport: nautilus
Severity: 3 - minor
Pull request ID: 32039

Description

/a/sage-2019-12-04_19:33:15-rados-wip-sage2-testing-2019-12-04-0856-distro-basic-smithi/4567061
/a/sage-2019-12-04_19:29:26-rados-wip-sage-testing-2019-12-04-0930-distro-basic-smithi/4566764

This seems to fail in every (or almost every) rados suite run.

Related issues 1 (0 open — 1 closed)

Updated by David Zafman over 4 years ago

Assignee set to David Zafman

Updated by David Zafman over 4 years ago

During testing I saw this crash, even though it isn't what happened in the teuthology runs. I think that in all cases a scrub request is racing with a newly started OSD that is still setting up the PG. The crash happened because the PG was still in the "unknown" state.

   -11> 2019-12-05T09:13:21.830-0800 7f008ffed700 10 osd.0 18 handle_fast_scrub scrub2([1.0]) v1
   -10> 2019-12-05T09:13:21.830-0800 7f008ffed700 15 osd.0 18 enqueue_peering_evt 1.0 epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow
    -9> 2019-12-05T09:13:21.830-0800 7f008ffed700 20 osd.0 op_wq(0) _enqueue OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18)
    -8> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process 1.0 to_process <> waiting <> waiting_peering {}
    -7> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18) queued
    -6> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process 1.0 to_process <OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18)> waiting <> waiting_peering {}
    -5> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18) pg 0x556390f2c000
    -4> 2019-12-05T09:13:21.830-0800 7f0072777700 10 osd.0 pg_epoch: 18 pg[1.0( v 18'56 (0'0,18'56] local-lis/les=9/10 n=36 ec=9/9 lis/c=9/9 les/c/f=10/10/0 sis=9) [0] r=0 lpr=18 crt=18'56 lcod 0'0 mlcod 0'0 unknown mbc={}] do_peering_event: epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow
    -3> 2019-12-05T09:13:21.830-0800 7f0072777700  5 osd.0 pg_epoch: 18 pg[1.0( v 18'56 (0'0,18'56] local-lis/les=9/10 n=36 ec=9/9 lis/c=9/9 les/c/f=10/10/0 sis=9) [0] r=0 lpr=18 crt=18'56 lcod 0'0 mlcod 0'0 unknown mbc={}] exit Reset 2019-12-05T09:13:21.832504-0800 2 0.000078
    -2> 2019-12-05T09:13:21.830-0800 7f0072777700  5 osd.0 pg_epoch: 18 pg[1.0( v 18'56 (0'0,18'56] local-lis/les=9/10 n=36 ec=9/9 lis/c=9/9 les/c/f=10/10/0 sis=9) [0] r=0 lpr=18 crt=18'56 lcod 0'0 mlcod 0'0 unknown mbc={}] enter Crashed
    -1> 2019-12-05T09:13:21.882-0800 7f0072777700 -1 /home/dzafman/ceph/src/osd/PeeringState.cc: In function 'PeeringState::Crashed::Crashed(boost::statechart::state<PeeringState::Crashed, PeeringState::PeeringMachine>::my_context)' thread 7f0072777700 time 2019-12-05T09:13:21.832551-0800
/home/dzafman/ceph/src/osd/PeeringState.cc: 4206: ceph_abort_msg("we got a bad state machine event")
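
The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyPG, PGState, BadStateMachineEvent and the tolerate_early_scrub switch are hypothetical names invented for this illustration. They are not Ceph's PeeringState API, and the sketch does not claim to reproduce what pull request 32039 changed. It only shows why a scrub request that reaches a PG still in the "unknown" state aborts when the state machine has no handler for it, and how dropping the early request avoids the abort.

```python
from enum import Enum, auto


class PGState(Enum):
    UNKNOWN = auto()  # PG is still being set up after the OSD started
    ACTIVE = auto()   # PG is ready, scrub requests can be served


class BadStateMachineEvent(RuntimeError):
    """Stand-in for the ceph_abort_msg() seen in the log above."""


class ToyPG:
    """Hypothetical, heavily simplified PG; not Ceph's PeeringState."""

    def __init__(self, tolerate_early_scrub):
        self.state = PGState.UNKNOWN
        self.tolerate_early_scrub = tolerate_early_scrub
        self.scrubs_started = 0
        self.scrubs_dropped = 0

    def activate(self):
        """Model the PG finishing its set-up."""
        self.state = PGState.ACTIVE

    def request_scrub(self):
        """Return True if the scrub was started, False if it was dropped."""
        if self.state is PGState.ACTIVE:
            self.scrubs_started += 1
            return True
        if self.tolerate_early_scrub:
            # Assumed mitigation: ignore a request that arrives too early.
            self.scrubs_dropped += 1
            return False
        # Unhandled event in this state: model the "Crashed" transition.
        raise BadStateMachineEvent("we got a bad state machine event")


def early_scrub_crashes(pg):
    """Send a scrub request before activation and report whether it aborts."""
    try:
        pg.request_scrub()
    except BadStateMachineEvent:
        return True
    return False


if __name__ == "__main__":
    print(early_scrub_crashes(ToyPG(tolerate_early_scrub=False)))
    print(early_scrub_crashes(ToyPG(tolerate_early_scrub=True)))
```

Dropping the early request is only one possible choice. Queueing it until the PG becomes active, or having the test wait for the PG before requesting a scrub, would avoid the same abort.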

Updated by David Zafman over 4 years ago

Status changed from 12 to In Progress
Pull request ID set to 32039

Updated by Sage Weil over 4 years ago

Status changed from In Progress to Resolved

Updated by David Zafman about 4 years ago

Status changed from Resolved to Pending Backport
Backport set to nautilus, mimic, luminous

Updated by David Zafman about 4 years ago

Backport changed from nautilus, mimic, luminous to nautilus

Updated by Nathan Cutler about 4 years ago

Copied to Backport #43852: nautilus: osd-scrub-snaps.sh fails added

Updated by Yuri Weinstein about 4 years ago

https://github.com/ceph/ceph/pull/33274 merged

Updated by Nathan Cutler about 4 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
