Bug #43150

closed

osd-scrub-snaps.sh fails

Added by Sage Weil over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/sage-2019-12-04_19:33:15-rados-wip-sage2-testing-2019-12-04-0856-distro-basic-smithi/4567061
/a/sage-2019-12-04_19:29:26-rados-wip-sage-testing-2019-12-04-0930-distro-basic-smithi/4566764

This seems to happen in every (or almost every) rados suite run.


Related issues: 1 (0 open, 1 closed)

Copied to RADOS - Backport #43852: nautilus: osd-scrub-snaps.sh fails (Resolved, Nathan Cutler)
Actions #1

Updated by David Zafman over 4 years ago

  • Assignee set to David Zafman
Actions #2

Updated by David Zafman over 4 years ago

During testing I saw the crash below, even though it isn't what happened in the teuthology runs. I think in all cases we have a scrub request racing with a newly started OSD that is still setting up the PG. The crash happened because the PG was still in the "unknown" state.

-11> 2019-12-05T09:13:21.830-0800 7f008ffed700 10 osd.0 18 handle_fast_scrub scrub2([1.0]) v1
-10> 2019-12-05T09:13:21.830-0800 7f008ffed700 15 osd.0 18 enqueue_peering_evt 1.0 epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow
-9> 2019-12-05T09:13:21.830-0800 7f008ffed700 20 osd.0 op_wq(0) _enqueue OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18)
-8> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process 1.0 to_process <> waiting <> waiting_peering {}
-7> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18) queued
-6> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process 1.0 to_process <OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18)> waiting <> waiting_peering {}
-5> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18) pg 0x556390f2c000
-4> 2019-12-05T09:13:21.830-0800 7f0072777700 10 osd.0 pg_epoch: 18 pg[1.0( v 18'56 (0'0,18'56] local-lis/les=9/10 n=36 ec=9/9 lis/c=9/9 les/c/f=10/10/0 sis=9) [0] r=0 lpr=18 crt=18'56 lcod 0'0 mlcod 0'0 unknown mbc={}] do_peering_event: epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow
-3> 2019-12-05T09:13:21.830-0800 7f0072777700 5 osd.0 pg_epoch: 18 pg[1.0( v 18'56 (0'0,18'56] local-lis/les=9/10 n=36 ec=9/9 lis/c=9/9 les/c/f=10/10/0 sis=9) [0] r=0 lpr=18 crt=18'56 lcod 0'0 mlcod 0'0 unknown mbc={}] exit Reset 2019-12-05T09:13:21.832504-0800 2 0.000078
-2> 2019-12-05T09:13:21.830-0800 7f0072777700 5 osd.0 pg_epoch: 18 pg[1.0( v 18'56 (0'0,18'56] local-lis/les=9/10 n=36 ec=9/9 lis/c=9/9 les/c/f=10/10/0 sis=9) [0] r=0 lpr=18 crt=18'56 lcod 0'0 mlcod 0'0 unknown mbc={}] enter Crashed
-1> 2019-12-05T09:13:21.882-0800 7f0072777700 -1 /home/dzafman/ceph/src/osd/PeeringState.cc: In function 'PeeringState::Crashed::Crashed(boost::statechart::state<PeeringState::Crashed, PeeringState::PeeringMachine>::my_context)' thread 7f0072777700 time 2019-12-05T09:13:21.832551-0800
/home/dzafman/ceph/src/osd/PeeringState.cc: 4206: ceph_abort_msg("we got a bad state machine event")
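
To illustrate the race described above, here is a minimal toy model in Python. It is not Ceph code and does not show the fix in the pull request; it only sketches why the order of events matters. The names "Reset", "Crashed" and "RequestScrub" come from the log; the "PeeringDone" event, the "Active" state, the transition table and the `run` helper are hypothetical simplifications.

```python
# Toy transition table: state -> {event: next_state}.
# Assumption: "PeeringDone" and "Active" are hypothetical stand-ins for
# whatever sequence of real peering events makes the PG ready for a scrub.
TRANSITIONS = {
    "Reset": {"PeeringDone": "Active"},
    "Active": {"RequestScrub": "Active"},
}


def run(events):
    """Feed events to the toy machine in order and return the final state."""
    state = "Reset"  # PG on a freshly started OSD, still shown as "unknown"
    for event in events:
        next_state = TRANSITIONS.get(state, {}).get(event)
        if next_state is None:
            # The current state has no reaction for this event. In the log
            # this is the "exit Reset" / "enter Crashed" pair.
            return "Crashed"
        state = next_state
    return state


if __name__ == "__main__":
    # Scrub requested after the PG is set up: handled normally.
    print(run(["PeeringDone", "RequestScrub"]))  # prints "Active"
    # Scrub requested while the PG is still being set up: the race.
    print(run(["RequestScrub", "PeeringDone"]))  # prints "Crashed"
```

In this sketch the same two events give different results depending only on their order, which matches the observation that the scrub request arrived before the PG left the "unknown" state.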
Actions #3

Updated by David Zafman over 4 years ago

  • Status changed from 12 to In Progress
  • Pull request ID set to 32039
Actions #4

Updated by Sage Weil over 4 years ago

  • Status changed from In Progress to Resolved
Actions #5

Updated by David Zafman about 4 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to nautilus, mimic, luminous
Actions #6

Updated by David Zafman about 4 years ago

  • Backport changed from nautilus, mimic, luminous to nautilus
Actions #7

Updated by Nathan Cutler about 4 years ago

Actions #9

Updated by Nathan Cutler about 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
