Bug #43150
osd-scrub-snaps.sh fails
% Done:
0%
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Pull request ID:
32039
Description
/a/sage-2019-12-04_19:33:15-rados-wip-sage2-testing-2019-12-04-0856-distro-basic-smithi/4567061
/a/sage-2019-12-04_19:29:26-rados-wip-sage-testing-2019-12-04-0930-distro-basic-smithi/4566764
This appears in every (or almost every) rados suite run.
History
#1 Updated by David Zafman almost 4 years ago
- Assignee set to David Zafman
#2 Updated by David Zafman almost 4 years ago
During testing I saw this, even though it isn't exactly what happened in the teuthology runs. I think that in all cases we have a scrub request racing with a newly started OSD that is still setting up the PG. The crash happened because the PG was still in the "unknown" state.
-11> 2019-12-05T09:13:21.830-0800 7f008ffed700 10 osd.0 18 handle_fast_scrub scrub2([1.0]) v1
-10> 2019-12-05T09:13:21.830-0800 7f008ffed700 15 osd.0 18 enqueue_peering_evt 1.0 epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow
-9> 2019-12-05T09:13:21.830-0800 7f008ffed700 20 osd.0 op_wq(0) _enqueue OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18)
-8> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process 1.0 to_process <> waiting <> waiting_peering {}
-7> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18) queued
-6> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process 1.0 to_process <OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18)> waiting <> waiting_peering {}
-5> 2019-12-05T09:13:21.830-0800 7f0072777700 20 osd.0 op_wq(0) _process OpSchedulerItem(1.0 PGPeeringEvent(epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow) prio 255 cost 10 e18) pg 0x556390f2c000
-4> 2019-12-05T09:13:21.830-0800 7f0072777700 10 osd.0 pg_epoch: 18 pg[1.0( v 18'56 (0'0,18'56] local-lis/les=9/10 n=36 ec=9/9 lis/c=9/9 les/c/f=10/10/0 sis=9) [0] r=0 lpr=18 crt=18'56 lcod 0'0 mlcod 0'0 unknown mbc={}] do_peering_event: epoch_sent: 18 epoch_requested: 18 RequestScrub(shallow
-3> 2019-12-05T09:13:21.830-0800 7f0072777700 5 osd.0 pg_epoch: 18 pg[1.0( v 18'56 (0'0,18'56] local-lis/les=9/10 n=36 ec=9/9 lis/c=9/9 les/c/f=10/10/0 sis=9) [0] r=0 lpr=18 crt=18'56 lcod 0'0 mlcod 0'0 unknown mbc={}] exit Reset 2019-12-05T09:13:21.832504-0800 2 0.000078
-2> 2019-12-05T09:13:21.830-0800 7f0072777700 5 osd.0 pg_epoch: 18 pg[1.0( v 18'56 (0'0,18'56] local-lis/les=9/10 n=36 ec=9/9 lis/c=9/9 les/c/f=10/10/0 sis=9) [0] r=0 lpr=18 crt=18'56 lcod 0'0 mlcod 0'0 unknown mbc={}] enter Crashed
-1> 2019-12-05T09:13:21.882-0800 7f0072777700 -1 /home/dzafman/ceph/src/osd/PeeringState.cc: In function 'PeeringState::Crashed::Crashed(boost::statechart::state<PeeringState::Crashed, PeeringState::PeeringMachine>::my_context)' thread 7f0072777700 time 2019-12-05T09:13:21.832551-0800
/home/dzafman/ceph/src/osd/PeeringState.cc: 4206: ceph_abort_msg("we got a bad state machine event")
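The race in the log above can be sketched in miniature: the RequestScrub event is delivered while the PG is still in the "unknown" state, a state with no transition for it, so the state machine falls through to Crashed and aborts. This is a simplified illustration only, not Ceph's actual boost::statechart PeeringState code; the names PgMachine, PgState, and deliver_request_scrub are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified sketch of the failure mode seen above
// (not the real boost::statechart machine in PeeringState.cc).
enum class PgState { Unknown, Active, Crashed };

struct PgMachine {
  // A newly started OSD has not finished peering yet, so the PG
  // begins in the "unknown" state, as in the log excerpt.
  PgState state = PgState::Unknown;

  // Deliver a RequestScrub-like event. Only Active has a handler;
  // in any other state there is no transition, so the machine
  // enters Crashed -- the point where the real code calls
  // ceph_abort_msg("we got a bad state machine event").
  std::string deliver_request_scrub() {
    if (state == PgState::Active) {
      return "scrub scheduled";
    }
    state = PgState::Crashed;
    return "bad state machine event";
  }
};
```

Delivering the event to a machine still in Unknown reproduces the abort path, while the same event against an Active PG is handled normally, which is why the failure only shows up when the scrub request races with OSD startup.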
#3 Updated by David Zafman almost 4 years ago
- Status changed from 12 to In Progress
- Pull request ID set to 32039
#4 Updated by Sage Weil almost 4 years ago
- Status changed from In Progress to Resolved
#5 Updated by David Zafman over 3 years ago
- Status changed from Resolved to Pending Backport
- Backport set to nautilus, mimic, luminous
#6 Updated by David Zafman over 3 years ago
- Backport changed from nautilus, mimic, luminous to nautilus
#7 Updated by Nathan Cutler over 3 years ago
- Copied to Backport #43852: nautilus: osd-scrub-snaps.sh fails added
#8 Updated by Yuri Weinstein over 3 years ago
#9 Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".