Bug #52012

osd/scrub: src/osd/scrub_machine.cc: 55: FAILED ceph_assert(state_cast<const NotActive*>()

Added by Ronen Friedman over 2 years ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

100%

Source:
Tags:
backport_processed
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A new scrub request that arrives at the replica after a manual 'set noscrub' followed by 'unset' triggers an assert, because the replica is
still handling the aborted request.

Symptoms:

INFO:tasks.ceph.osd.4.smithi148.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4-667-gfc6905f2/rpm/el8/BUILD/ceph-16.2.4-667-gfc6905f2/src/osd/scrub_machine.cc: 55: FAILED ceph_assert(state_cast<const NotActive*>())
2021-08-01T06:10:30.493 INFO:tasks.ceph.osd.4.smithi148.stderr:
2021-08-01T06:10:30.493 INFO:tasks.ceph.osd.4.smithi148.stderr: ceph version 16.2.4-667-gfc6905f2 (fc6905f219e9e2e40f07a232823c5569746134fa) pacific (stable)
2021-08-01T06:10:30.494 INFO:tasks.ceph.osd.4.smithi148.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x55ab02496ed4]
2021-08-01T06:10:30.494 INFO:tasks.ceph.osd.4.smithi148.stderr: 2: ceph-osd(+0x56a0ee) [0x55ab024970ee]
2021-08-01T06:10:30.494 INFO:tasks.ceph.osd.4.smithi148.stderr: 3: ceph-osd(+0x9dfeef) [0x55ab0290ceef]
2021-08-01T06:10:30.494 INFO:tasks.ceph.osd.4.smithi148.stderr: 4: (PgScrubber::replica_scrub_op(boost::intrusive_ptr<OpRequest>)+0x4bf) [0x55ab028fd1cf]
2021-08-01T06:10:30.495 INFO:tasks.ceph.osd.4.smithi148.stderr: 5: (PG::replica_scrub(boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x62) [0x55ab0264caf2]
2021-08-01T06:10:30.495 INFO:tasks.ceph.osd.4.smithi148.stderr: 6: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x7bb) [0x55ab0271204b]
2021-08-01T06:10:30.495 INFO:tasks.ceph.osd.4.smithi148.stderr: 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x55ab0259b2a9]
2021-08-01T06:10:30.495 INFO:tasks.ceph.osd.4.smithi148.stderr: 8: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x55ab027f88e8]


Related issues

Copied to RADOS - Backport #53338: pacific: osd/scrub: src/osd/scrub_machine.cc: 55: FAILED ceph_assert(state_cast<const NotActive*>() Resolved

History

#1 Updated by Ronen Friedman over 2 years ago

  • Status changed from New to In Progress

Scenario:
- Primary reserves the replica
- Primary requests a scrub
- Replica is in the process of creating the scrub-map, waiting for the backend
- Primary aborts the scrub (due to manual 'noscrub')
- (A) Primary releases the replica 'resource'
- (B) Primary reserves the replica
- Primary re-requests a scrub
- Replica asserts, as it is still actively scrubbing.

Note that the existing code assumes that the new request will belong to a new interval (in which case the Replica would have noticed the interval change),
or, at least, to a new epoch (which was also supposed to be noticed). In this scenario it belongs to neither.
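
The sequence above can be modelled with a small sketch. This is not the Ceph code: `ReplicaModel`, `ReplicaState` and the method names are hypothetical, and the sketch assumes that releasing the reservation does not by itself stop the replica's map building. A `false` return stands in for the failed `ceph_assert(state_cast<const NotActive*>())`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, heavily simplified model of the replica side.
// None of these names exist in Ceph; they only illustrate the race.
enum class ReplicaState { NotActive, BuildingMap };

struct ReplicaModel {
  ReplicaState state = ReplicaState::NotActive;
  std::uint32_t last_seen_epoch = 0;

  // Handles a scrub request from the primary.
  // Returns false where the real code would hit the assert.
  bool on_scrub_request(std::uint32_t epoch) {
    // A new epoch (or interval) is noticed, and the old scrub is dropped.
    if (epoch != last_seen_epoch) {
      state = ReplicaState::NotActive;
      last_seen_epoch = epoch;
    }
    if (state != ReplicaState::NotActive) {
      return false;  // still busy with the aborted scrub
    }
    state = ReplicaState::BuildingMap;  // now waiting for the backend
    return true;
  }

  // (A) Assumption: releasing the reservation leaves the map building untouched.
  void on_release() {}

  // The backend finished building the scrub-map.
  void on_backend_done() {
    assert(state == ReplicaState::BuildingMap);
    state = ReplicaState::NotActive;
  }
};
```

In this model, a request, a release, and a second request in the same epoch make the second call return `false`, while the same sequence with a bumped epoch succeeds.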

#2 Updated by Ronen Friedman over 2 years ago

The fix is to use (A) & (B) above as a hint to the Replica to discard all stale scrub processes.
In the suggested fix:
A token (index) is modified on each resource request ((B) above). The Replica tags all scrub-map creation
processes with this token, and is thus able to identify stale events.

One missing piece: there is no easy way to tag the 'update' operations at the end of the scrub. However, I do not see how this
could create a problem.
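
The token idea can be sketched roughly as follows. This illustrates the technique only and is not the code from the pull request: `TokenedReplica`, `ReservationToken` and all method names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type; the actual fix may represent the token differently.
using ReservationToken = std::uint64_t;

struct TokenedReplica {
  ReservationToken current_token = 0;
  bool building_map = false;
  int stale_events_dropped = 0;

  // (B) Each reservation request changes the token and discards
  // whatever scrub-map creation was still in flight.
  ReservationToken on_reserve() {
    building_map = false;
    return ++current_token;
  }

  // Starts building a scrub-map; the returned token is the tag that
  // every backend event belonging to this scrub will carry.
  ReservationToken on_scrub_request() {
    assert(current_token != 0);  // a request is expected only after a reservation
    building_map = true;
    return current_token;
  }

  // A backend event arrives, tagged with the token it was scheduled under.
  // Returns false if the event is stale and was ignored.
  bool on_backend_event(ReservationToken tag) {
    if (tag != current_token) {
      ++stale_events_dropped;
      return false;
    }
    building_map = false;  // map creation for the current scrub is done
    return true;
  }
};
```

With this scheme, an event tagged under the first reservation is dropped once the primary has reserved the replica again, so the second scrub request no longer collides with the aborted one.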

#3 Updated by Neha Ojha over 2 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#4 Updated by Neha Ojha over 2 years ago

  • Backport set to pacific

#5 Updated by Neha Ojha over 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 42684

#6 Updated by Neha Ojha over 2 years ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Backport Bot over 2 years ago

  • Copied to Backport #53338: pacific: osd/scrub: src/osd/scrub_machine.cc: 55: FAILED ceph_assert(state_cast<const NotActive*>() added

#9 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#10 Updated by Konstantin Shalygin 4 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
