Bug #52012
osd/scrub: src/osd/scrub_machine.cc: 55: FAILED ceph_assert(state_cast<const NotActive*>()
Description
A new scrub request arriving at the replica after a manual 'set noscrub' followed by 'unset' triggers an assert, because the replica is still handling the aborted request.
Symptoms:
INFO:tasks.ceph.osd.4.smithi148.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4-667-gfc6905f2/rpm/el8/BUILD/ceph-16.2.4-667-gfc6905f2/src/osd/scrub_machine.cc: 55: FAILED ceph_assert(state_cast<const NotActive*>())
2021-08-01T06:10:30.493 INFO:tasks.ceph.osd.4.smithi148.stderr:
2021-08-01T06:10:30.493 INFO:tasks.ceph.osd.4.smithi148.stderr: ceph version 16.2.4-667-gfc6905f2 (fc6905f219e9e2e40f07a232823c5569746134fa) pacific (stable)
2021-08-01T06:10:30.494 INFO:tasks.ceph.osd.4.smithi148.stderr: 1: (ceph::_ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x55ab02496ed4]
2021-08-01T06:10:30.494 INFO:tasks.ceph.osd.4.smithi148.stderr: 2: ceph-osd(+0x56a0ee) [0x55ab024970ee]
2021-08-01T06:10:30.494 INFO:tasks.ceph.osd.4.smithi148.stderr: 3: ceph-osd(+0x9dfeef) [0x55ab0290ceef]
2021-08-01T06:10:30.494 INFO:tasks.ceph.osd.4.smithi148.stderr: 4: (PgScrubber::replica_scrub_op(boost::intrusive_ptr<OpRequest>)+0x4bf) [0x55ab028fd1cf]
2021-08-01T06:10:30.495 INFO:tasks.ceph.osd.4.smithi148.stderr: 5: (PG::replica_scrub(boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x62) [0x55ab0264caf2]
2021-08-01T06:10:30.495 INFO:tasks.ceph.osd.4.smithi148.stderr: 6: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x7bb) [0x55ab0271204b]
2021-08-01T06:10:30.495 INFO:tasks.ceph.osd.4.smithi148.stderr: 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x55ab0259b2a9]
2021-08-01T06:10:30.495 INFO:tasks.ceph.osd.4.smithi148.stderr: 8: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x55ab027f88e8]
History
#1 Updated by Ronen Friedman over 2 years ago
- Status changed from New to In Progress
Scenario:
- Primary reserves the replica
- Primary requests a scrub
- Replica is in the process of creating the scrub-map, waiting for the backend
- Primary aborts the scrub (due to manual 'noscrub')
(A)- Primary releases the replica 'resource'
(B)- Primary reserves the replica
- Primary re-requests a scrub
- Replica asserts, as it is still actively scrubbing.
Note that the existing code assumes the new request will belong to a new interval (and the replica would have noticed the interval change),
or, at least, to a new epoch (which was also supposed to be noticed). Neither is the case here.
#2 Updated by Ronen Friedman over 2 years ago
The fix is to use (A) & (B) above as a hint to the replica to discard all stale scrub processes.
In the suggested fix:
A token (an index) is incremented on each resource request ((B) above). The replica tags all scrub-map creation
processes with this token, and is thus able to identify and discard stale events.
One missing piece: there is no easy way to tag the 'update' operations at the end of the scrub, but I do not see how this
can create a problem.
#3 Updated by Neha Ojha over 2 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)
#4 Updated by Neha Ojha over 2 years ago
- Backport set to pacific
#5 Updated by Neha Ojha over 2 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 42684
#6 Updated by Neha Ojha over 2 years ago
- Status changed from Fix Under Review to Pending Backport
#7 Updated by Backport Bot over 2 years ago
- Copied to Backport #53338: pacific: osd/scrub: src/osd/scrub_machine.cc: 55: FAILED ceph_assert(state_cast<const NotActive*>() added
#8 Updated by Yuri Weinstein almost 2 years ago
#9 Updated by Backport Bot over 1 year ago
- Tags set to backport_processed
#10 Updated by Konstantin Shalygin 4 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100