Bug #57940 (closed)

ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed

Added by Thomas Le Gentil over 1 year ago. Updated about 1 year ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi, I am currently hitting the following crash:

A disk failed in my Ceph cluster.
I replaced the disk, but now, during the rebalancing / backfilling, one OSD crashes (osd.1).

When I set the 'nobackfill' flag, the OSD does not crash, and it crashes again right after the flag is removed.
The crash in the log looks like https://tracker.ceph.com/issues/56772
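
For reference, I toggle the flag cluster-wide with the standard ceph CLI; this is just a sketch of what I run, nothing here is specific to this cluster:

# pause backfill: osd.1 stays up
ceph osd set nobackfill
# resume backfill: osd.1 crashes shortly afterwards
ceph osd unset nobackfill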

I've attached the complete log; here is the last part of the crash:

ceph version 17.2.4 (b26dd582fcc41389ea06191f19e88eed6eccea5b) quincy (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7ff3bc315140]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x565256aaffca]
5: /usr/bin/ceph-osd(+0xc2310e) [0x565256ab010e]
6: (SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x565256df05f3]
7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x565256c9a94e]
8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x19f3) [0x565256d05543]
9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf2a) [0x565256d0b42a]
10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x565256b7a175]
11: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x565256e34879]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xad0) [0x565256b9abc0]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x56525727dc1a]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5652572801f0]
15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7ff3bc309ea7]
16: clone()

    -1> 2022-10-25T21:05:48.188+0200 7ff39e1b7700 -1 ./src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7ff39e1b7700 time 2022-10-25T21:05:48.184867+0200
./src/osd/osd_types.cc: 5888: FAILED ceph_assert(clone_overlap.count(clone))

 ceph version 17.2.4 (b26dd582fcc41389ea06191f19e88eed6eccea5b) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x565256aaff70]
 2: /usr/bin/ceph-osd(+0xc2310e) [0x565256ab010e]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x565256df05f3]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x565256c9a94e]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x19f3) [0x565256d05543]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf2a) [0x565256d0b42a]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x565256b7a175]
 8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x565256e34879]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xad0) [0x565256b9abc0]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x56525727dc1a]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5652572801f0]
 12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7ff3bc309ea7]
 13: clone()

Files

issue-ceph.txt (926 KB) - Thomas Le Gentil, 10/27/2022 06:07 PM

Related issues 1 (1 open, 0 closed)

Is duplicate of RADOS - Bug #56772: crash: uint64_t SnapSet::get_clone_bytes(snapid_t) const: assert(clone_overlap.count(clone)) (New)

Actions #1

Updated by Neha Ojha over 1 year ago

  • Project changed from Ceph to RADOS
  • Subject changed from ceph osd crashes when nobackfill OSD flag is removed to ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed
  • Description updated (diff)
  • Category deleted (OSD)
Actions #2

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Duplicate

Looks like a duplicate of #56772.

Actions #3

Updated by Radoslaw Zarzynski over 1 year ago

  • Is duplicate of Bug #56772: crash: uint64_t SnapSet::get_clone_bytes(snapid_t) const: assert(clone_overlap.count(clone)) added
Actions #4

Updated by Thomas Le Gentil over 1 year ago

The OSD process does not crash if it is marked 'out'.

Actions #5

Updated by Thomas Le Gentil over 1 year ago

Thomas Le Gentil wrote:

The OSD process does not crash if it is marked 'out'.

Sorry, that was wrong: the OSD also crashes when it is marked 'out'.

Actions #6

Updated by Thomas Le Gentil over 1 year ago

I could avoid this crash by removing every copy of the PGs for which Ceph could not get the clone_bytes, except the one copy I was sure held good data.

Removing pg 17.0:
take the OSD down, then on each OSD that contains the PG, run "ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-6 --pgid 17.0 --force"

Keeping the good copy:
take the OSD down, then on the 'good' OSD run "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 17.0 --op mark-complete --no-mon-config"

Then Ceph backfills new copies of this PG without having to compute the clone_bytes.
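
Put together, the sequence looked roughly like this on my setup (the 'noout' flag and the systemctl steps are just how I take an OSD down on my hosts, not something the tool requires in exactly this form; the OSD ids and the pgid are from my cluster):

# avoid extra data movement while OSDs are stopped
ceph osd set noout

# on each OSD holding a copy I do not trust (osd.6 here): stop it and drop pg 17.0
systemctl stop ceph-osd@6
ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-6 --pgid 17.0 --force
systemctl start ceph-osd@6

# on the OSD holding the good copy (osd.1): stop it and mark the PG complete
systemctl stop ceph-osd@1
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 17.0 --op mark-complete --no-mon-config
systemctl start ceph-osd@1

ceph osd unset noout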

Actions #7

Updated by Thomas Le Gentil over 1 year ago

Thomas Le Gentil wrote:

I could avoid this crash by removing every copy of the PGs for which Ceph could not get the clone_bytes, except the one copy I was sure held good data.

Removing pg 17.0:
take the OSD down, then on each OSD that contains the PG, run "ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-6 --pgid 17.0 --force"

Keeping the good copy:
take the OSD down, then on the 'good' OSD run "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 17.0 --op mark-complete --no-mon-config"

Then Ceph backfills new copies of this PG without having to compute the clone_bytes.

In fact, this did not work for some reason :( The OSD did not crash for several days, but then it did.

Actions #8

Updated by Huy Nguyen about 1 year ago

Thomas Le Gentil wrote:

In fact, this did not work for some reason :( The OSD did not crash for several days, but then it did.

Hi, I have the same issue as you. What is the current status of your pool?
This bug can be worked around temporarily by backfilling manually:

On the source OSD, export the PG:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-441 --no-mon-config --pgid 6.fcc --op export --file ./pg6fcc

On the target OSD, delete the PG and import it from the file:

ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-416 --pgid 6.fcc --force
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-416 --no-mon-config --op import --file ./pg6fcc
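
For context, the whole cycle per PG looks roughly like this on my side (the systemctl steps and the file copy are just how I do it here; the OSD daemons have to be stopped while ceph-objectstore-tool opens their stores, and the OSD ids and pgid are from my cluster):

# stop the source and target OSD daemons (each on its own host)
systemctl stop ceph-osd@441
systemctl stop ceph-osd@416

# source host: export the healthy copy of the PG
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-441 --no-mon-config --pgid 6.fcc --op export --file ./pg6fcc

# copy ./pg6fcc to the target host, then replace the broken copy there
ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-416 --pgid 6.fcc --force
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-416 --no-mon-config --op import --file ./pg6fcc

# restart both OSDs
systemctl start ceph-osd@441
systemctl start ceph-osd@416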

Actions #9

Updated by Thomas Le Gentil about 1 year ago

Hi,
I set the pool to size=1 and ran a data scraper to back up as much of the data as possible.
Then I deleted the pool.
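
Roughly what I ran (the pool name is a placeholder, and deleting a pool requires mon_allow_pool_delete to be enabled):

ceph osd pool set <pool> size 1
# ... scrape / back up whatever is still readable ...
ceph config set mon mon_allow_pool_delete true
ceph osd pool delete <pool> <pool> --yes-i-really-really-mean-it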

I tried manual backfilling, but the error was still present for me.

I had to move forward, so I cannot replicate this problem for now.
