Bug #57940 (closed)

ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed

Added by Thomas Le Gentil over 1 year ago. Updated about 1 year ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi, I am currently hitting the following crash:

A disk failed in my Ceph cluster.
I replaced the disk, but now, during the rebalancing / backfilling, one OSD crashes (osd.1).

When I set the 'nobackfill' flag, the OSD does not crash, and it crashes again right after the flag is removed.
The crash in the log looks like https://tracker.ceph.com/issues/56772
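
For reference, I toggle the flag cluster-wide with the standard ceph CLI; this is just a sketch of what I run, nothing here is specific to this cluster:

# pause backfill: osd.1 stays up
ceph osd set nobackfill
# resume backfill: osd.1 crashes shortly afterwards
ceph osd unset nobackfill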

I've attached the complete log; here is the last part of the crash:

ceph version 17.2.4 (b26dd582fcc41389ea06191f19e88eed6eccea5b) quincy (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7ff3bc315140]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x565256aaffca]
5: /usr/bin/ceph-osd(+0xc2310e) [0x565256ab010e]
6: (SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x565256df05f3]
7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x565256c9a94e]
8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x19f3) [0x565256d05543]
9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf2a) [0x565256d0b42a]
10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x565256b7a175]
11: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x565256e34879]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xad0) [0x565256b9abc0]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x56525727dc1a]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5652572801f0]
15: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7ff3bc309ea7]
16: clone()

    -1> 2022-10-25T21:05:48.188+0200 7ff39e1b7700 -1 ./src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7ff39e1b7700 time 2022-10-25T21:05:48.184867+0200
./src/osd/osd_types.cc: 5888: FAILED ceph_assert(clone_overlap.count(clone))

 ceph version 17.2.4 (b26dd582fcc41389ea06191f19e88eed6eccea5b) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x565256aaff70]
 2: /usr/bin/ceph-osd(+0xc2310e) [0x565256ab010e]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xe3) [0x565256df05f3]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x23e) [0x565256c9a94e]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x19f3) [0x565256d05543]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xf2a) [0x565256d0b42a]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x295) [0x565256b7a175]
 8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x565256e34879]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xad0) [0x565256b9abc0]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x41a) [0x56525727dc1a]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5652572801f0]
 12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7ff3bc309ea7]
 13: clone()

Files

issue-ceph.txt (926 KB) - Thomas Le Gentil, 10/27/2022 06:07 PM

Related issues 1 (1 open, 0 closed)

Is duplicate of RADOS - Bug #56772: crash: uint64_t SnapSet::get_clone_bytes(snapid_t) const: assert(clone_overlap.count(clone)) (New)

Actions #1

Updated by Neha Ojha over 1 year ago

  • Project changed from Ceph to RADOS
  • Subject changed from ceph osd crashes when nobackfill OSD flag is removed to ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill OSD flag is removed
  • Description updated (diff)
  • Category deleted (OSD)
Actions #2

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Duplicate

Looks like a duplicate of #56772.

Actions #3

Updated by Radoslaw Zarzynski over 1 year ago

  • Is duplicate of Bug #56772: crash: uint64_t SnapSet::get_clone_bytes(snapid_t) const: assert(clone_overlap.count(clone)) added
Actions #4

Updated by Thomas Le Gentil over 1 year ago

The OSD process does not crash if it is marked 'out'.

Actions #5

Updated by Thomas Le Gentil over 1 year ago

Thomas Le Gentil wrote:

The OSD process does not crash if it is marked 'out'.

Sorry, that was wrong: the OSD also crashes when it is marked 'out'.

Actions #6

Updated by Thomas Le Gentil over 1 year ago

I could avoid this crash by removing every copy of the PGs for which Ceph could not get the clone_bytes, except the one copy I was sure held good data.

Removing pg 17.0:
take the OSD down, then on each OSD that contains the PG, run "ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-6 --pgid 17.0 --force"

Keeping the good copy:
take the OSD down, then on the 'good' OSD run "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 17.0 --op mark-complete --no-mon-config"

Then Ceph backfills new copies of this PG without having to compute the clone_bytes.
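
Put together, the sequence looked roughly like this on my setup (the 'noout' flag and the systemctl steps are just how I take an OSD down on my hosts, not something the tool requires in exactly this form; the OSD ids and the pgid are from my cluster):

# avoid extra data movement while OSDs are stopped
ceph osd set noout

# on each OSD holding a copy I do not trust (osd.6 here): stop it and drop pg 17.0
systemctl stop ceph-osd@6
ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-6 --pgid 17.0 --force
systemctl start ceph-osd@6

# on the OSD holding the good copy (osd.1): stop it and mark the PG complete
systemctl stop ceph-osd@1
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 17.0 --op mark-complete --no-mon-config
systemctl start ceph-osd@1

ceph osd unset noout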

Actions #7

Updated by Thomas Le Gentil over 1 year ago

Thomas Le Gentil wrote:

I could avoid this crash by removing every copy of the PGs for which Ceph could not get the clone_bytes, except the one copy I was sure held good data.

Removing pg 17.0:
take the OSD down, then on each OSD that contains the PG, run "ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-6 --pgid 17.0 --force"

Keeping the good copy:
take the OSD down, then on the 'good' OSD run "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 17.0 --op mark-complete --no-mon-config"

Then Ceph backfills new copies of this PG without having to compute the clone_bytes.

In fact, this did not work for some reason :( The OSD did not crash for several days, but then it did.

Actions #8

Updated by Huy Nguyen about 1 year ago

Thomas Le Gentil wrote:

In fact, this did not work for some reason :( The OSD did not crash for several days, but then it did.

Hi, I have the same issue as you. What is the current status of your pool?
This bug can be worked around temporarily by backfilling manually:

On the source OSD, export the PG:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-441 --no-mon-config --pgid 6.fcc --op export --file ./pg6fcc

On the target OSD, delete the PG and import it from the file:

ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-416 --pgid 6.fcc --force
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-416 --no-mon-config --op import --file ./pg6fcc
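
For context, the whole cycle per PG looks roughly like this on my side (the systemctl steps and the file copy are just how I do it here; the OSD daemons have to be stopped while ceph-objectstore-tool opens their stores, and the OSD ids and pgid are from my cluster):

# stop the source and target OSD daemons (each on its own host)
systemctl stop ceph-osd@441
systemctl stop ceph-osd@416

# source host: export the healthy copy of the PG
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-441 --no-mon-config --pgid 6.fcc --op export --file ./pg6fcc

# copy ./pg6fcc to the target host, then replace the broken copy there
ceph-objectstore-tool --op remove --data-path /var/lib/ceph/osd/ceph-416 --pgid 6.fcc --force
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-416 --no-mon-config --op import --file ./pg6fcc

# restart both OSDs
systemctl start ceph-osd@441
systemctl start ceph-osd@416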

Actions #9

Updated by Thomas Le Gentil about 1 year ago

Hi,
I set the pool to size=1 and ran a data scraper to back up as much of the data as possible.
Then I deleted the pool.
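
Roughly what I ran (the pool name is a placeholder, and deleting a pool requires mon_allow_pool_delete to be enabled):

ceph osd pool set <pool> size 1
# ... scrape / back up whatever is still readable ...
ceph config set mon mon_allow_pool_delete true
ceph osd pool delete <pool> <pool> --yes-i-really-really-mean-it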

I tried manual backfilling, but the error was still present for me.

I had to move forward, so I cannot replicate this problem for now.
