Bug #50462

closed

OSDs crash in osd/osd_types.cc: FAILED ceph_assert(clone_overlap.count(clone))

Added by Martin Steinigen about 3 years ago. Updated over 1 year ago.

Status:
Won't Fix - EOL
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-disk
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The issue started on Luminous and looked like an instance of https://tracker.ceph.com/issues/23030, so we decided to upgrade to the latest Luminous release.
Initial ceph -s output: ceph-s

Initial OSD errors:

assertion

That did not fix the issue, so we upgraded to Mimic and then to Nautilus.
The upgrades went smoothly, but the affected placement group is still broken.
The symptom is that whenever an OSD touches the placement group for replication/recovery, it fails with an assertion: {{ (assertion)
OSD 39

Mar 06 14:32:28 m62r1 ceph-osd24836: /build/ceph-14.2.16/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
Mar 06 14:32:28 m62r1 ceph-osd24836: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:28 m62r1 ceph-osd24836: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x846d7e]
Mar 06 14:32:28 m62r1 ceph-osd24836: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:28 m62r1 ceph-osd24836: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:28 m62r1 ceph-osd24836: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:28 m62r1 ceph-osd24836: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:28 m62r1 ceph-osd24836: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:28 m62r1 ceph-osd24836: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:28 m62r1 ceph-osd24836: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:28 m62r1 ceph-osd24836: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:28 m62r1 ceph-osd24836: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:28 m62r1 ceph-osd24836: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:28 m62r1 ceph-osd24836: 12: (()+0x76ba) [0x7f8118dbc6ba]
Mar 06 14:32:28 m62r1 ceph-osd24836: 13: (clone()+0x6d) [0x7f81183c34dd]
Mar 06 14:32:28 m62r1 ceph-osd24836: 0> 2021-03-06 14:32:27.878 7f80e2fee700 -1 *** Caught signal (Aborted) **
Mar 06 14:32:28 m62r1 ceph-osd24836: in thread 7f80e2fee700 thread_name:tp_osd_tp
Mar 06 14:32:28 m62r1 ceph-osd24836: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:28 m62r1 ceph-osd24836: 1: (()+0x11390) [0x7f8118dc6390]
Mar 06 14:32:28 m62r1 ceph-osd24836: 2: (gsignal()+0x38) [0x7f81182f1438]
Mar 06 14:32:28 m62r1 ceph-osd24836: 3: (abort()+0x16a) [0x7f81182f303a]
Mar 06 14:32:28 m62r1 ceph-osd24836: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x846dcf]
Mar 06 14:32:28 m62r1 ceph-osd24836: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:28 m62r1 ceph-osd24836: 6: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:28 m62r1 ceph-osd24836: 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:28 m62r1 ceph-osd24836: 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:28 m62r1 ceph-osd24836: 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:28 m62r1 ceph-osd24836: 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:28 m62r1 ceph-osd24836: 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:28 m62r1 ceph-osd24836: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:28 m62r1 ceph-osd24836: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:28 m62r1 ceph-osd24836: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:28 m62r1 ceph-osd24836: 15: (()+0x76ba) [0x7f8118dbc6ba]
Mar 06 14:32:28 m62r1 ceph-osd24836: 16: (clone()+0x6d) [0x7f81183c34dd]
Mar 06 14:32:28 m62r1 ceph-osd24836: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar 06 14:33:53 m62r1 systemd1: Stopped Ceph object storage daemon osd.39.
Mar 06 14:33:53 m62r1 systemd1: Starting Ceph object storage daemon osd.39...
Mar 06 14:33:54 m62r1 systemd1: Started Ceph object storage daemon osd.39.

OSD 17

Mar 06 14:32:02 m65r1 ceph-osd22262: /build/ceph-14.2.16/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fb1e2c14700 time 2021-03-06 14:32:02.411755
Mar 06 14:32:02 m65r1 ceph-osd22262: /build/ceph-14.2.16/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x846d7e]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2021-03-06 14:32:02.413 7fb1e2c14700 -1 /build/ceph-14.2.16/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fb1e2c14700 time 2021-03-06 14:32:02.411755
Mar 06 14:32:02 m65r1 ceph-osd22262: /build/ceph-14.2.16/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x846d7e]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: *** Caught signal (Aborted) **
Mar 06 14:32:02 m65r1 ceph-osd22262: in thread 7fb1e2c14700 thread_name:tp_osd_tp
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (()+0x11390) [0x7fb209b47390]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (gsignal()+0x38) [0x7fb209072428]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (abort()+0x16a) [0x7fb20907402a]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x846dcf]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 15: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 16: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2021-03-06 14:32:02.421 7fb1e2c14700 -1 *** Caught signal (Aborted) **
Mar 06 14:32:02 m65r1 ceph-osd22262: in thread 7fb1e2c14700 thread_name:tp_osd_tp
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (()+0x11390) [0x7fb209b47390]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (gsignal()+0x38) [0x7fb209072428]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (abort()+0x16a) [0x7fb20907402a]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x846dcf]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 15: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 16: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar 06 14:32:02 m65r1 ceph-osd22262: /build/ceph-14.2.16/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x846d7e]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: 0> 2021-03-06 14:32:02.421 7fb1e2c14700 -1 *** Caught signal (Aborted) **
Mar 06 14:32:02 m65r1 ceph-osd22262: in thread 7fb1e2c14700 thread_name:tp_osd_tp
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (()+0x11390) [0x7fb209b47390]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (gsignal()+0x38) [0x7fb209072428]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (abort()+0x16a) [0x7fb20907402a]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x846dcf]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 15: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 16: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar 06 14:32:02 m65r1 ceph-osd22262: /build/ceph-14.2.16/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x846d7e]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: 0> 2021-03-06 14:32:02.421 7fb1e2c14700 -1 *** Caught signal (Aborted) **
Mar 06 14:32:02 m65r1 ceph-osd22262: in thread 7fb1e2c14700 thread_name:tp_osd_tp
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (()+0x11390) [0x7fb209b47390]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (gsignal()+0x38) [0x7fb209072428]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (abort()+0x16a) [0x7fb20907402a]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x846dcf]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 15: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 16: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar 06 14:34:12 m65r1 systemd1: Stopped Ceph object storage daemon osd.17.

}}
This has forced us to set the norecover flag on the cluster to keep at least some availability (it stops OSDs from crashing when they touch the PG during syncs), but that is not a solution.
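
For readers unfamiliar with the failing check: the traces above abort in `SnapSet::get_clone_bytes(snapid_t)` on `ceph_assert(clone_overlap.count(clone))`, which means the function was asked about a clone that has no entry in `clone_overlap`. The following is a minimal, self-contained sketch of that invariant, not the Ceph source. `MiniSnapSet`, `clone_bytes`, the `clone_size` map, and the (offset, length) extent layout are simplified assumptions made for illustration, and the sketch returns an empty optional where the real code aborts the OSD.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for Ceph's SnapSet (illustration only).
using SnapId = std::uint64_t;
// Assumed layout: each extent is (offset, length) of bytes shared with the next newer version.
using Extents = std::vector<std::pair<std::uint64_t, std::uint64_t>>;

struct MiniSnapSet {
    std::map<SnapId, std::uint64_t> clone_size;  // total size per clone
    std::map<SnapId, Extents> clone_overlap;     // shared extents per clone

    // Bytes owned exclusively by `clone`. Returns std::nullopt in the cases
    // where the real function would hit an assertion and abort.
    std::optional<std::uint64_t> clone_bytes(SnapId clone) const {
        const auto size_it = clone_size.find(clone);
        const auto overlap_it = clone_overlap.find(clone);
        // Mirrors the failed check: the clone must have an overlap entry.
        if (size_it == clone_size.end() || overlap_it == clone_overlap.end()) {
            return std::nullopt;
        }
        std::uint64_t overlapped = 0;
        for (const auto& extent : overlap_it->second) {
            overlapped += extent.second;  // add the extent length
        }
        if (overlapped > size_it->second) {
            return std::nullopt;  // inconsistent metadata
        }
        return size_it->second - overlapped;
    }
};
```

In this model, a snapshot set that lists a clone in `clone_size` but has lost its `clone_overlap` entry yields an empty result for that clone. In the reported crash the equivalent condition is fatal because it is checked with an assertion during backfill statistics accounting (`add_object_context_to_pg_stat` in the traces).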


Files

ceph-bluestore-tool.fsck.out (8.85 KB), Ana Aviles, 04/29/2021 09:46 AM
osd21.log (373 KB), Ana Aviles, 04/29/2021 09:49 AM