Bug #50462

closed

OSDs crash in osd/osd_types.cc: FAILED ceph_assert(clone_overlap.count(clone))

Added by Martin Steinigen about 3 years ago. Updated over 1 year ago.

Status:
Won't Fix - EOL
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-disk
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The issue started on luminous and looked like an instance of https://tracker.ceph.com/issues/23030, so we decided to upgrade to the latest luminous.
The initial `ceph -s` output: ceph-s

Initial OSD errors: assertion

That did not fix the issue, so we updated to mimic and then to nautilus.
The upgrades went smoothly, but the affected placement group is still broken.
The symptom is that whenever an OSD touches the placement group for replication/recovery, it fails with an assertion:
OSD 39

Mar 06 14:32:28 m62r1 ceph-osd24836: /build/ceph-14.2.16/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
Mar 06 14:32:28 m62r1 ceph-osd24836: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:28 m62r1 ceph-osd24836: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x846d7e]
Mar 06 14:32:28 m62r1 ceph-osd24836: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:28 m62r1 ceph-osd24836: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:28 m62r1 ceph-osd24836: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:28 m62r1 ceph-osd24836: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:28 m62r1 ceph-osd24836: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:28 m62r1 ceph-osd24836: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:28 m62r1 ceph-osd24836: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:28 m62r1 ceph-osd24836: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:28 m62r1 ceph-osd24836: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:28 m62r1 ceph-osd24836: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:28 m62r1 ceph-osd24836: 12: (()+0x76ba) [0x7f8118dbc6ba]
Mar 06 14:32:28 m62r1 ceph-osd24836: 13: (clone()+0x6d) [0x7f81183c34dd]
Mar 06 14:32:28 m62r1 ceph-osd24836: 0> 2021-03-06 14:32:27.878 7f80e2fee700 -1 *** Caught signal (Aborted) **
Mar 06 14:32:28 m62r1 ceph-osd24836: in thread 7f80e2fee700 thread_name:tp_osd_tp
Mar 06 14:32:28 m62r1 ceph-osd24836: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:28 m62r1 ceph-osd24836: 1: (()+0x11390) [0x7f8118dc6390]
Mar 06 14:32:28 m62r1 ceph-osd24836: 2: (gsignal()+0x38) [0x7f81182f1438]
Mar 06 14:32:28 m62r1 ceph-osd24836: 3: (abort()+0x16a) [0x7f81182f303a]
Mar 06 14:32:28 m62r1 ceph-osd24836: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x846dcf]
Mar 06 14:32:28 m62r1 ceph-osd24836: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:28 m62r1 ceph-osd24836: 6: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:28 m62r1 ceph-osd24836: 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:28 m62r1 ceph-osd24836: 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:28 m62r1 ceph-osd24836: 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:28 m62r1 ceph-osd24836: 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:28 m62r1 ceph-osd24836: 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:28 m62r1 ceph-osd24836: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:28 m62r1 ceph-osd24836: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:28 m62r1 ceph-osd24836: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:28 m62r1 ceph-osd24836: 15: (()+0x76ba) [0x7f8118dbc6ba]
Mar 06 14:32:28 m62r1 ceph-osd24836: 16: (clone()+0x6d) [0x7f81183c34dd]
Mar 06 14:32:28 m62r1 ceph-osd24836: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar 06 14:33:53 m62r1 systemd1: Stopped Ceph object storage daemon osd.39.
Mar 06 14:33:53 m62r1 systemd1: Starting Ceph object storage daemon osd.39...
Mar 06 14:33:54 m62r1 systemd1: Started Ceph object storage daemon osd.39.

OSD 17

Mar 06 14:32:02 m65r1 ceph-osd22262: /build/ceph-14.2.16/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fb1e2c14700 time 2021-03-06 14:32:02.411755
Mar 06 14:32:02 m65r1 ceph-osd22262: /build/ceph-14.2.16/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x846d7e]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2021-03-06 14:32:02.413 7fb1e2c14700 -1 /build/ceph-14.2.16/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fb1e2c14700 time 2021-03-06 14:32:02.411755
Mar 06 14:32:02 m65r1 ceph-osd22262: /build/ceph-14.2.16/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x846d7e]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: *** Caught signal (Aborted) **
Mar 06 14:32:02 m65r1 ceph-osd22262: in thread 7fb1e2c14700 thread_name:tp_osd_tp
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (()+0x11390) [0x7fb209b47390]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (gsignal()+0x38) [0x7fb209072428]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (abort()+0x16a) [0x7fb20907402a]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x846dcf]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 15: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 16: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2021-03-06 14:32:02.421 7fb1e2c14700 -1 *** Caught signal (Aborted) **
Mar 06 14:32:02 m65r1 ceph-osd22262: in thread 7fb1e2c14700 thread_name:tp_osd_tp
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (()+0x11390) [0x7fb209b47390]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (gsignal()+0x38) [0x7fb209072428]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (abort()+0x16a) [0x7fb20907402a]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x846dcf]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 15: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 16: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar 06 14:32:02 m65r1 ceph-osd22262: /build/ceph-14.2.16/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x846d7e]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: 0> 2021-03-06 14:32:02.421 7fb1e2c14700 -1 *** Caught signal (Aborted) **
Mar 06 14:32:02 m65r1 ceph-osd22262: in thread 7fb1e2c14700 thread_name:tp_osd_tp
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (()+0x11390) [0x7fb209b47390]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (gsignal()+0x38) [0x7fb209072428]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (abort()+0x16a) [0x7fb20907402a]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x846dcf]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 15: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 16: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar 06 14:32:02 m65r1 ceph-osd22262: /build/ceph-14.2.16/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x846d7e]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: 0> 2021-03-06 14:32:02.421 7fb1e2c14700 -1 *** Caught signal (Aborted) **
Mar 06 14:32:02 m65r1 ceph-osd22262: in thread 7fb1e2c14700 thread_name:tp_osd_tp
Mar 06 14:32:02 m65r1 ceph-osd22262: ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
Mar 06 14:32:02 m65r1 ceph-osd22262: 1: (()+0x11390) [0x7fb209b47390]
Mar 06 14:32:02 m65r1 ceph-osd22262: 2: (gsignal()+0x38) [0x7fb209072428]
Mar 06 14:32:02 m65r1 ceph-osd22262: 3: (abort()+0x16a) [0x7fb20907402a]
Mar 06 14:32:02 m65r1 ceph-osd22262: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x846dcf]
Mar 06 14:32:02 m65r1 ceph-osd22262: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x846f59]
Mar 06 14:32:02 m65r1 ceph-osd22262: 6: (SnapSet::get_clone_bytes(snapid_t) const+0xd2) [0xb2fad2]
Mar 06 14:32:02 m65r1 ceph-osd22262: 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2d0) [0xa50ff0]
Mar 06 14:32:02 m65r1 ceph-osd22262: 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x163b) [0xa8493b]
Mar 06 14:32:02 m65r1 ceph-osd22262: 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xec2) [0xa88d02]
Mar 06 14:32:02 m65r1 ceph-osd22262: 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x366) [0x8c9986]
Mar 06 14:32:02 m65r1 ceph-osd22262: 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0xb83599]
Mar 06 14:32:02 m65r1 ceph-osd22262: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x8e81dd]
Mar 06 14:32:02 m65r1 ceph-osd22262: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xf014ac]
Mar 06 14:32:02 m65r1 ceph-osd22262: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xf04670]
Mar 06 14:32:02 m65r1 ceph-osd22262: 15: (()+0x76ba) [0x7fb209b3d6ba]
Mar 06 14:32:02 m65r1 ceph-osd22262: 16: (clone()+0x6d) [0x7fb20914441d]
Mar 06 14:32:02 m65r1 ceph-osd22262: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Mar 06 14:34:12 m65r1 systemd1: Stopped Ceph object storage daemon osd.17.

This has forced us to set norecovery on the cluster in order to keep at least some availability (it stops the OSDs from crashing when they touch the PG during recovery), but that is no solution.
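
As background on the assert itself (not part of the original report): SnapSet::get_clone_bytes() calls ceph_assert(clone_overlap.count(clone)), so it aborts whenever a clone listed in an object's SnapSet has no matching entry in its clone_overlap map. A hedged way to look for such an object offline is to dump SnapSets with ceph-objectstore-tool while the OSD is stopped; the OSD id, PG id, and object spec below are placeholders, not values from this cluster:

    # stop the OSD that holds a copy of the broken PG (placeholder id)
    systemctl stop ceph-osd@39

    # list the objects in the affected PG (placeholder PG id)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-39 --op list --pgid 2.1a

    # dump one object, including its SnapSet with clones and clone_overlap,
    # using the JSON object spec printed by the list above
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-39 '<object-json-from-list>' dump

    # restart the OSD afterwards
    systemctl start ceph-osd@39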


Files

ceph-bluestore-tool.fsck.out (8.85 KB) ceph-bluestore-tool.fsck.out Ana Aviles, 04/29/2021 09:46 AM
osd21.log (373 KB) osd21.log Ana Aviles, 04/29/2021 09:49 AM
Actions #1

Updated by Martin Steinigen about 3 years ago

Sorry for the bad formatting.


Actions #2

Updated by Martin Steinigen about 3 years ago

Finally, the correct format.


Actions #3

Updated by Ana Aviles almost 3 years ago

We ran into the same assert over and over on one OSD. We were upgrading from luminous to nautilus (ceph version 14.2.20) and had a mix of filestore and bluestore OSDs. The problems were always on bluestore OSDs.

During the upgrade two OSDs were crashing for different reasons, causing some PGs to be down. The OSD wouldn't come up, stopping on this assert. In the end we had to force_recreate the PGs to allow client I/O to continue and tried to restore as much data as we could afterwards. That was a difficult phase because exporting the inactive PGs with `ceph-objectstore-tool` would fail with `export_files error -5` and would not export the whole PG.
Trying to fix the OSD consistency with `ceph-bluestore-tool` resulted in a core dump; I attach the output.

I am adding the logs we gathered for the assert in case they are of any help. The log with the default debug level is attached here, and I uploaded the log with debug level 20 under this id: b11325b3-0f3c-4af5-b349-64fe912bf984
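
For reference, a hedged sketch of the export/import path described above; the OSD ids, PG id, and file path are placeholders, not the exact commands used on this cluster:

    # on the source OSD host, with that OSD stopped
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 --op export --pgid 2.1a --file /tmp/pg2.1a.export

    # on a healthy OSD host, also with that OSD stopped
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 --op import --file /tmp/pg2.1a.export

    # the attached fsck output presumably came from a consistency check of this form
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-21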

Actions #4

Updated by Sage Weil almost 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
  • Priority changed from Normal to High
Actions #5

Updated by Igor Fedotov almost 3 years ago

Just to note:
IMO the ceph-bluestore-tool crash is caused by a bug in AvlAllocator and is a duplicate of https://tracker.ceph.com/issues/50555, hence it will be fixed in 14.2.22.
It looks completely unrelated to the original crash reported in this ticket, though.

Actions #6

Updated by Neha Ojha almost 3 years ago

  • Subject changed from OSDs crashing while PG operations to OSDs crash in osd/osd_types.cc: FAILED ceph_assert(clone_overlap.count(clone))
Actions #7

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Won't Fix - EOL

Please feel free to reopen if you see the issue in a recent version of Ceph.

Actions #8

Updated by Justin Mammarella almost 2 years ago

We are seeing this bug on a replicated pool in Nautilus, versions 14.2.15 through 14.2.22.

Two of our OSDs are stuck in a crash loop while trying to backfill. We have disabled backfills for now and are looking at removing the offending object from the pool (see the hedged sketch after the log below). If that fails we will try to backfill to the target OSD (479) manually.

    -6> 2022-05-23 11:41:33.251 7f15245c1700  5 osd.262 pg_epoch: 685308 pg[6.244a( v 685307'35436378 (685280'35433366,685307'35436378] local-lis/les=685306/685307 n=14244 ec=418948/2707 lis/c 685306/683632 les/c/f 685307/683636/0 685305/685306/685306) [262,146,479]/[262,146] backfill=[479] r=0 lpr=685306 pi=[683632,685306)/3 rops=1 crt=685307'35436378 lcod 685307'35436377 mlcod 685307'35436377 active+undersized+degraded+remapped+backfilling mbc={255={}} trimq=[5ce7~1,5e9b~1,5ef5~1,6039~1,8509~1,9f87~1,9f8f~1,9f91~1,9f97~1,9f99~1,9f9b~1,9f9d~1]] backfill_pos is 6:52245a01:::rbd_data.8a845579e2a9e3.0000000000b06c49:head
 
    -1> 2022-05-23 11:41:33.327 7f15245c1700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f15245c1700 time 2022-05-23 11:41:33.325202
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/osd_types.cc: 5450: FAILED ceph_assert(clone_overlap.count(clone))
 
 ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55a68b6f9393]
 2: (()+0x4da55b) [0x55a68b6f955b]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55a68ba15b62]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x28c) [0x55a68b947b2c]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xf65) [0x55a68b976a05]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x114c) [0x55a68b97a82c]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x2ff) [0x55a68b7d9fcf]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55a68ba69fd9]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55a68b7f5d1f]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a68bdb4006]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a68bdb6b20]
 12: (()+0x7ea5) [0x7f1546778ea5]
 13: (clone()+0x6d) [0x7f154563b9fd]
 
     0> 2022-05-23 11:41:33.331 7f15245c1700 -1 *** Caught signal (Aborted) **
 in thread 7f15245c1700 thread_name:tp_osd_tp
 
 ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
 1: (()+0xf630) [0x7f1546780630]
 2: (gsignal()+0x37) [0x7f15455733d7]
 3: (abort()+0x148) [0x7f1545574ac8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55a68b6f93e2]
 5: (()+0x4da55b) [0x55a68b6f955b]
 6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55a68ba15b62]
 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x28c) [0x55a68b947b2c]
 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xf65) [0x55a68b976a05]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x114c) [0x55a68b97a82c]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x2ff) [0x55a68b7d9fcf]
 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55a68ba69fd9]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55a68b7f5d1f]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a68bdb4006]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a68bdb6b20]
 15: (()+0x7ea5) [0x7f1546778ea5]
 16: (clone()+0x6d) [0x7f154563b9fd]
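
If removing the offending object means dropping its inconsistent clone metadata, a hedged sketch with ceph-objectstore-tool could look like the following. The OSD must be stopped; the object and PG ids are taken from the log above, while <object-json> and <cloneid> are placeholders that would have to come from the list and dump output. This is not a verified procedure for this cluster:

    systemctl stop ceph-osd@262

    # find the exact object spec for the object named at backfill_pos
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-262 --op list --pgid 6.244a | grep 0000000000b06c49

    # inspect its SnapSet to see which clone is missing from clone_overlap
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-262 '<object-json>' dump

    # remove the metadata of the offending clone (placeholder clone id)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-262 '<object-json>' remove-clone-metadata <cloneid>

    systemctl start ceph-osd@262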

Actions #9

Updated by hoan nv over 1 year ago

Justin Mammarella wrote:

We are seeing this bug on a replicated pool in Nautilus, versions 14.2.15 through 14.2.22.

Two of our OSDs are stuck in a crash loop while trying to backfill. We have disabled backfills for now and are looking at removing the offending object from the pool. If that fails we will try to backfill to the target OSD (479) manually.

How did you manually backfill to the target OSD? Can you share the command or a guide?
Thanks
