Bug #50608

ceph_assert(is_primary()) in PrimaryLogPG::on_local_recover

Added by Tobias Urdin almost 3 years ago. Updated almost 2 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This happened on a cluster running Nautilus 14.2.18/14.2.19 while it was being upgraded to Octopus 15.2.11: some 3-4 OSDs crashed with this assert. If I recall correctly, some of the OSDs that crashed were already on 15.2.11.

crash1.txt (74.3 KB) Tobias Urdin, 05/03/2021 12:04 PM

osd.90.txt (67.9 KB) Tobias Urdin, 11/15/2021 03:27 PM

osd.119.txt (28.7 KB) Tobias Urdin, 11/15/2021 03:27 PM

History

#1 Updated by Tobias Urdin almost 3 years ago

#2 Updated by Tobias Urdin almost 3 years ago

#3 Updated by Tobias Urdin almost 3 years ago

Longer debug_osd output in crash1.txt file.

    -1> 2021-04-29T13:51:57.756+0200 7fa2edae2700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.11/rpm/el7/BUILD/ceph-15.2.11/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fa2edae2700 time 2021-04-29T13:51:57.751899+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.11/rpm/el7/BUILD/ceph-15.2.11/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())

 ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x55a83c7ed07f]
 2: (()+0x4de247) [0x55a83c7ed247]
 3: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a83c9bb4cc]
 4: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a83cbe6dd2]
 5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a83cbe7193]
 6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a83cbefcd8]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a83ca8b94a]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a83ca31a2b]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a83c8d1269]
 10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a83cb0cd29]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a83c8ecaca]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a83ced7086]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a83ced9bd0]
 14: (()+0x7ea5) [0x7fa30fcdfea5]
 15: (clone()+0x6d) [0x7fa30eba29fd]

     0> 2021-04-29T13:51:57.761+0200 7fa2edae2700 -1 *** Caught signal (Aborted) **
 in thread 7fa2edae2700 thread_name:tp_osd_tp

 ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
 1: (()+0xf630) [0x7fa30fce7630]
 2: (gsignal()+0x37) [0x7fa30eada3d7]
 3: (abort()+0x148) [0x7fa30eadbac8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55a83c7ed0ce]
 5: (()+0x4de247) [0x55a83c7ed247]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a83c9bb4cc]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a83cbe6dd2]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a83cbe7193]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a83cbefcd8]
 10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a83ca8b94a]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a83ca31a2b]
 12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a83c8d1269]
 13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a83cb0cd29]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a83c8ecaca]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a83ced7086]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a83ced9bd0]
 17: (()+0x7ea5) [0x7fa30fcdfea5]
 18: (clone()+0x6d) [0x7fa30eba29fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 0 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 0 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_rwl
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 5 immutable_obj_cache
   0/ 0 client
  30/30 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   0/ 0 crypto
   0/ 0 finisher
   0/ 0 reserver
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 rgw_sync
   0/ 0 civetweb
   0/ 0 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   0/ 0 compressor
   0/ 0 bluestore
   0/ 0 bluefs
   0/ 0 bdev
   0/ 0 kstore
   0/ 0 rocksdb
   0/ 0 leveldb
   0/ 0 memdb
   0/ 0 fuse
   0/ 0 mgr
   0/ 0 mgrc
   0/ 0 dpdk
   0/ 0 eventtrace
   1/ 5 prioritycache
   0/ 5 test
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7fa2ecae0700 / tp_osd_tp
  7fa2ed2e1700 / tp_osd_tp
  7fa2edae2700 / tp_osd_tp
  7fa2ee2e3700 / tp_osd_tp
  7fa2eeae4700 / tp_osd_tp
  7fa2ef2e5700 / osd_srv_agent
  7fa2f7af6700 / ms_dispatch
  7fa2f82f7700 / safe_timer
  7fa2f9afa700 / cfin
  7fa2fe303700 / ms_dispatch
  7fa2ff305700 / bstore_mempool
  7fa306b14700 / safe_timer
  7fa307b16700 / safe_timer
  7fa30b31d700 / msgr-worker-2
  7fa30bb1e700 / msgr-worker-1
  7fa30c31f700 / msgr-worker-0
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.47.log
--- end dump of recent events ---

#4 Updated by Neha Ojha almost 3 years ago

  • Priority changed from Normal to Urgent

#5 Updated by Neha Ojha almost 3 years ago

The crashed OSD was running 15.2.11, but do you happen to know what version osd.59 (the primary for pg 6.7a) was running at the time of the crash? I will take a look into this.

#6 Updated by Tobias Urdin almost 3 years ago

Purely based on the numbering of the OSDs, I know for a fact that osd.47 was upgraded before osd.59, so osd.59 would have been running 14.2.18 at the time, but I don't have a way to guarantee that.

I know that we saw the same crash on OSDs running both Nautilus and Octopus, so the bug probably spans both versions.

#7 Updated by Neha Ojha almost 3 years ago

Do the OSDs hitting this assert come up fine on restarting? or are they repeatedly hitting this assert?

#8 Updated by Tobias Urdin almost 3 years ago

Neha Ojha wrote:

Do the OSDs hitting this assert come up fine on restarting? or are they repeatedly hitting this assert?

It starts up fine after the assert; I haven't seen anything fail to recover or stay down due to this issue.

#9 Updated by Tobias Urdin over 2 years ago

This is still an issue: it repeatedly hits this assert during recovery when upgrading the cluster, where some of the OSDs are already upgraded (ceph-osd on 15.2.15) while the crashing one is still on 15.2.11 and dies every time it tries to recover.

-1> 2021-11-15T14:10:23.081+0100 7f8ff2419700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.11/rpm/el7/BUILD/ceph-15.2.11/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f8ff2419700 time 2021-11-15T14:10:23.075731+0100
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.11/rpm/el7/BUILD/ceph-15.2.11/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x55e55b30107f]
2: (()+0x4de247) [0x55e55b301247]
3: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55e55b4cf4cc]
4: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55e55b6fadd2]
5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55e55b6fb193]
6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55e55b703cd8]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55e55b59f94a]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55e55b545a2b]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55e55b3e5269]
10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55e55b620d29]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55e55b400aca]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e55b9eb086]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e55b9edbd0]
14: (()+0x7ea5) [0x7f9013614ea5]
15: (clone()+0x6d) [0x7f90124d79fd]
0> 2021-11-15T14:10:23.087+0100 7f8ff2419700 -1 *** Caught signal (Aborted) **
in thread 7f8ff2419700 thread_name:tp_osd_tp
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
1: (()+0xf630) [0x7f901361c630]
2: (gsignal()+0x37) [0x7f901240f3d7]
3: (abort()+0x148) [0x7f9012410ac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55e55b3010ce]
5: (()+0x4de247) [0x55e55b301247]
6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55e55b4cf4cc]
7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55e55b6fadd2]
8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55e55b6fb193]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55e55b703cd8]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55e55b59f94a]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55e55b545a2b]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55e55b3e5269]
13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55e55b620d29]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55e55b400aca]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e55b9eb086]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e55b9edbd0]
17: (()+0x7ea5) [0x7f9013614ea5]
18: (clone()+0x6d) [0x7f90124d79fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 0 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
0/ 0 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 0 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_rwl
0/ 0 journaler
0/ 0 objectcacher
0/ 5 immutable_obj_cache
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
0/ 0 crypto
0/ 0 finisher
0/ 0 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/ 5 rgw_sync
0/ 0 civetweb
0/ 0 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
0/ 0 compressor
0/ 0 bluestore
0/ 0 bluefs
0/ 0 bdev
0/ 0 kstore
0/ 0 rocksdb
0/ 0 leveldb
0/ 0 memdb
0/ 0 fuse
0/ 0 mgr
0/ 0 mgrc
0/ 0 dpdk
0/ 0 eventtrace
1/ 5 prioritycache
0/ 5 test
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7f8ff2419700 / tp_osd_tp
7f9002c3a700 / bstore_mempool
7f9008c46700 / fn_anonymous
7f900cc4e700 / signal_handler
7f900e451700 / service
7f900ec52700 / msgr-worker-2
7f900f453700 / msgr-worker-1
7f900fc54700 / msgr-worker-0
7f90155a9bc0 / ceph-osd
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.107.log
--- end dump of recent events ---

#10 Updated by Tobias Urdin over 2 years ago

PG_STAT  STATE                                           UP            UP_PRIMARY  ACTING    ACTING_PRIMARY
6.0      active+recovering+undersized+degraded+remapped  [107,90,119]         107  [90,119]              90

osd.90 = Octopus 15.2.11
osd.107 = Octopus 15.2.11
osd.119 = Octopus 15.2.15

And osd.107 keeps crashing, so the PG never recovers; a while into recovery, osd.107 just dies on the same assert.

This only seems to happen during upgrades, since that is when recovery normally happens after restarting OSDs.

With debug_osd=5/5, osd.107 seems to be doing a lot of heartbeat osd_stat reporting and trying to catch up.

2021-11-15T14:55:23.419+0100 7f3eb4c69700  5 osd.107 1105828 heartbeat osd_stat(store_statfs(0x678ee000000/0x40000000/0x74e823fe000, data 0xa1d8e91ff4/0xcdd43f0000, compress 0x0/0x0/0x0, omap 0x271fc5a, meta 0x1b17de3a6), peers [3,10,17,24,31,40,42,51,56,77,78,86,88,89,90,98,106,108,114,117,119,120,123,124,126,128,130,195,197,199] op hist [0,0,0,0,2,5,8,3,6,18,17,4,6])
2021-11-15T14:56:12.944+0100 7f3ec53dc700  3 osd.107 1105831 handle_osd_map epochs [1105832,1105832], i have 1105831, src has [1105109,1105832]

For osd.90 it does something similar.

2021-11-15T14:56:03.930+0100 7f94f68b0700  0 log_channel(cluster) log [DBG] : purged_snaps scrub starts
2021-11-15T14:56:03.964+0100 7f94f68b0700  0 log_channel(cluster) log [DBG] : purged_snaps scrub found 6 strays
2021-11-15T14:58:12.629+0100 7f94e6890700  3 osd.90 1105835 handle_osd_map epochs [1105836,1105836], i have 1105835, src has [1105109,1105836]
2021-11-15T14:58:12.694+0100 7f94e6890700  3 osd.90 1105835 handle_osd_map epochs [1105836,1105836], i have 1105836, src has [1105109,1105836]
2021-11-15T14:58:12.694+0100 7f94e6890700  3 osd.90 1105835 handle_osd_map epochs [1105836,1105836], i have 1105836, src has [1105109,1105836]
2021-11-15T14:58:14.855+0100 7f94db079700  5 osd.90 1105836 heartbeat osd_stat(store_statfs(0x64e5e290000/0x40000000/0x74e823fe000, data 0xd3c8fd481b/0xf864150000, compress 0x0/0x0/0x0, omap 0x1a75147, meta 0x11b088eb9), peers [0,7,14,21,28,39,41,43,52,59,64,65,68,74,75,78,80,81,89,91,104,105,107,115,119,121,147,148,155] op hist [0,0,0,1,0,0,1,2,4,16,18,7,3])
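
As a side note for anyone following along: the mapping above (up primary osd.107 vs. acting primary osd.90) and the per-OSD versions can be re-checked with standard Ceph commands. A minimal sketch, assuming the pg id 6.0 from the listing above:

ceph pg dump pgs | grep '^6\.0'   # up/acting sets and state, as in the PG_STAT table above
ceph pg 6.0 query                 # detailed peering/recovery state for this PG
ceph osd versions                 # how many OSDs are still on 15.2.11 vs 15.2.15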

#11 Updated by Tobias Urdin over 2 years ago

After upgrading osd.107 to 15.2.15 and waiting two hours for it to recover 3,000 objects in a single PG, it crashed again with 5 objects left.

PG 6.0 is the one that is crashing all the time.

PG_STAT  STATE                                              UP            UP_PRIMARY  ACTING    ACTING_PRIMARY
6.0      active+recovery_wait+undersized+degraded+remapped  [107,90,119]         107  [90,119]              90
6.a         active+recovering+undersized+degraded+remapped   [58,50,107]          58   [58,50]              58
6.5         active+recovering+undersized+degraded+remapped  [107,128,89]         107  [89,128]              89

osd.107

   -2> 2021-11-15T16:16:46.253+0100 7f3eb746e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f3eb746e700 time 2021-11-15T16:16:46.244742+0100
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())

 ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x55a8cc3e8419]
 2: (()+0x4df5e1) [0x55a8cc3e85e1]
 3: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a8cc5b6ebc]
 4: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a8cc7e2622]
 5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a8cc7e29e3]
 6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a8cc7eb548]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a8cc68732a]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a8cc62d47b]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a8cc4ccb69]
 10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a8cc708609]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a8cc4e844a]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a8ccad4206]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a8ccad6d50]
 14: (()+0x7ea5) [0x7f3ed8669ea5]
 15: (clone()+0x6d) [0x7f3ed752c9fd]

    -1> 2021-11-15T16:16:46.258+0100 7f3ec7c8f700  5 prioritycache tune_memory target: 4294967296 mapped: 835739648 unmapped: 135315456 heap: 971055104 old mem: 2845415832 new mem: 2845415832
     0> 2021-11-15T16:16:46.264+0100 7f3eb746e700 -1 *** Caught signal (Aborted) **
 in thread 7f3eb746e700 thread_name:tp_osd_tp

 ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
 1: (()+0xf630) [0x7f3ed8671630]
 2: (gsignal()+0x37) [0x7f3ed74643d7]
 3: (abort()+0x148) [0x7f3ed7465ac8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55a8cc3e8468]
 5: (()+0x4df5e1) [0x55a8cc3e85e1]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a8cc5b6ebc]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a8cc7e2622]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a8cc7e29e3]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a8cc7eb548]
 10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a8cc68732a]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a8cc62d47b]
 12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a8cc4ccb69]
 13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a8cc708609]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a8cc4e844a]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a8ccad4206]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a8ccad6d50]
 17: (()+0x7ea5) [0x7f3ed8669ea5]
 18: (clone()+0x6d) [0x7f3ed752c9fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 0 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 0 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_rwl
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 5 immutable_obj_cache
   0/ 0 client
   5/ 5 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   0/ 0 crypto
   0/ 0 finisher
   0/ 0 reserver
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 rgw_sync
   0/ 0 civetweb
   0/ 0 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   0/ 0 compressor
   5/ 5 bluestore
   5/ 5 bluefs
   0/ 0 bdev
   0/ 0 kstore
   0/ 0 rocksdb
   0/ 0 leveldb
   0/ 0 memdb
   0/ 0 fuse
   0/ 0 mgr
   0/ 0 mgrc
   0/ 0 dpdk
   0/ 0 eventtrace
   1/ 5 prioritycache
   0/ 5 test
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f3eb4c69700 / osd_srv_heartbt
  7f3eb746e700 / tp_osd_tp
  7f3ec0480700 / ms_dispatch
  7f3ec2c85700 / bstore_kv_sync
  7f3ec53dc700 / ms_dispatch
  7f3ec7c8f700 / bstore_mempool
  7f3ed04a0700 / safe_timer
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.107.log
--- end dump of recent events ---

The osd.119 and osd.90 logs are attached.

#12 Updated by Neha Ojha over 2 years ago

Tobias Urdin wrote:

After upgrading osd.107 to 15.2.15 and waiting two hours for it to recover 3,000 objects in a single PG, it crashed again with 5 objects left.

PG 6.0 is the one that is crashing all the time.

Can you set debug_osd=20 and debug_ms=1 for all the replicas of this PG and share the logs using ceph-post-file?

[...]

osd.107
[...]

The osd.119 and osd.90 logs are attached.
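
The requested debug levels can be raised at runtime without restarting the OSDs. A minimal sketch, assuming the replicas of pg 6.0 are osd.90, osd.107 and osd.119 as listed earlier:

ceph tell osd.90 config set debug_osd 20   # repeat for osd.107 and osd.119
ceph tell osd.90 config set debug_ms 1
# reproduce the crash, revert the levels, then upload the resulting logs:
ceph-post-file /var/log/ceph/ceph-osd.90.log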

#13 Updated by Neha Ojha about 2 years ago

  • Priority changed from Urgent to High

#14 Updated by Neha Ojha about 2 years ago

  • Status changed from New to Need More Info

#15 Updated by Radoslaw Zarzynski almost 2 years ago

  • Priority changed from High to Normal

Lowering the priority further as there is no new info since the last time.

#16 Updated by Tobias Urdin almost 2 years ago

An OSD crashed with this assert again:

{
"archived": "2022-04-15 23:07:21.580173",
"assert_condition": "is_primary()",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc",
"assert_func": "virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)",
"assert_line": 384,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fedd1a4b700 time 2022-04-16T00:06:21.303068+0200\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"(()+0xf630) [0x7fedf4451630]",
"(gsignal()+0x37) [0x7fedf32443d7]",
"(abort()+0x148) [0x7fedf3245ac8]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55a153bfa468]",
"(()+0x4df5e1) [0x55a153bfa5e1]",
"(PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a153dc8ebc]",
"(ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a153ff4622]",
"(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a153ff49e3]",
"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a153ffd548]",
"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a153e9932a]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a153e3f47b]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a153cdeb69]",
"(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a153f1a609]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a153cfa44a]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a1542e6206]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a1542e8d50]",
"(()+0x7ea5) [0x7fedf4449ea5]",
"(clone()+0x6d) [0x7fedf330c9fd]"
],
"ceph_version": "15.2.15",
"crash_id": "2022-04-15T22:06:21.474052Z_810c9df1-581d-4ff8-8b2f-7c8e03e31d51",
"entity_name": "osd.109",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "7 (Core)",
"os_version_id": "7",
"process_name": "ceph-osd",
"stack_sig": "a1142657badf06abb0885b2bfc25c7cf0719904822ec7d99d19e413363b1ff7f",
"timestamp": "2022-04-15T22:06:21.474052Z",
"utsname_hostname": "censored",
"utsname_machine": "x86_64",
"utsname_release": "3.10.0-1160.24.1.el7.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Thu Apr 8 19:51:47 UTC 2021"
}
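
For reference, reports in this format are kept by the Ceph crash module and can be listed and inspected directly on the cluster. A minimal sketch, using the crash_id from the report above:

ceph crash ls      # list recorded crashes with their IDs and entities (e.g. osd.109)
ceph crash info 2022-04-15T22:06:21.474052Z_810c9df1-581d-4ff8-8b2f-7c8e03e31d51
ceph crash archive 2022-04-15T22:06:21.474052Z_810c9df1-581d-4ff8-8b2f-7c8e03e31d51   # mark as seen to clear the RECENT_CRASH warning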

#17 Updated by Tobias Urdin almost 2 years ago

2022-04-16T00:06:08.540+0200 7fedde264700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.109 down, but it is still running
2022-04-16T00:06:08.540+0200 7fedde264700 0 log_channel(cluster) log [DBG] : map e1166755 wrongly marked me down at e1166754
2022-04-16T00:06:11.018+0200 7fede9a7b700 -1 osd.109 1166756 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2022-04-16T00:06:21.460+0200 7fedd1a4b700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fedd1a4b700 time 2022-04-16T00:06:21.303068+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())

ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x55a153bfa419]
2: (()+0x4df5e1) [0x55a153bfa5e1]
3: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a153dc8ebc]
4: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a153ff4622]
5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a153ff49e3]
6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a153ffd548]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a153e9932a]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a153e3f47b]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a153cdeb69]
10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a153f1a609]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a153cfa44a]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a1542e6206]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a1542e8d50]
14: (()+0x7ea5) [0x7fedf4449ea5]
15: (clone()+0x6d) [0x7fedf330c9fd]

2022-04-16T00:06:21.477+0200 7fedd1a4b700 -1 *** Caught signal (Aborted) **
in thread 7fedd1a4b700 thread_name:tp_osd_tp

ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
1: (()+0xf630) [0x7fedf4451630]
2: (gsignal()+0x37) [0x7fedf32443d7]
3: (abort()+0x148) [0x7fedf3245ac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55a153bfa468]
5: (()+0x4df5e1) [0x55a153bfa5e1]
6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a153dc8ebc]
7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a153ff4622]
8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a153ff49e3]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a153ffd548]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a153e9932a]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a153e3f47b]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a153cdeb69]
13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a153f1a609]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a153cfa44a]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a1542e6206]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a1542e8d50]
17: (()+0x7ea5) [0x7fedf4449ea5]
18: (clone()+0x6d) [0x7fedf330c9fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

#18 Updated by Tobias Urdin almost 2 years ago

There are a lot of lines like this before the crash line in the log file:

-24> 2022-04-16T00:06:02.105+0200 7fede3a6f700  5 prioritycache tune_memory target: 4294967296 mapped: 3484655616 unmapped: 1090576384 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-23> 2022-04-16T00:06:03.109+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484442624 unmapped: 1090789376 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-22> 2022-04-16T00:06:04.113+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3485425664 unmapped: 1089806336 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-21> 2022-04-16T00:06:05.117+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3487580160 unmapped: 1087651840 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-20> 2022-04-16T00:06:06.121+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3492814848 unmapped: 1082417152 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-19> 2022-04-16T00:06:07.125+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3496845312 unmapped: 1078386688 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-18> 2022-04-16T00:06:08.129+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3502202880 unmapped: 1073029120 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-17> 2022-04-16T00:06:08.540+0200 7fedde264700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.109 down, but it is still running
-16> 2022-04-16T00:06:08.540+0200 7fedde264700 0 log_channel(cluster) log [DBG] : map e1166755 wrongly marked me down at e1166754
-15> 2022-04-16T00:06:09.136+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3499237376 unmapped: 1075994624 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-14> 2022-04-16T00:06:10.140+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3496075264 unmapped: 1079156736 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-13> 2022-04-16T00:06:11.018+0200 7fede9a7b700 -1 osd.109 1166756 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
-12> 2022-04-16T00:06:11.142+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3496026112 unmapped: 1079205888 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-11> 2022-04-16T00:06:12.168+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3496411136 unmapped: 1078820864 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-10> 2022-04-16T00:06:13.173+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3491913728 unmapped: 1083318272 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-9> 2022-04-16T00:06:14.179+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3490291712 unmapped: 1084940288 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-8> 2022-04-16T00:06:15.183+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484180480 unmapped: 1091051520 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-7> 2022-04-16T00:06:16.187+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484106752 unmapped: 1091125248 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-6> 2022-04-16T00:06:17.190+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484155904 unmapped: 1091076096 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-5> 2022-04-16T00:06:18.193+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484057600 unmapped: 1091174400 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-4> 2022-04-16T00:06:19.197+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484286976 unmapped: 1090945024 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-3> 2022-04-16T00:06:20.200+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484229632 unmapped: 1091002368 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-2> 2022-04-16T00:06:21.204+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484057600 unmapped: 1091174400 heap: 4575232000 old mem: 2845415832 new mem: 2845415832

#19 Updated by Tobias Urdin almost 2 years ago

2022-04-16T00:06:06.526+0200 7f6997402700 -1 osd.110 1166753 heartbeat_check: no reply from <censor>:6812 osd.109 since back 2022-04-16T00:05:44.177443+0200 front 2022-04-16T00:05:44.177218+0200 (oldest deadline 2022-04-16T00:06:06.476672+0200)

#20 Updated by Radoslaw Zarzynski almost 2 years ago

Hello! Looks like it's reproducible, which is good.
Would you be able to provide logs with extra debugs as mentioned in https://tracker.ceph.com/issues/50608#note-12?

#21 Updated by Tobias Urdin almost 2 years ago

'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f33daf2a700 time 2022-05-09T12:00:34.617734+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.16/rpm/el7/BUILD/ceph-15.2.16/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())

ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x5639998ab8a9]
2: (()+0x4e0a71) [0x5639998aba71]
3: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x563999a784dc]
4: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x563999ca3d52]
5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x563999ca4113]
6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x563999cacc78]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x563999b4895a]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x563999aeeaab]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x56399998e029]
10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x563999bc9c39]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x5639999a990a]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x563999f9a7a6]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x563999f9d2f0]
14: (()+0x7ea5) [0x7f33fc926ea5]
15: (clone()+0x6d) [0x7f33fb7e99fd]
0> 2022-05-09T12:00:34.642+0200 7f33daf2a700 -1 *** Caught signal (Aborted) **
in thread 7f33daf2a700 thread_name:tp_osd_tp
ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
1: (()+0xf630) [0x7f33fc92e630]
2: (gsignal()+0x37) [0x7f33fb7213d7]
3: (abort()+0x148) [0x7f33fb722ac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x5639998ab8f8]
5: (()+0x4e0a71) [0x5639998aba71]
6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x563999a784dc]
7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x563999ca3d52]
8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x563999ca4113]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x563999cacc78]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x563999b4895a]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x563999aeeaab]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x56399998e029]
13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x563999bc9c39]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x5639999a990a]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x563999f9a7a6]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x563999f9d2f0]
17: (()+0x7ea5) [0x7f33fc926ea5]
18: (clone()+0x6d) [0x7f33fb7e99fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#22 Updated by Tobias Urdin almost 2 years ago

This always happens when you upgrade nodes; it is probably some timing issue with PGs changing or flapping primary. I never have debug logging enabled in production by default, so this is hard to catch logs for.

#23 Updated by Tobias Urdin almost 2 years ago

(Upgrading and restarting OSDs is probably more accurate wording.) For example, if I upgrade node #2, an OSD on node #1 would die with this, and so on.
