Bug #50608
ceph_assert(is_primary()) in PrimaryLogPG::on_local_recover
Description
This happened on a cluster running Nautilus 14.2.18/14.2.19: during the upgrade to Octopus 15.2.11, some 3-4 OSDs crashed with this assert. If I recall correctly, some of the OSDs
that crashed were already on 15.2.11.
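For readers unfamiliar with the failure mode: the assert fires inside PrimaryLogPG::on_local_recover when a PG processes a recovery push while it no longer believes it is the primary. A minimal Python model of the suspected sequence (an illustrative toy, not Ceph code; whether the real OSD can interleave these steps is exactly the open question in this ticket):

```python
class ToyPG:
    """Toy stand-in for a placement group's role state (not Ceph code)."""

    def __init__(self, primary: bool):
        self.primary = primary

    def is_primary(self) -> bool:
        return self.primary

    def on_local_recover(self, obj: str) -> str:
        # Mirrors the check at PrimaryLogPG.cc:384 that aborts the OSD.
        assert self.is_primary(), "FAILED ceph_assert(is_primary())"
        return f"recovered {obj}"


pg = ToyPG(primary=True)
queued_push = "object_a"  # push op accepted while this OSD was still primary

# A new OSDMap demotes the OSD before the queued op is processed
# (the suspected sequence behind this report).
pg.primary = False

try:
    pg.on_local_recover(queued_push)
except AssertionError as exc:
    print(exc)  # -> FAILED ceph_assert(is_primary())
```

In the real OSD the failed assert aborts the process, which matches the "Caught signal (Aborted)" lines in the attached traces.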
History
#1 Updated by Tobias Urdin almost 3 years ago
- File crash1.txt View added
#2 Updated by Tobias Urdin almost 3 years ago
Longer debug_osd output in crash1.txt file.
-1> 2021-04-29T13:51:57.756+0200 7fa2edae2700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.11/rpm/el7/BUILD/ceph-15.2.11/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fa2edae2700 time 2021-04-29T13:51:57.751899+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.11/rpm/el7/BUILD/ceph-15.2.11/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x55a83c7ed07f]
2: (()+0x4de247) [0x55a83c7ed247]
3: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a83c9bb4cc]
4: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a83cbe6dd2]
5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a83cbe7193]
6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a83cbefcd8]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a83ca8b94a]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a83ca31a2b]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a83c8d1269]
10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a83cb0cd29]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a83c8ecaca]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a83ced7086]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a83ced9bd0]
14: (()+0x7ea5) [0x7fa30fcdfea5]
15: (clone()+0x6d) [0x7fa30eba29fd]
0> 2021-04-29T13:51:57.761+0200 7fa2edae2700 -1 *** Caught signal (Aborted) **
in thread 7fa2edae2700 thread_name:tp_osd_tp
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
1: (()+0xf630) [0x7fa30fce7630]
2: (gsignal()+0x37) [0x7fa30eada3d7]
3: (abort()+0x148) [0x7fa30eadbac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55a83c7ed0ce]
5: (()+0x4de247) [0x55a83c7ed247]
6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a83c9bb4cc]
7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a83cbe6dd2]
8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a83cbe7193]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a83cbefcd8]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a83ca8b94a]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a83ca31a2b]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a83c8d1269]
13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a83cb0cd29]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a83c8ecaca]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a83ced7086]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a83ced9bd0]
17: (()+0x7ea5) [0x7fa30fcdfea5]
18: (clone()+0x6d) [0x7fa30eba29fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 0 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
0/ 0 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 0 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_rwl
0/ 0 journaler
0/ 0 objectcacher
0/ 5 immutable_obj_cache
0/ 0 client
30/30 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
0/ 0 crypto
0/ 0 finisher
0/ 0 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/ 5 rgw_sync
0/ 0 civetweb
0/ 0 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
0/ 0 compressor
0/ 0 bluestore
0/ 0 bluefs
0/ 0 bdev
0/ 0 kstore
0/ 0 rocksdb
0/ 0 leveldb
0/ 0 memdb
0/ 0 fuse
0/ 0 mgr
0/ 0 mgrc
0/ 0 dpdk
0/ 0 eventtrace
1/ 5 prioritycache
0/ 5 test
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7fa2ecae0700 / tp_osd_tp
7fa2ed2e1700 / tp_osd_tp
7fa2edae2700 / tp_osd_tp
7fa2ee2e3700 / tp_osd_tp
7fa2eeae4700 / tp_osd_tp
7fa2ef2e5700 / osd_srv_agent
7fa2f7af6700 / ms_dispatch
7fa2f82f7700 / safe_timer
7fa2f9afa700 / cfin
7fa2fe303700 / ms_dispatch
7fa2ff305700 / bstore_mempool
7fa306b14700 / safe_timer
7fa307b16700 / safe_timer
7fa30b31d700 / msgr-worker-2
7fa30bb1e700 / msgr-worker-1
7fa30c31f700 / msgr-worker-0
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.47.log
--- end dump of recent events ---
#3 Updated by Tobias Urdin almost 3 years ago
#4 Updated by Neha Ojha almost 3 years ago
- Priority changed from Normal to Urgent
#5 Updated by Neha Ojha almost 3 years ago
The crashed OSD was running 15.2.11, but do you happen to know what version osd.59 (the primary for pg 6.7a) was running at the time of the crash? I will look into this.
#6 Updated by Tobias Urdin almost 3 years ago
Purely based on the numbering of the OSDs, I know for a fact that osd.47 was upgraded before osd.59, so osd.59 would have been running 14.2.18, but I have no way to guarantee that.
I know that we saw the same crash for OSDs running both Nautilus and Octopus, so it's probable that the bug spans both versions.
#7 Updated by Neha Ojha almost 3 years ago
Do the OSDs hitting this assert come up fine on restarting? or are they repeatedly hitting this assert?
#8 Updated by Tobias Urdin almost 3 years ago
Neha Ojha wrote:
Do the OSDs hitting this assert come up fine on restarting? or are they repeatedly hitting this assert?
It starts up fine after the assert, I haven't seen anything not recovering or staying down due to this issue.
#9 Updated by Tobias Urdin over 2 years ago
This is still an issue: it repeatedly hits this assert during recovery while upgrading the cluster, where some already-upgraded ceph-osd daemons are on 15.2.15 while the crashing one is still on 15.2.11 and dies every time it tries to recover.
-1> 2021-11-15T14:10:23.081+0100 7f8ff2419700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.11/rpm/el7/BUILD/ceph-15.2.11/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f8ff2419700 time 2021-11-15T14:10:23.075731+0100
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.11/rpm/el7/BUILD/ceph-15.2.11/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x55e55b30107f]
2: (()+0x4de247) [0x55e55b301247]
3: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55e55b4cf4cc]
4: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55e55b6fadd2]
5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55e55b6fb193]
6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55e55b703cd8]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55e55b59f94a]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55e55b545a2b]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55e55b3e5269]
10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55e55b620d29]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55e55b400aca]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e55b9eb086]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e55b9edbd0]
14: (()+0x7ea5) [0x7f9013614ea5]
15: (clone()+0x6d) [0x7f90124d79fd]
0> 2021-11-15T14:10:23.087+0100 7f8ff2419700 -1 *** Caught signal (Aborted) **
in thread 7f8ff2419700 thread_name:tp_osd_tp
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
1: (()+0xf630) [0x7f901361c630]
2: (gsignal()+0x37) [0x7f901240f3d7]
3: (abort()+0x148) [0x7f9012410ac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55e55b3010ce]
5: (()+0x4de247) [0x55e55b301247]
6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55e55b4cf4cc]
7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55e55b6fadd2]
8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55e55b6fb193]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55e55b703cd8]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55e55b59f94a]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55e55b545a2b]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55e55b3e5269]
13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55e55b620d29]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55e55b400aca]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55e55b9eb086]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e55b9edbd0]
17: (()+0x7ea5) [0x7f9013614ea5]
18: (clone()+0x6d) [0x7f90124d79fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 0 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
0/ 0 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 0 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_rwl
0/ 0 journaler
0/ 0 objectcacher
0/ 5 immutable_obj_cache
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
0/ 0 crypto
0/ 0 finisher
0/ 0 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/ 5 rgw_sync
0/ 0 civetweb
0/ 0 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
0/ 0 compressor
0/ 0 bluestore
0/ 0 bluefs
0/ 0 bdev
0/ 0 kstore
0/ 0 rocksdb
0/ 0 leveldb
0/ 0 memdb
0/ 0 fuse
0/ 0 mgr
0/ 0 mgrc
0/ 0 dpdk
0/ 0 eventtrace
1/ 5 prioritycache
0/ 5 test
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7f8ff2419700 / tp_osd_tp
7f9002c3a700 / bstore_mempool
7f9008c46700 / fn_anonymous
7f900cc4e700 / signal_handler
7f900e451700 / service
7f900ec52700 / msgr-worker-2
7f900f453700 / msgr-worker-1
7f900fc54700 / msgr-worker-0
7f90155a9bc0 / ceph-osd
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.107.log
--- end dump of recent events ---
#10 Updated by Tobias Urdin over 2 years ago
PG_STAT  STATE                                           UP            UP_PRIMARY  ACTING    ACTING_PRIMARY
6.0      active+recovering+undersized+degraded+remapped  [107,90,119]  107         [90,119]  90
osd.90 = Octopus 15.2.11
osd.107 = Octopus 15.2.11
osd.119 = Octopus 15.2.15
And osd.107 crashes all the time, so the PG never recovers; after a while during recovery, osd.107 just dies on the same assert.
This seems to only happen during upgrades, since that's when recovery normally occurs (when restarting OSDs).
With debug_osd=5/5, osd.107 seems to be doing a lot of heartbeat osd_stat reporting while trying to catch up.
2021-11-15T14:55:23.419+0100 7f3eb4c69700 5 osd.107 1105828 heartbeat osd_stat(store_statfs(0x678ee000000/0x40000000/0x74e823fe000, data 0xa1d8e91ff4/0xcdd43f0000, compress 0x0/0x0/0x0, omap 0x271fc5a, meta 0x1b17de3a6), peers [3,10,17,24,31,40,42,51,56,77,78,86,88,89,90,98,106,108,114,117,119,120,123,124,126,128,130,195,197,199] op hist [0,0,0,0,2,5,8,3,6,18,17,4,6])
2021-11-15T14:56:12.944+0100 7f3ec53dc700 3 osd.107 1105831 handle_osd_map epochs [1105832,1105832], i have 1105831, src has [1105109,1105832]
For osd.90 it does something similar.
2021-11-15T14:56:03.930+0100 7f94f68b0700 0 log_channel(cluster) log [DBG] : purged_snaps scrub starts
2021-11-15T14:56:03.964+0100 7f94f68b0700 0 log_channel(cluster) log [DBG] : purged_snaps scrub found 6 strays
2021-11-15T14:58:12.629+0100 7f94e6890700 3 osd.90 1105835 handle_osd_map epochs [1105836,1105836], i have 1105835, src has [1105109,1105836]
2021-11-15T14:58:12.694+0100 7f94e6890700 3 osd.90 1105835 handle_osd_map epochs [1105836,1105836], i have 1105836, src has [1105109,1105836]
2021-11-15T14:58:12.694+0100 7f94e6890700 3 osd.90 1105835 handle_osd_map epochs [1105836,1105836], i have 1105836, src has [1105109,1105836]
2021-11-15T14:58:14.855+0100 7f94db079700 5 osd.90 1105836 heartbeat osd_stat(store_statfs(0x64e5e290000/0x40000000/0x74e823fe000, data 0xd3c8fd481b/0xf864150000, compress 0x0/0x0/0x0, omap 0x1a75147, meta 0x11b088eb9), peers [0,7,14,21,28,39,41,43,52,59,64,65,68,74,75,78,80,81,89,91,104,105,107,115,119,121,147,148,155] op hist [0,0,0,1,0,0,1,2,4,16,18,7,3])
#11 Updated by Tobias Urdin over 2 years ago
- File osd.90.txt View added
- File osd.119.txt View added
After upgrading osd.107 to 15.2.15 and waiting 2 hours for it to recover 3,000 objects in a single PG, it crashed again with 5 objects left.
PG 6.0 is the one that is crashing all the time.
PG_STAT  STATE                                              UP            UP_PRIMARY  ACTING    ACTING_PRIMARY
6.0      active+recovery_wait+undersized+degraded+remapped  [107,90,119]  107         [90,119]  90
6.a      active+recovering+undersized+degraded+remapped     [58,50,107]   58          [58,50]   58
6.5      active+recovering+undersized+degraded+remapped     [107,128,89]  107         [89,128]  89
osd.107
-2> 2021-11-15T16:16:46.253+0100 7f3eb746e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f3eb746e700 time 2021-11-15T16:16:46.244742+0100
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())
ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x55a8cc3e8419]
2: (()+0x4df5e1) [0x55a8cc3e85e1]
3: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a8cc5b6ebc]
4: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a8cc7e2622]
5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a8cc7e29e3]
6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a8cc7eb548]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a8cc68732a]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a8cc62d47b]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a8cc4ccb69]
10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a8cc708609]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a8cc4e844a]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a8ccad4206]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a8ccad6d50]
14: (()+0x7ea5) [0x7f3ed8669ea5]
15: (clone()+0x6d) [0x7f3ed752c9fd]
-1> 2021-11-15T16:16:46.258+0100 7f3ec7c8f700 5 prioritycache tune_memory target: 4294967296 mapped: 835739648 unmapped: 135315456 heap: 971055104 old mem: 2845415832 new mem: 2845415832
0> 2021-11-15T16:16:46.264+0100 7f3eb746e700 -1 *** Caught signal (Aborted) **
in thread 7f3eb746e700 thread_name:tp_osd_tp
ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
1: (()+0xf630) [0x7f3ed8671630]
2: (gsignal()+0x37) [0x7f3ed74643d7]
3: (abort()+0x148) [0x7f3ed7465ac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55a8cc3e8468]
5: (()+0x4df5e1) [0x55a8cc3e85e1]
6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a8cc5b6ebc]
7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a8cc7e2622]
8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a8cc7e29e3]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a8cc7eb548]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a8cc68732a]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a8cc62d47b]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a8cc4ccb69]
13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a8cc708609]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a8cc4e844a]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a8ccad4206]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a8ccad6d50]
17: (()+0x7ea5) [0x7f3ed8669ea5]
18: (clone()+0x6d) [0x7f3ed752c9fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 0 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
0/ 0 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 0 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_rwl
0/ 0 journaler
0/ 0 objectcacher
0/ 5 immutable_obj_cache
0/ 0 client
5/ 5 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
0/ 0 crypto
0/ 0 finisher
0/ 0 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/ 5 rgw_sync
0/ 0 civetweb
0/ 0 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
0/ 0 compressor
5/ 5 bluestore
5/ 5 bluefs
0/ 0 bdev
0/ 0 kstore
0/ 0 rocksdb
0/ 0 leveldb
0/ 0 memdb
0/ 0 fuse
0/ 0 mgr
0/ 0 mgrc
0/ 0 dpdk
0/ 0 eventtrace
1/ 5 prioritycache
0/ 5 test
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7f3eb4c69700 / osd_srv_heartbt
7f3eb746e700 / tp_osd_tp
7f3ec0480700 / ms_dispatch
7f3ec2c85700 / bstore_kv_sync
7f3ec53dc700 / ms_dispatch
7f3ec7c8f700 / bstore_mempool
7f3ed04a0700 / safe_timer
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.107.log
--- end dump of recent events ---
The osd.119 and osd.90 logs are attached.
#12 Updated by Neha Ojha over 2 years ago
Tobias Urdin wrote:
After upgrading osd.107 to 15.5.15 and waiting 2 hours for it to recover 3,000 objects in a single PG it crashed again with 5 objects left.
PG 6.0 is the one that is crashing all the time.
Can you set debug_osd=20 and debug_ms=1 for all the replicas of this PG and share the logs using ceph-post-file?
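For anyone following along, one way to apply this is a ceph.conf fragment on the hosts carrying the replicas of the affected PG (the same levels can usually be set at runtime with `ceph tell osd.<id> config set debug_osd 20`; exact workflow depends on your deployment):

```ini
# Raise debug verbosity for the OSDs replicating the crashing PG
# (osd.90, osd.107, osd.119 per comment #10); revert after capturing logs,
# since debug_osd=20 log volume is substantial.
[osd]
    debug osd = 20/20
    debug ms  = 1/1
```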
[...]
osd.107
[...] Then osd.119 and osd.90 logs attached.
#13 Updated by Neha Ojha about 2 years ago
- Priority changed from Urgent to High
#14 Updated by Neha Ojha about 2 years ago
- Status changed from New to Need More Info
#15 Updated by Radoslaw Zarzynski almost 2 years ago
- Priority changed from High to Normal
Lowering the priority further as there is no new info since the last time.
#16 Updated by Tobias Urdin almost 2 years ago
An OSD crashed with this again:
{
"archived": "2022-04-15 23:07:21.580173",
"assert_condition": "is_primary()",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc",
"assert_func": "virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)",
"assert_line": 384,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fedd1a4b700 time 2022-04-16T00:06:21.303068+0200\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"(()+0xf630) [0x7fedf4451630]",
"(gsignal()+0x37) [0x7fedf32443d7]",
"(abort()+0x148) [0x7fedf3245ac8]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55a153bfa468]",
"(()+0x4df5e1) [0x55a153bfa5e1]",
"(PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a153dc8ebc]",
"(ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a153ff4622]",
"(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a153ff49e3]",
"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a153ffd548]",
"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a153e9932a]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a153e3f47b]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a153cdeb69]",
"(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a153f1a609]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a153cfa44a]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a1542e6206]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a1542e8d50]",
"(()+0x7ea5) [0x7fedf4449ea5]",
"(clone()+0x6d) [0x7fedf330c9fd]"
],
"ceph_version": "15.2.15",
"crash_id": "2022-04-15T22:06:21.474052Z_810c9df1-581d-4ff8-8b2f-7c8e03e31d51",
"entity_name": "osd.109",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "7 (Core)",
"os_version_id": "7",
"process_name": "ceph-osd",
"stack_sig": "a1142657badf06abb0885b2bfc25c7cf0719904822ec7d99d19e413363b1ff7f",
"timestamp": "2022-04-15T22:06:21.474052Z",
"utsname_hostname": "censored",
"utsname_machine": "x86_64",
"utsname_release": "3.10.0-1160.24.1.el7.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Thu Apr 8 19:51:47 UTC 2021"
}
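The JSON above is the report kept by Ceph's crash module (as shown by `ceph crash info <id>`). When triaging many of these, a small script (a hypothetical helper, not part of Ceph) can summarize the key fields and group reports by `stack_sig`, since reports sharing a `stack_sig` come from the same crash site:

```python
import json

# A single report as returned by `ceph crash info <id>`, trimmed to the
# fields used below (values taken from the report in this comment).
raw = """
{
  "assert_condition": "is_primary()",
  "ceph_version": "15.2.15",
  "entity_name": "osd.109",
  "stack_sig": "a1142657badf06abb0885b2bfc25c7cf0719904822ec7d99d19e413363b1ff7f",
  "timestamp": "2022-04-15T22:06:21.474052Z"
}
"""

def summarize(crash: dict) -> str:
    """One-line summary of a crash-module report."""
    return (f"{crash['entity_name']} ({crash['ceph_version']}) hit "
            f"ceph_assert({crash['assert_condition']}) at {crash['timestamp']}")

report = json.loads(raw)
print(summarize(report))
# -> osd.109 (15.2.15) hit ceph_assert(is_primary()) at 2022-04-15T22:06:21.474052Z

# Grouping on stack_sig shows whether every occurrence is this one assert.
by_sig: dict[str, list[str]] = {}
by_sig.setdefault(report["stack_sig"], []).append(report["entity_name"])
```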
#17 Updated by Tobias Urdin almost 2 years ago
2022-04-16T00:06:08.540+0200 7fedde264700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.109 down, but it is still running
2022-04-16T00:06:08.540+0200 7fedde264700 0 log_channel(cluster) log [DBG] : map e1166755 wrongly marked me down at e1166754
2022-04-16T00:06:11.018+0200 7fede9a7b700 -1 osd.109 1166756 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2022-04-16T00:06:21.460+0200 7fedd1a4b700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fedd1a4b700 time 2022-04-16T00:06:21.303068+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.15/rpm/el7/BUILD/ceph-15.2.15/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())
ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x55a153bfa419]
2: (()+0x4df5e1) [0x55a153bfa5e1]
3: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a153dc8ebc]
4: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a153ff4622]
5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a153ff49e3]
6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a153ffd548]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a153e9932a]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a153e3f47b]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a153cdeb69]
10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a153f1a609]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a153cfa44a]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a1542e6206]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a1542e8d50]
14: (()+0x7ea5) [0x7fedf4449ea5]
15: (clone()+0x6d) [0x7fedf330c9fd]
2022-04-16T00:06:21.477+0200 7fedd1a4b700 -1 *** Caught signal (Aborted) **
in thread 7fedd1a4b700 thread_name:tp_osd_tp
ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)
1: (()+0xf630) [0x7fedf4451630]
2: (gsignal()+0x37) [0x7fedf32443d7]
3: (abort()+0x148) [0x7fedf3245ac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x55a153bfa468]
5: (()+0x4df5e1) [0x55a153bfa5e1]
6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x55a153dc8ebc]
7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x55a153ff4622]
8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x55a153ff49e3]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x55a153ffd548]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55a153e9932a]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x55a153e3f47b]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x55a153cdeb69]
13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a153f1a609]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x55a153cfa44a]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55a1542e6206]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a1542e8d50]
17: (()+0x7ea5) [0x7fedf4449ea5]
18: (clone()+0x6d) [0x7fedf330c9fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
#18 Updated by Tobias Urdin almost 2 years ago
There are a lot of lines like this before the crash line in the log file:
-24> 2022-04-16T00:06:02.105+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484655616 unmapped: 1090576384 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-23> 2022-04-16T00:06:03.109+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484442624 unmapped: 1090789376 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-22> 2022-04-16T00:06:04.113+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3485425664 unmapped: 1089806336 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-21> 2022-04-16T00:06:05.117+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3487580160 unmapped: 1087651840 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-20> 2022-04-16T00:06:06.121+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3492814848 unmapped: 1082417152 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-19> 2022-04-16T00:06:07.125+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3496845312 unmapped: 1078386688 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-18> 2022-04-16T00:06:08.129+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3502202880 unmapped: 1073029120 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-17> 2022-04-16T00:06:08.540+0200 7fedde264700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.109 down, but it is still running
-16> 2022-04-16T00:06:08.540+0200 7fedde264700 0 log_channel(cluster) log [DBG] : map e1166755 wrongly marked me down at e1166754
-15> 2022-04-16T00:06:09.136+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3499237376 unmapped: 1075994624 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-14> 2022-04-16T00:06:10.140+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3496075264 unmapped: 1079156736 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-13> 2022-04-16T00:06:11.018+0200 7fede9a7b700 -1 osd.109 1166756 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
-12> 2022-04-16T00:06:11.142+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3496026112 unmapped: 1079205888 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-11> 2022-04-16T00:06:12.168+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3496411136 unmapped: 1078820864 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-10> 2022-04-16T00:06:13.173+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3491913728 unmapped: 1083318272 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-9> 2022-04-16T00:06:14.179+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3490291712 unmapped: 1084940288 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-8> 2022-04-16T00:06:15.183+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484180480 unmapped: 1091051520 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-7> 2022-04-16T00:06:16.187+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484106752 unmapped: 1091125248 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-6> 2022-04-16T00:06:17.190+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484155904 unmapped: 1091076096 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-5> 2022-04-16T00:06:18.193+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484057600 unmapped: 1091174400 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-4> 2022-04-16T00:06:19.197+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484286976 unmapped: 1090945024 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-3> 2022-04-16T00:06:20.200+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484229632 unmapped: 1091002368 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
-2> 2022-04-16T00:06:21.204+0200 7fede3a6f700 5 prioritycache tune_memory target: 4294967296 mapped: 3484057600 unmapped: 1091174400 heap: 4575232000 old mem: 2845415832 new mem: 2845415832
#19 Updated by Tobias Urdin almost 2 years ago
2022-04-16T00:06:06.526+0200 7f6997402700 -1 osd.110 1166753 heartbeat_check: no reply from <censor>:6812 osd.109 since back 2022-04-16T00:05:44.177443+0200 front 2022-04-16T00:05:44.177218+0200 (oldest deadline 2022-04-16T00:06:06.476672+0200)
#20 Updated by Radoslaw Zarzynski almost 2 years ago
Hello! It looks like it's reproducible, which is good.
Would you be able to provide logs with extra debugs as mentioned in https://tracker.ceph.com/issues/50608#note-12?
#21 Updated by Tobias Urdin almost 2 years ago
'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f33daf2a700 time 2022-05-09T12:00:34.617734+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.16/rpm/el7/BUILD/ceph-15.2.16/src/osd/PrimaryLogPG.cc: 384: FAILED ceph_assert(is_primary())
ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x5639998ab8a9]
2: (()+0x4e0a71) [0x5639998aba71]
3: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x563999a784dc]
4: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x563999ca3d52]
5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x563999ca4113]
6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x563999cacc78]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x563999b4895a]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x563999aeeaab]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x56399998e029]
10: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x563999bc9c39]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x5639999a990a]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x563999f9a7a6]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x563999f9d2f0]
14: (()+0x7ea5) [0x7f33fc926ea5]
15: (clone()+0x6d) [0x7f33fb7e99fd]
0> 2022-05-09T12:00:34.642+0200 7f33daf2a700 -1 *** Caught signal (Aborted) **
in thread 7f33daf2a700 thread_name:tp_osd_tp
ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
1: (()+0xf630) [0x7f33fc92e630]
2: (gsignal()+0x37) [0x7f33fb7213d7]
3: (abort()+0x148) [0x7f33fb722ac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x5639998ab8f8]
5: (()+0x4e0a71) [0x5639998aba71]
6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ceph::os::Transaction*)+0x30c) [0x563999a784dc]
7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x3a2) [0x563999ca3d52]
8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x243) [0x563999ca4113]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x298) [0x563999cacc78]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x563999b4895a]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5cb) [0x563999aeeaab]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2f9) [0x56399998e029]
13: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x563999bc9c39]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x143a) [0x5639999a990a]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x563999f9a7a6]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x563999f9d2f0]
17: (()+0x7ea5) [0x7f33fc926ea5]
18: (clone()+0x6d) [0x7f33fb7e99fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#22 Updated by Tobias Urdin almost 2 years ago
This always happens when upgrading nodes, so it is probably a timing issue with PGs changing or flapping primary. I never have debug logging enabled in production by default, so this is hard to catch logs for.
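The suspected race above can be sketched roughly as follows. This is a hypothetical model, not Ceph's actual code: a push op queued before an OSDMap interval change is handled after the PG's role has flipped, tripping a role assertion such as `ceph_assert(is_primary())`. The epoch guard shown is illustrative; in Ceph the analogous check compares an op's epoch against the PG's last peering reset and discards stale ops.

```python
# Hypothetical sketch of an op racing a PG interval change; names are
# illustrative, not Ceph internals.
class PG:
    def __init__(self, primary):
        self.primary = primary
        self.last_interval_change = 0

    def handle_op(self, op):
        # Guard: an op sent before the most recent interval change reflects
        # a stale view of the acting set, so drop it rather than process it.
        if op["epoch"] < self.last_interval_change:
            return "discarded stale op"
        # Without the guard, this role check (the analogue of the failing
        # ceph_assert) can fire for ops that raced an interval change.
        assert self.primary == op["receiver_is_primary"], "role mismatch"
        return "applied op"

pg = PG(primary=False)                            # PG starts as a replica
op = {"epoch": 5, "receiver_is_primary": False}   # push sent to a replica

# OSD is "wrongly marked me down" and rejoins: new interval, role flips.
pg.last_interval_change = 6
pg.primary = True

print(pg.handle_op(op))   # stale push is dropped instead of asserting
```

If something like this is the mechanism, it would explain why the crash only shows up around upgrades and map churn, when intervals change while recovery ops are still in flight.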