Bug #55577
openOSD crashes on devicehealth scraping
Description
- Linux kernel version: `5.17.5-arch1-1`
- Ceph version: `17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)`
- Rook version: `v1.9.2`
The OSD crashes as soon as its device health is scraped, e.g. by executing `ceph device scrape-daemon-health-metrics osd.2`.
The OSD container is then restarted by Kubernetes because the liveness probe command (`ceph --admin-daemon /run/ceph/ceph-osd.2.asok status`) fails three times in a row.
Within the pod, I'm able to execute the `smartctl` command without any errors.
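For context, the reproduction and the in-pod check were roughly the following (the namespace, deployment name, and device path below are illustrative and will differ per cluster):

```shell
# Trigger the crash by asking the mgr to scrape SMART metrics from osd.2
ceph device scrape-daemon-health-metrics osd.2

# Inside the OSD pod, smartctl itself runs without errors, e.g.:
kubectl -n rook-ceph exec deploy/rook-ceph-osd-2 -- smartctl -a /dev/sdb
```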
Last few lines of the OSD log:
```
debug -7> 2022-05-09T10:54:38.559+0000 7f975380e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-09T10:54:08.561261+0000)
debug -6> 2022-05-09T10:54:39.559+0000 7f975380e700 10 monclient: tick
debug -5> 2022-05-09T10:54:39.559+0000 7f975380e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-09T10:54:09.561363+0000)
debug -4> 2022-05-09T10:54:40.559+0000 7f975380e700 10 monclient: tick
debug -3> 2022-05-09T10:54:40.559+0000 7f975380e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-09T10:54:10.561470+0000)
debug -2> 2022-05-09T10:54:41.439+0000 7f9768245700 10 monclient: get_auth_request con 0x55c8063da000 auth_method 0
debug -1> 2022-05-09T10:54:41.439+0000 7f9768a46700 10 monclient: get_auth_request con 0x55c7e90f1c00 auth_method 0
debug 0> 2022-05-09T10:54:41.443+0000 7f974b7fe700 -1 *** Caught signal (Segmentation fault) **
in thread 7f974b7fe700 thread_name:safe_timer
ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)
1: /lib64/libpthread.so.0(+0x12ce0) [0x7f976ca3fce0]
2: (BlueStore::_txc_create(BlueStore::Collection*, BlueStore::OpSequencer*, std::__cxx11::list<Context*, std::allocator<Context*> >*, boost::intrusive_ptr<TrackedOp>)+0x3ae) [0x55c7e502855e]
3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x260) [0x55c7e508d3e0]
4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x55) [0x55c7e4c8d2e5]
5: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xca8) [0x55c7e4ea7f98]
6: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xc90) [0x55c7e4bf3da0]
7: (PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x120) [0x55c7e4bf6000]
8: (PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0xb99) [0x55c7e4bf83f9]
9: (HandleWatchTimeout::complete(int)+0x11b) [0x55c7e4b7a4ab]
10: (CommonSafeTimer<std::mutex>::timer_thread()+0x11a) [0x55c7e51f8afa]
11: (CommonSafeTimerThread<std::mutex>::entry()+0x11) [0x55c7e51fa121]
12: /lib64/libpthread.so.0(+0x81cf) [0x7f976ca351cf]
13: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/ 5 rgw_datacache
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
0/ 5 seastore
0/ 5 seastore_onode
0/ 5 seastore_odata
0/ 5 seastore_omap
0/ 5 seastore_tm
0/ 5 seastore_cleaner
0/ 5 seastore_lba
0/ 5 seastore_cache
0/ 5 seastore_journal
0/ 5 seastore_device
0/ 5 alienstore
1/ 5 mclock
 2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7f9746ff5700 /
```
Updated by Radoslaw Zarzynski over 1 year ago
- Project changed from mgr to bluestore
- Category deleted (devicehealth module)
Updated by Deepika Upadhyay 12 months ago
I have observed this failure quite often in a cluster:
Crash ID: `2023-05-05T00:27:07.658249Z_66234a76-ca82-4b9e-a506-a8ba06be1ea8`

Crash info:

```
{
  "backtrace": [
    "/lib64/libpthread.so.0(+0x12cf0) [0x7f7dac671cf0]",
    "(BlueStore::_txc_create(BlueStore::Collection*, BlueStore::OpSequencer*, std::__cxx11::list<Context*, std::allocator<Context*> >*, boost::intrusive_ptr<TrackedOp>)+0x40a) [0x55bc11a3ed5a]",
    "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x21e) [0x55bc11aba99e]",
    "(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x53) [0x55bc1167f203]",
    "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x7c2) [0x55bc118d0ae2]",
    "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x50d) [0x55bc115f6ebd]",
    "(PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x5a) [0x55bc115f8cfa]",
    "(PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0x87b) [0x55bc115faf1b]",
    "(HandleWatchTimeout::complete(int)+0x11a) [0x55bc1155f86a]",
    "(CommonSafeTimer<std::mutex>::timer_thread()+0x12f) [0x55bc11c245ef]",
    "(CommonSafeTimerThread<std::mutex>::entry()+0x11) [0x55bc11c256e1]",
    "/lib64/libpthread.so.0(+0x81ca) [0x7f7dac6671ca]",
    "clone()"
  ],
  "ceph_version": "17.2.5",
  "crash_id": "2023-05-05T00:27:07.658249Z_66234a76-ca82-4b9e-a506-a8ba06be1ea8",
  "entity_name": "osd.107",
  "os_id": "centos",
  "os_name": "CentOS Stream",
  "os_version": "8",
  "os_version_id": "8",
  "process_name": "ceph-osd",
  "stack_sig": "649d2cb3ae548d2230ae867ccc544881777512ffefead6b85524131ae2aeca00",
  "timestamp": "2023-05-05T00:27:07.658249Z",
  "utsname_hostname": "rook-ceph-osd-107-567b787788-r6cgw",
  "utsname_machine": "x86_64",
  "utsname_release": "5.14.0-162.6.1.el9_1.x86_64",
  "utsname_sysname": "Linux",
  "utsname_version": "#1 SMP PREEMPT_DYNAMIC Tue Nov 15 07:49:10 EST 2022"
}
```
@Igor Gajowiak/@Adam, are there any workarounds we can apply, or anything I can do to help resolve this bug? Thanks!
@Yarrit, do you think it's related to telemetry?
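One possible mitigation until this is fixed (untested against this bug, and it sacrifices SMART-based failure prediction) is to stop the mgr devicehealth module from scraping at all:

```shell
# Disable periodic device health scraping by the mgr
ceph device monitoring off

# Re-enable once the underlying bug is resolved
ceph device monitoring on
```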