Bug #55577

OSD crashes on devicehealth scraping

Added by Emanuel Bennici almost 2 years ago. Updated 12 months ago.

Status: New
Priority: Normal
Assignee:
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

  • Linux kernel version: `5.17.5-arch1-1`
  • Ceph version: `17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)`
  • Rook version: `v1.9.2`

The OSD crashes as soon as its device health is scraped, e.g., by executing `ceph device scrape-daemon-health-metrics osd.2`.
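
For reference, the scrape can be triggered manually with the standard `ceph device` commands (`osd.2` is the daemon from this report; any OSD backing a SMART-capable device should behave the same):

```
# List devices known to the cluster and the daemons using them
ceph device ls

# Trigger a health-metrics scrape for a single OSD (crashes the OSD here)
ceph device scrape-daemon-health-metrics osd.2

# Alternatively, scrape health metrics for all daemons at once
ceph device scrape-health-metrics
```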

The OSD container gets restarted by Kubernetes because the liveness probe command (`ceph --admin-daemon /run/ceph/ceph-osd.2.asok status`) fails three times in a row.
Within the pod, I'm able to execute the `smartctl` command without any errors.
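
For reproducibility, the probe and `smartctl` checks can be run along these lines; the pod name and device path below are placeholders, not values from this report:

```
# Run the liveness-probe command by hand inside the OSD pod
# (pod name is a placeholder; adjust namespace/pod to your deployment)
kubectl -n rook-ceph exec rook-ceph-osd-2-xxxxxxxxxx-yyyyy -- \
  ceph --admin-daemon /run/ceph/ceph-osd.2.asok status

# smartctl succeeds on its own inside the pod (device path is a placeholder)
kubectl -n rook-ceph exec rook-ceph-osd-2-xxxxxxxxxx-yyyyy -- \
  smartctl -a /dev/sdb
```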

Last few lines of the OSD log:
```
debug -7> 2022-05-09T10:54:38.559+0000 7f975380e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-09T10:54:08.561261+0000)
debug -6> 2022-05-09T10:54:39.559+0000 7f975380e700 10 monclient: tick
debug -5> 2022-05-09T10:54:39.559+0000 7f975380e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-09T10:54:09.561363+0000)
debug -4> 2022-05-09T10:54:40.559+0000 7f975380e700 10 monclient: tick
debug -3> 2022-05-09T10:54:40.559+0000 7f975380e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-09T10:54:10.561470+0000)
debug -2> 2022-05-09T10:54:41.439+0000 7f9768245700 10 monclient: get_auth_request con 0x55c8063da000 auth_method 0
debug -1> 2022-05-09T10:54:41.439+0000 7f9768a46700 10 monclient: get_auth_request con 0x55c7e90f1c00 auth_method 0
debug 0> 2022-05-09T10:54:41.443+0000 7f974b7fe700 -1 *** Caught signal (Segmentation fault) **
in thread 7f974b7fe700 thread_name:safe_timer

ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)
1: /lib64/libpthread.so.0(+0x12ce0) [0x7f976ca3fce0]
2: (BlueStore::_txc_create(BlueStore::Collection*, BlueStore::OpSequencer*, std::__cxx11::list<Context*, std::allocator<Context*> >*, boost::intrusive_ptr<TrackedOp>)+0x3ae) [0x55c7e502855e]
3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x260) [0x55c7e508d3e0]
4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x55) [0x55c7e4c8d2e5]
5: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xca8) [0x55c7e4ea7f98]
6: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xc90) [0x55c7e4bf3da0]
7: (PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x120) [0x55c7e4bf6000]
8: (PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0xb99) [0x55c7e4bf83f9]
9: (HandleWatchTimeout::complete(int)+0x11b) [0x55c7e4b7a4ab]
10: (CommonSafeTimer<std::mutex>::timer_thread()+0x11a) [0x55c7e51f8afa]
11: (CommonSafeTimerThread<std::mutex>::entry()+0x11) [0x55c7e51fa121]
12: /lib64/libpthread.so.0(+0x81cf) [0x7f976ca351cf]
13: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/ 5 rgw_datacache
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
0/ 5 seastore
0/ 5 seastore_onode
0/ 5 seastore_odata
0/ 5 seastore_omap
0/ 5 seastore_tm
0/ 5 seastore_cleaner
0/ 5 seastore_lba
0/ 5 seastore_cache
0/ 5 seastore_journal
0/ 5 seastore_device
0/ 5 alienstore
1/ 5 mclock
2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7f9746ff5700 /
```

#1

Updated by Radoslaw Zarzynski over 1 year ago

  • Project changed from mgr to bluestore
  • Category deleted (devicehealth module)
#2

Updated by Deepika Upadhyay 12 months ago

I have observed this failure quite often in a cluster.

Crash ID: 2023-05-05T00:27:07.658249Z_66234a76-ca82-4b9e-a506-a8ba06be1ea8
Crash Info:
```
{
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12cf0) [0x7f7dac671cf0]",
        "(BlueStore::_txc_create(BlueStore::Collection*, BlueStore::OpSequencer*, std::__cxx11::list<Context*, std::allocator<Context*> >*, boost::intrusive_ptr<TrackedOp>)+0x40a) [0x55bc11a3ed5a]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x21e) [0x55bc11aba99e]",
        "(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x53) [0x55bc1167f203]",
        "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x7c2) [0x55bc118d0ae2]",
        "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x50d) [0x55bc115f6ebd]",
        "(PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x5a) [0x55bc115f8cfa]",
        "(PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0x87b) [0x55bc115faf1b]",
        "(HandleWatchTimeout::complete(int)+0x11a) [0x55bc1155f86a]",
        "(CommonSafeTimer<std::mutex>::timer_thread()+0x12f) [0x55bc11c245ef]",
        "(CommonSafeTimerThread<std::mutex>::entry()+0x11) [0x55bc11c256e1]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f7dac6671ca]",
        "clone()" 
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2023-05-05T00:27:07.658249Z_66234a76-ca82-4b9e-a506-a8ba06be1ea8",
    "entity_name": "osd.107",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "649d2cb3ae548d2230ae867ccc544881777512ffefead6b85524131ae2aeca00",
    "timestamp": "2023-05-05T00:27:07.658249Z",
    "utsname_hostname": "rook-ceph-osd-107-567b787788-r6cgw",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-162.6.1.el9_1.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Tue Nov 15 07:49:10 EST 2022" 
}
```
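
For anyone gathering more instances: the report above comes from the crash module, and equivalent reports can be pulled with the standard `ceph crash` commands (the ID below is the one from this comment):

```
# List recent crashes reported by the cluster
ceph crash ls

# Show the full report for the crash above
ceph crash info 2023-05-05T00:27:07.658249Z_66234a76-ca82-4b9e-a506-a8ba06be1ea8
```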

@Igor Gajowiak/@Adam, are there any workarounds we can apply, or is there anything I can do to help resolve this bug? Thanks!

@Yarrit, do you think it's related to telemetry?
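
For the record, one untested mitigation sketch while this is investigated would be to stop device-health scraping entirely; this also disables SMART-based failure prediction, so it is a trade-off, not a fix:

```
# Untested as a workaround for this crash: turn off periodic
# device-health scraping cluster-wide (also stops life-expectancy
# predictions based on SMART data)
ceph device monitoring off

# Re-enable once fixed
ceph device monitoring on
```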
