Bug #47740

closed

OSD crash when increasing pg_num

Added by 玮文 胡 over 3 years ago. Updated over 1 year ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Following #47738, after pg_num was increased from 32 to 256, one of our OSDs crashed.

Here is the output of 'ceph crash info':

{
    "archived": "2020-10-02 14:48:31.821600",
    "backtrace": [
        "(()+0x12dd0) [0x7f31df550dd0]",
        "(()+0x118ed) [0x7f31df54f8ed]",
        "(__pthread_mutex_lock()+0x124) [0x7f31df548bd4]",
        "(BlueStore::SharedBlob::put()+0x70) [0x55841bd31b70]",
        "(BlueStore::Blob::put()+0x65) [0x55841bdddc75]",
        "(BlueStore::Extent::~Extent()+0x39) [0x55841bdddcc9]",
        "(BlueStore::Onode::put()+0x2bb) [0x55841bd3291b]",
        "(std::_Hashtable<ghobject_t, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<ghobject_t>, std::hash<ghobject_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x66) [0x55841bddf596]",
        "(BlueStore::OnodeSpace::_remove(ghobject_t const&)+0x29b) [0x55841bd3419b]",
        "(LruOnodeCacheShard::_trim_to(unsigned long)+0xf1) [0x55841bde46e1]",
        "(BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>&)+0x49d) [0x55841bd348cd]",
        "(BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x459) [0x55841bd41b79]",
        "(BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1124) [0x55841bda39a4]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x407) [0x55841bda5f47]",
        "(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x55841b917745]",
        "(OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0xf3) [0x55841b8b2ef3]",
        "(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2d8) [0x55841b8e1968]",
        "(ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55841bb131c6]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x55841b8d45df]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55841bf0d324]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55841bf0ff84]",
        "(()+0x82de) [0x7f31df5462de]",
        "(clone()+0x43) [0x7f31de27de83]" 
    ],
    "ceph_version": "15.2.5",
    "crash_id": "2020-10-02T11:43:42.992341Z_248ea0f7-af3c-46b9-91e6-166aaa69f3bd",
    "entity_name": "osd.2",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8 (Core)",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "a94ab97e65f36fcb91a737cdd3e5733fe454df9d658c5d5183f0d434291dbd90",
    "timestamp": "2020-10-02T11:43:42.992341Z",
    "utsname_hostname": "169svr",
    "utsname_machine": "x86_64",
    "utsname_release": "4.15.0-112-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020" 
}

This seems to be a deadlock. A log line from journalctl:

debug 2020-10-02T11:43:42.962+0000 7f31d855a700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f31bdcb5700' had suicide timed out after 150

and a lot of

debug 2020-10-02T11:43:41.034+0000 7f31d855a700  1 osd.2 2999 is_healthy false -- internal heartbeat failed
debug 2020-10-02T11:43:41.034+0000 7f31d855a700  1 osd.2 2999 not healthy; waiting to boot
debug 2020-10-02T11:43:41.994+0000 7f31d855a700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f31bdcb5700' had timed out after 15

Let me know if you need more information.
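(For reference, the crash report above was pulled with the mgr crash module, roughly with the following generic commands; the crash ID is the one shown in the report itself:)

# list crashes known to the cluster
ceph crash ls
# print the full report for one of them
ceph crash info 2020-10-02T11:43:42.992341Z_248ea0f7-af3c-46b9-91e6-166aaa69f3bd
# mark it as examined once handled (matches the "archived" field above)
ceph crash archive 2020-10-02T11:43:42.992341Z_248ea0f7-af3c-46b9-91e6-166aaa69f3bd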


Files

osd2.log.gz (46.3 KB) - logs from journalctl - 玮文 胡, 10/04/2020 08:38 AM

Related issues: 1 (0 open, 1 closed)

Related to bluestore - Bug #53002: crash BlueStore::Onode::put from BlueStore::TransContext::~TransContext (Duplicate) - Igor Fedotov

Actions #1

Updated by Neha Ojha over 3 years ago

  • Project changed from RADOS to bluestore

Igor, does this look like one of the other _trim_to crashes we had seen in our teuthology runs, which are now fixed by your patch?

Actions #2

Updated by Igor Fedotov over 3 years ago

Neha Ojha wrote:

Igor, does this look like one of the other _trim_to crashes we had seen in our teuthology runs, which are now fixed by your patch?

I don't think so; this one looks different.
The attached log contains a bunch of "OSD::osd_op_tp thread ... had timed out after 15" warnings, ending in an OSD suicide:
...
Oct 02 19:43:41 169svr bash36919: debug 2020-10-02T11:43:41.994+0000 7f31d855a700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f31bdcb5700' had timed out after 15
Oct 02 19:43:41 169svr bash36919: debug 2020-10-02T11:43:41.994+0000 7f31d855a700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f31bd4b4700' had timed out after 15
Oct 02 19:43:41 169svr bash36919: debug 2020-10-02T11:43:41.994+0000 7f31d855a700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f31be4b6700' had timed out after 15
Oct 02 19:43:41 169svr bash36919: debug 2020-10-02T11:43:41.994+0000 7f31d855a700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f31bf4b8700' had timed out after 15
Oct 02 19:43:41 169svr bash36919: debug 2020-10-02T11:43:41.994+0000 7f31d855a700 1 osd.2 2999 is_healthy false -- internal heartbeat failed
Oct 02 19:43:41 169svr bash36919: debug 2020-10-02T11:43:41.994+0000 7f31d855a700 1 osd.2 2999 not healthy; waiting to boot
Oct 02 19:43:42 169svr bash36919: debug 2020-10-02T11:43:42.962+0000 7f31d855a700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f31bdcb5700' had timed out after 15
Oct 02 19:43:42 169svr bash36919: *** Caught signal (Aborted) **
Oct 02 19:43:42 169svr bash36919: in thread 7f31bdcb5700 thread_name:tp_osd_tp
Oct 02 19:43:42 169svr bash36919: debug 2020-10-02T11:43:42.962+0000 7f31d855a700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f31bdcb5700' had suicide timed out after 150
Oct 02 19:43:42 169svr bash36919: ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)

Although there is no definite proof, I presume this is rather caused by PG (or other massive data) removals, which result from the data migration after the pg_num increase. Massive data removal is what caused the suicides the last few times I saw them, e.g.:
https://tracker.ceph.com/issues/45994
https://tracker.ceph.com/issues/40741

But something different, e.g. too high OSD load during the migration, might be the root cause as well...

Some questions:
1) Are osd_op_tp timeouts still observed in the OSD log after its restart? If so, could you please raise debug_bluestore to 20 for a short period of time (1-2 minutes) and collect the OSD log? (A command sketch follows these questions.)
2) What underlying devices/volumes back this OSD? HDD and/or SSD? Standalone DB or not? If there is one, how large is it, and how large are the spillovers, if any?
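For (1), a minimal sketch of what I mean, assuming the daemon is osd.2 as in the report and that its log goes to the journal under a ceph-osd@2 unit (adjust both to the actual deployment):

# raise BlueStore debug logging for this OSD only
ceph config set osd.2 debug_bluestore 20
# ...let it run under load for 1-2 minutes...
# grab the recent log; the unit name depends on how the OSD is deployed
journalctl -u ceph-osd@2 --since "5 minutes ago" > osd2-debug.log
# drop back to the default level
ceph config rm osd.2 debug_bluestore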

Generally, we might want to try this pg_num resize scenario ourselves and learn what load it puts on an OSD: how massive the data removals and/or other data-access ops are.

Actions #3

Updated by 玮文 胡 over 3 years ago

Igor Fedotov wrote:

1) Are osd_op_tp timeouts still observed in the OSD log after its restart? If so, could you please raise debug_bluestore to 20 for a short period of time (1-2 minutes) and collect the OSD log?

No. Up to now, no such log has been observed again.

2) What underlying devices/volumes back this OSD? HDD and/or SSD? Standalone DB or not? If there is one, how large is it, and how large are the spillovers, if any?

The underlying device is a RAID0 array of 2 HDDs, with no standalone DB.

I think it could be too high an OSD load during the migration. I raised pg_num from 32 to 64, then 128, then 256 within about a minute. Another OSD was marked down at almost the same time, and I restarted it manually. But is it expected to crash under high load?
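Roughly, that was the equivalent of the following, with <pool> standing in for the actual pool name:

ceph osd pool set <pool> pg_num 64
ceph osd pool set <pool> pg_num 128
ceph osd pool set <pool> pg_num 256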

Actions #4

Updated by Igor Fedotov over 3 years ago

玮文 胡 wrote:

Igor Fedotov wrote:

1) Are osd_op_tp timeouts still observed in the OSD log after its restart? If so, could you please raise debug_bluestore to 20 for a short period of time (1-2 minutes) and collect the OSD log?

No. Up to now, no such log has been observed again.

2) What underlying devices/volumes back this OSD? HDD and/or SSD? Standalone DB or not? If there is one, how large is it, and how large are the spillovers, if any?

The underlying device is a RAID0 array of 2 HDDs, with no standalone DB.

I think it could be too high an OSD load during the migration. I raised pg_num from 32 to 64, then 128, then 256 within about a minute. Another OSD was marked down at almost the same time, and I restarted it manually. But is it expected to crash under high load?

It's definitely not expected to crash, but in my experience it's bad practice to put DB data on spinners. Even with RocksDB on an SSD, one might see issues under heavy DB load scenarios (and PG removal is one of them), not to mention an all-HDD setup.
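(For reference, a standalone DB on a faster device is set up at OSD creation time; a minimal ceph-volume sketch, with device paths that are assumptions for illustration, not taken from this cluster:)

# data on the HDD, RocksDB DB/WAL on an SSD partition or LV (paths are examples)
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc1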

Actions #5

Updated by 玮文 胡 over 3 years ago

Igor Fedotov wrote:

It's definitely not expected to crash, but in my experience it's bad practice to put DB data on spinners. Even with RocksDB on an SSD, one might see issues under heavy DB load scenarios (and PG removal is one of them), not to mention an all-HDD setup.

Thanks for the response. We are planning to deploy new OSDs with an SSD DB device. However, we still have some nodes with no SSD installed. A bit off topic, but cephadm does not seem to support deploying a single OSD with a standalone DB device.
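(For what it's worth, cephadm does accept a drive-group style OSD service spec with separate data and DB devices; whether it can target exactly one OSD is a separate question. A rough sketch, where the service id, host name and device paths are placeholders:)

cat > osd-with-db.yaml <<'EOF'
service_type: osd
service_id: osd_with_db
placement:
  hosts:
    - <hostname>
data_devices:
  paths:
    - /dev/sdb
db_devices:
  paths:
    - /dev/sdc
EOF
ceph orch apply osd -i osd-with-db.yaml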

Actions #6

Updated by Jerry Pu about 3 years ago

We have also experienced similar issues in our v15.2.4 clusters recently. We are running tests in which we keep writing data into pools; as the used size of the pools grows, pg_autoscaler doubles the pools' pg_num. In many tests, some OSDs crashed right after the pool PGs were scaled up, and many osd_op_tp timeout logs were observed. The devices backing the OSDs are SSDs.
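(If useful for correlating, the autoscaler's current and target pg_num per pool can be checked, and autoscaling can be switched off per pool while testing; <pool> below is a placeholder:)

# show current and target pg_num as computed by the autoscaler
ceph osd pool autoscale-status
# optionally disable autoscaling for a pool while investigating
ceph osd pool set <pool> pg_autoscale_mode off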

Actions #7

Updated by Igor Fedotov over 1 year ago

  • Status changed from New to Duplicate
Actions #8

Updated by Igor Fedotov over 1 year ago

  • Related to Bug #53002: crash BlueStore::Onode::put from BlueStore::TransContext::~TransContext added