Bug #57895

OSD crash in Onode::put()

Added by dongdong tao over 1 year ago. Updated over 1 year ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This issue happens when an Onode is trimmed immediately after it is unpinned, which is possible when the LRU list is extremely short.
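
A minimal sketch of the suspected interleaving, assuming a simplified node with an atomic refcount and a shared LRU protected by a mutex; this is not the actual BlueStore code, and the names Node, put and trim_to are illustrative only:

// Simplified illustration of the suspected race; NOT the actual BlueStore
// code. Node, put and trim_to are hypothetical names.
#include <atomic>
#include <list>
#include <mutex>
#include <thread>

struct Node {
    std::atomic<int> nref{1};
    bool pinned = true;
};

std::mutex lru_lock;
std::list<Node*> lru;              // unpinned nodes, eligible for trimming

// Thread A: drop the last reference and unpin the node.
void put(Node* n) {
    if (--n->nref == 0) {
        {
            std::lock_guard<std::mutex> l(lru_lock);
            n->pinned = false;
            lru.push_back(n);      // node is now visible to the trim thread
        }
        // Window: with an extremely short LRU, trim_to() can pop and delete
        // the node right here, before put() has finished with it.
        bool repin = (n->nref.load() > 0);   // potential use-after-free
        (void)repin;
    }
}

// Thread B: trim the LRU down to max entries, freeing the victims.
void trim_to(std::size_t max) {
    std::lock_guard<std::mutex> l(lru_lock);
    while (lru.size() > max) {
        Node* victim = lru.front();
        lru.pop_front();
        delete victim;             // frees a node the other thread may still touch
    }
}

int main() {
    Node* n = new Node;
    std::thread a(put, n);         // unpin path
    std::thread b(trim_to, 0);     // trim path with an "extremely short" LRU
    a.join();
    b.join();
}

If onodes stay pinned while referenced, the trim thread should never see a node another thread is still using; the crash stacks below suggest that window is not closed when the LRU is nearly empty.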

Below are the crash stacks (one from the trim path and one from the unpin path):

1: (()+0x12890) [0x7f74d588a890]
2: (ceph::buffer::v15_2_0::ptr::release()+0x8) [0x555c649a9e18]
3: (BlueStore::Onode::put()+0x1c1) [0x555c6462c621]
4: (std::__detail::_Hashtable_alloc<mempool::pool_allocator<(mempool::pool_index_t)4, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x35) [0x555c646dc3c5]
5: (std::_Hashtable<ghobject_t, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<ghobject_t>, std::hash<ghobject_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase(unsigned long, std::__detail::_Hash_node_base*, std::__detail::_Hash_node<std::pair<ghobject_t const, boost::intrusive_ptr<BlueStore::Onode> >, true>*)+0x53) [0x555c646dc803]
6: (BlueStore::OnodeSpace::_remove(ghobject_t const&)+0x12c) [0x555c6462c2cc]
7: (LruOnodeCacheShard::_trim_to(unsigned long)+0xce) [0x555c646dd33e]
8: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>&)+0x152) [0x555c6462ce22]
9: (BlueStore::Collection::get_onode(ghobject_t const&, bool, bool)+0x384) [0x555c6468d5a4]
10: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1c29) [0x555c64696999]
11: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2ae) [0x555c646afb4e]
12: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x555c6433af54]
13: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xb08) [0x555c644e5f18]
14: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x187) [0x555c644f6397]
15: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x87) [0x555c64384517]
16: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x684) [0x555c6432acd4]
17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x159) [0x555c641b7229]
18: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x67) [0x555c6440a227]
19: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x623) [0x555c641d35f3]
20: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x555c64807f0c]
21: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x555c6480b160]
22: (()+0x76db) [0x7f74d587f6db]
23: (clone()+0x3f) [0x7f74d55a888f]

and

ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
1: (()+0x12890) [0x7ff0ee5fd890]
2: (ceph::buffer::v15_2_0::ptr::release()+0x8) [0x55f9c9954e18]
3: (BlueStore::Onode::put()+0x1c1) [0x55f9c95d7621]
4: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::Onode>, boost::intrusive_ptr<BlueStore::Onode>, std::_Identity<boost::intrusive_ptr<BlueStore::Onode> >, std::less<boost::intrusive_ptr<BlueStore::Onode> >, std::allocator<boost::intrusive_ptr<BlueStore::Onode> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::Onode> >*)+0x2d) [0x55f9c9687d0d]
5: (BlueStore::TransContext::~TransContext()+0x114) [0x55f9c9687e44]
6: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x448) [0x55f9c9617788]
7: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x24c) [0x55f9c961907c]
8: (BlueStore::_kv_finalize_thread()+0x48c) [0x55f9c965c31c]
9: (BlueStore::KVFinalizeThread::entry()+0xd) [0x55f9c968c01d]
10: (()+0x76db) [0x7ff0ee5f26db]
11: (clone()+0x3f) [0x7ff0ee31b88f]

I believe this issue is still present on the master branch.


Related issues

Duplicates: bluestore - Bug #56382: ONode ref counting is broken (Resolved)

History

#1 Updated by dongdong tao over 1 year ago

This was observed on 15.2.16, but I believe the code defect that causes this kind of race condition is still present on master; I will post a patch soon.

#2 Updated by Igor Fedotov over 1 year ago

  • Duplicates Bug #56382: ONode ref counting is broken added

#3 Updated by dongdong tao over 1 year ago

Please help review this one: https://github.com/ceph/ceph/pull/48566

Here is the related log: https://pastebin.ubuntu.com/p/2yfHFQYnyR/

#4 Updated by Yaarit Hatuka over 1 year ago

  • Status changed from New to Duplicate

Status changed from "New" to "Duplicate" since this issue duplicates https://tracker.ceph.com/issues/56382.

#5 Updated by dongdong tao over 1 year ago

Yaarit Hatuka wrote:

Status changed from "New" to "Duplicate" since this issue duplicates https://tracker.ceph.com/issues/56382.

Are we sure it's a duplicate? The crash stack is the same, but the reason for the race condition might not be.
But I'm OK with discussing it anywhere, as long as we can move it forward.

#6 Updated by Igor Fedotov over 1 year ago

dongdong tao wrote:

Yaarit Hatuka wrote:

Status changed from "New" to "Duplicate" since this issue duplicates https://tracker.ceph.com/issues/56382.

Are we sure it's a duplicate? The crash stack is the same, but the reason for the race condition might not be.
But I'm OK with discussing it anywhere, as long as we can move it forward.

I'm pretty sure it's a duplicate - see my description of the issue in https://github.com/ceph/ceph/pull/47702, which fixes #56382.

#7 Updated by dongdong tao over 1 year ago

OK, thanks Igor for your confirmation. I'm reviewing your patch; we can discuss it over there.

#8 Updated by A. Saber Shenouda over 1 year ago

dongdong tao wrote:

OK, thanks Igor for your confirmation. I'm reviewing your patch; we can discuss it over there.

We have had the same issue since we upgraded to 16.2.10. It's an EC RBD pool with medium usage.

We get random OSD failures (the OSD restarts automatically) 2 or 3 times per day.

@timestamp message
2022-12-24T13:13:43.225Z [5335688.438044] tp_osd_tp2655970: segfault at 0 ip 00007f349c69d573 sp 00007f347a016e40 error 4 in libtcmalloc.so.4.5.3[7f349c672000+4d000]
2022-12-24T10:24:55.593Z 140336454874880 / tp_osd_tp
2022-12-24T10:24:55.593Z 140336471660288 / tp_osd_tp
2022-12-24T10:24:55.593Z 140336522016512 / tp_osd_tp
2022-12-24T10:24:55.593Z 140336538801920 / tp_osd_tp
2022-12-24T10:24:48.076Z 140336454874880 / tp_osd_tp
2022-12-24T10:24:48.076Z 140336538801920 / tp_osd_tp
2022-12-24T10:24:48.076Z 140336471660288 / tp_osd_tp
2022-12-24T10:24:48.076Z 140336522016512 / tp_osd_tp
2022-12-24T08:15:18.304Z [11627582.547179] tp_osd_tp290821: segfault at 100000000000 ip 00007fb0ef4abd6a sp 00007fb0caf0aab8 error 4 in libtcmalloc.so.4.5.3[7fb0ef470000+4d000]
2022-12-24T04:54:13.352Z 139654831949568 / tp_osd_tp
2022-12-24T04:54:13.352Z 139654865520384 / tp_osd_tp
2022-12-24T04:54:13.352Z 139654798378752 / tp_osd_tp
2022-12-24T04:54:13.352Z 139654815164160 / tp_osd_tp
2022-12-24T04:54:12.710Z 139654831949568 / tp_osd_tp
2022-12-24T04:54:12.710Z 139654865520384 / tp_osd_tp
2022-12-24T04:54:12.709Z 139654815164160 / tp_osd_tp
2022-12-24T04:54:12.709Z 139654798378752 / tp_osd_tp
2022-12-23T19:19:34.858Z [6199032.842139] traps: tp_osd_tp5328 general protection fault ip:7fb6fa59d6cb sp:7fb6d2604e58 error:0 in libtcmalloc.so.4.5.3[7fb6fa561000+4d000]
2022-12-23T15:20:28.058Z 140228391945984 / tp_osd_tp
2022-12-23T15:20:28.058Z 140228375160576 / tp_osd_tp
2022-12-23T15:20:28.058Z 140228308018944 / tp_osd_tp
2022-12-23T15:20:25.853Z 140228375160576 / tp_osd_tp
2022-12-23T15:20:25.853Z 140228391945984 / tp_osd_tp
2022-12-23T15:20:25.853Z 140228308018944 / tp_osd_tp
2022-12-23T13:04:09.255Z 140179572086528 / tp_osd_tp
2022-12-23T13:04:09.254Z 140179454588672 / tp_osd_tp
2022-12-23T13:04:09.254Z 140179538515712 / tp_osd_tp
2022-12-23T13:04:09.254Z 140179521730304 / tp_osd_tp
2022-12-23T13:04:09.254Z 140179462981376 / tp_osd_tp
2022-12-23T13:04:04.863Z 140179572086528 / tp_osd_tp
2022-12-23T13:04:04.862Z 140179462981376 / tp_osd_tp
2022-12-23T13:04:04.862Z 140179538515712 / tp_osd_tp
2022-12-23T13:04:04.862Z 140179454588672 / tp_osd_tp
2022-12-23T08:03:19.151Z [8257412.891972] traps: tp_osd_tp3370888 general protection fault ip:7fe849250d6a sp:7fe827bc2d38 error:0 in libtcmalloc.so.4.5.3[7fe849215000+4d000]
2022-12-23T04:10:54.942Z 140083659638528 / tp_osd_tp
2022-12-23T04:10:54.942Z 140083701602048 / tp_osd_tp
2022-12-23T04:10:54.942Z 140083726780160 / tp_osd_tp
2022-12-23T04:10:54.942Z 140083651245824 / tp_osd_tp
2022-12-23T04:10:54.210Z 140083651245824 / tp_osd_tp
2022-12-23T04:10:54.210Z 140083701602048 / tp_osd_tp
2022-12-23T04:10:54.210Z 140083659638528 / tp_osd_tp
2022-12-23T04:10:54.210Z 140083726780160 / tp_osd_tp
2022-12-23T02:31:19.381Z 139891808802560 / tp_osd_tp
2022-12-23T02:31:19.381Z 139891842373376 / tp_osd_tp
2022-12-23T02:31:19.381Z 139891909515008 / tp_osd_tp
2022-12-23T02:31:18.721Z 139891842373376 / tp_osd_tp
2022-12-23T02:31:18.721Z 139891808802560 / tp_osd_tp
2022-12-23T02:31:18.721Z 139891909515008 / tp_osd_tp
2022-12-23T01:04:04.999Z [7024804.461708] tp_osd_tp1395501: segfault at 0 ip 0000564ae9edeeba sp 00007f3d330f16d0 error 6 in ceph-osd[564ae9486000+1813000]
2022-12-22T22:47:44.046Z [5197336.391888] tp_osd_tp2653976: segfault at 0 ip 00007f57bdc79573 sp 00007f57965e78f0 error 4 in libtcmalloc.so.4.5.3[7f57bdc4e000+4d000]
2022-12-22T16:40:43.191Z 140558401672960 / tp_osd_tp
2022-12-22T16:40:43.191Z 140558410065664 / tp_osd_tp
2022-12-22T16:40:43.191Z 140558460421888 / tp_osd_tp
2022-12-22T16:40:43.191Z 140558477207296 / tp_osd_tp
2022-12-22T16:40:40.404Z 140558477207296 / tp_osd_tp
2022-12-22T16:40:40.403Z 140558460421888 / tp_osd_tp
2022-12-22T16:40:40.403Z 140558401672960 / tp_osd_tp
2022-12-22T16:40:40.403Z 140558410065664 / tp_osd_tp
2022-12-22T10:57:18.258Z [11455588.932453] tp_osd_tp312152: segfault at 0 ip 00007f3dda890573 sp 00007f3db29d5240 error 4 in libtcmalloc.so.4.5.3[7f3dda865000+4d000]
2022-12-21T17:10:45.694Z 140464781670144 / tp_osd_tp
2022-12-21T17:10:45.694Z 140464848811776 / tp_osd_tp
2022-12-21T17:10:45.693Z 140464731313920 / tp_osd_tp
2022-12-21T17:10:44.927Z 140464848811776 / tp_osd_tp
2022-12-21T17:10:44.927Z 140464731313920 / tp_osd_tp
2022-12-21T17:10:44.927Z 140464781670144 / tp_osd_tp
2022-12-20T23:28:33.811Z [5033357.226190] tp_osd_tp2671913: segfault at 1081caf9000 ip 00007f3a3d677597 sp 00007f3a17f985e0 error 4 in libtcmalloc.so.4.5.3[7f3a3d63c000+4d000]
2022-12-19T16:30:22.887Z 139984359474944 / tp_osd_tp
2022-12-19T16:30:22.887Z 139984351082240 / tp_osd_tp
2022-12-19T16:30:22.887Z 139984426616576 / tp_osd_tp
2022-12-19T16:30:18.580Z 139984351082240 / tp_osd_tp
2022-12-19T16:30:18.580Z 139984359474944 / tp_osd_tp
2022-12-19T16:30:18.580Z 139984426616576 / tp_osd_tp
2022-12-18T01:32:30.802Z [4771882.511313] traps: tp_osd_tp2625162 general protection fault ip:7f185819f603 sp:7f18362e2d70 error:0 in libtcmalloc.so.4.5.3[7f1858178000+4d000]
2022-12-18T00:44:54.009Z 140184194049792 / tp_osd_tp
2022-12-18T00:44:54.009Z 140184126908160 / tp_osd_tp
2022-12-18T00:44:53.606Z 140184126908160 / tp_osd_tp
2022-12-18T00:44:53.606Z 140184194049792 / tp_osd_tp
