Bug #48781

closed

crash in BlueStore::Onode::put()

Added by Gerry D over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific, octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Following the earlier issue reported in #48778, I now see frequent OSD crashes. I'm not sure whether the two issues are related.

1/7/21 5:02:55 PM [INF] osd.4 [v2:10.10.20.161:6804/799912031,v1:10.10.20.161:6805/799912031] boot
1/7/21 5:02:55 PM [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
1/7/21 5:02:53 PM [WRN] Health check update: Degraded data redundancy: 21365636/111512952 objects degraded (19.160%), 76 pgs degraded (PG_DEGRADED)
1/7/21 5:02:47 PM [WRN] Health check update: Degraded data redundancy: 21365611/111512841 objects degraded (19.160%), 76 pgs degraded (PG_DEGRADED)
1/7/21 5:02:41 PM [WRN] Health check update: Degraded data redundancy: 21365580/111512715 objects degraded (19.160%), 76 pgs degraded (PG_DEGRADED)
1/7/21 5:02:35 PM [WRN] Health check failed: Degraded data redundancy: 12558393/111512823 objects degraded (11.262%), 43 pgs degraded (PG_DEGRADED)

Any help in understanding what is happening is welcome.


Files

crash.zip (227 KB), Tom Myny, 01/07/2021 02:45 PM
crashosd2.zip (114 KB), Tom Myny, 02/01/2021 10:41 AM
osd2fulllog.zip (188 KB), osd2 logs +10000 lines, Tom Myny, 02/01/2021 11:59 AM
requestextralogs.zip (125 KB), Tom Myny, 02/01/2021 12:03 PM

Related issues 4 (0 open, 4 closed)

Related to bluestore - Bug #48966: FAILED ceph_assert(o->pinned) in BlueStore::Collection::split_cache(BlueStore::Collection*) (Resolved)
Related to bluestore - Bug #54650: crash: BlueStore::Onode::put() (Duplicate)
Copied to bluestore - Backport #49099: octopus: crash in BlueStore::Onode::put() (Resolved)
Copied to bluestore - Backport #49100: pacific: crash in BlueStore::Onode::put() (Resolved)
Actions #1

Updated by Tom Myny over 3 years ago

Here is some extra information regarding this problem:

{
"backtrace": [
"(()+0x12b20) [0x7f0afc7a8b20]",
"(BlueStore::Onode::put()+0x163) [0x55b5e22520d3]",
"(std::_Rb_tree<boost::intrusive_ptr<BlueStore::Onode>, boost::intrusive_ptr<BlueStore::Onode>, std::_Identity<boost::intrusive_ptr<BlueStore::Onode> >, std::less<boost::intrusive_ptr<BlueStore::Onode> >, std::allocator<boost::intrusive_ptr<BlueStore::Onode> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::Onode> >*)+0x31) [0x55b5e22feab1]",
"(BlueStore::TransContext::~TransContext()+0x11c) [0x55b5e22fed6c]",
"(BlueStore::_txc_finish(BlueStore::TransContext*)+0x23b) [0x55b5e22b7f8b]",
"(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x244) [0x55b5e22b97e4]",
"(BlueStore::_kv_finalize_thread()+0x54b) [0x55b5e22badfb]",
"(BlueStore::KVFinalizeThread::entry()+0x11) [0x55b5e2303fd1]",
"(()+0x814a) [0x7f0afc79e14a]",
"(clone()+0x43) [0x7f0afb4d5f23]"
],
"ceph_version": "15.2.8",
"crash_id": "2021-01-07T04:09:01.930936Z_d8616ead-b495-4186-ab34-b7c6feb46ce5",
"entity_name": "osd.3",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": "04d43d0f0a565e3bf0f5b27c4b982970490c4cb03f1b83d00dc7747ac8b602a9",
"timestamp": "2021-01-07T04:09:01.930936Z",
"utsname_hostname": "ceph0",
"utsname_machine": "x86_64",
"utsname_release": "4.19.0-13-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 4.19.160-2 (2020-11-28)"
}

Actions #2

Updated by Tom Myny over 3 years ago

See the attached file for extra logs.

Actions #3

Updated by Tom Myny over 3 years ago

We also see the following in our OS logs:

[119268.259883] tp_osd_tp32332: segfault at 0 ip 00007f8ccce40733 sp 00007f8ca881de70 error 4 in libtcmalloc.so.4.5.3[7f8ccce15000+4d000]
[119268.259890] Code: d9 4c 89 e7 41 29 df e8 2b bd fe ff 44 39 fb 7c a6 48 8b 75 00 48 89 f2 41 83 ff 01 7e 14 b8 01 00 00 00 0f 1f 40 00 83 c0 01 <48> 8b 12 41 39 c7 75 f5 48 8b 02 48 89 45 00 48 c7 02 00 00 00 00
[213681.771634] msgr-worker-019460: segfault at 7f183af6d010 ip 000055c3aed536db sp 00007f18387878c0 error 7 in ceph-osd[55c3ae512000+1629000]
[213681.771642] Code: 4c 3b 6d 20 0f 87 d5 03 00 00 bf 40 00 00 00 4c 89 44 24 08 e8 f6 5d c6 ff 49 8b 14 24 4c 8b 44 24 08 49 c7 04 24 00 00 00 00 <44> 89 70 10 48 89 50 18 49 8b 54 24 08 48 89 50 20 49 8b 54 24 10
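
For anyone reading these kernel lines: on x86 (the crash report gives `x86_64`), the number after `error` is the page-fault error code, whose low bits say whether the page was present, whether the access was a write, and whether it came from user mode. Below is a minimal sketch of a decoder. The function name `decode_fault` is made up for illustration and is not part of any Ceph or kernel tooling.

```cpp
#include <string>

// Decode the x86 page-fault error code that the kernel prints as "error N".
// Bit 0: 0 = page not present, 1 = protection violation
// Bit 1: 0 = read, 1 = write
// Bit 2: 0 = kernel mode, 1 = user mode
// Bit 4: 1 = instruction fetch
// decode_fault is a hypothetical helper name used only in this sketch.
std::string decode_fault(unsigned code) {
    std::string s;
    s += (code & 0x4) ? "user-mode " : "kernel-mode ";
    if (code & 0x10) {
        s += "instruction fetch";
    } else {
        s += (code & 0x2) ? "write" : "read";
    }
    s += (code & 0x1) ? ", protection violation" : ", page not present";
    return s;
}
```

Read this way, `segfault at 0 ... error 4` in libtcmalloc is a user-mode read through a null pointer, and `error 7` in ceph-osd is a user-mode write to a mapped page that did not permit it.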

Actions #4

Updated by Tom Myny over 3 years ago

On another system we see the following too:

Jan 7 10:02:32 ceph1 kernel: [114774.759038] tp_osd_tp17449: segfault at 0 ip 00007f3a1587a6e3 sp 00007f39eda4afe0 error 4 in libtcmalloc.so.4.5.3[7f3a1584f000+4d000]
Jan 7 10:02:32 ceph1 kernel: [114774.759044] Code: 1f 84 00 00 00 00 00 85 db 0f 84 b8 00 00 00 48 8b 75 00 48 89 f2 83 fb 01 7e 16 b8 01 00 00 00 0f 1f 80 00 00 00 00 83 c0 01 <48> 8b 12 39 c3 75 f6 48 8b 02 48 89 45 00 48 c7 02 00 00 00 00 8b

Actions #5

Updated by Tom Myny over 3 years ago

And on the last host:

Jan 7 07:34:17 ceph2 kernel: [107054.315343] tp_osd_tp20519: segfault at 0 ip 00007efd3db4e733 sp 00007efd15524400 error 4 in libtcmalloc.so.4.5.3[7efd3db23000+4d000]
Jan 7 07:34:17 ceph2 kernel: [107054.315355] Code: d9 4c 89 e7 41 29 df e8 2b bd fe ff 44 39 fb 7c a6 48 8b 75 00 48 89 f2 41 83 ff 01 7e 14 b8 01 00 00 00 0f 1f 40 00 83 c0 01 <48> 8b 12 41 39 c7 75 f5 48 8b 02 48 89 45 00 48 c7 02 00 00 00 00
Jan 8 13:31:33 ceph2 kernel: [214888.287632] bstore_kv_final8051: segfault at 0 ip 00007fa9b41b8733 sp 00007fa9a49b4a30 error 4 in libtcmalloc.so.4.5.3[7fa9b418d000+4d000]
Jan 8 13:31:33 ceph2 kernel: [214888.287638] Code: d9 4c 89 e7 41 29 df e8 2b bd fe ff 44 39 fb 7c a6 48 8b 75 00 48 89 f2 41 83 ff 01 7e 14 b8 01 00 00 00 0f 1f 40 00 83 c0 01 <48> 8b 12 41 39 c7 75 f5 48 8b 02 48 89 45 00 48 c7 02 00 00 00 00

Actions #6

Updated by Neha Ojha over 3 years ago

  • Project changed from RADOS to bluestore
  • Subject changed from OSD: frequent crashes to crash in BlueStore::Onode::put()
Actions #7

Updated by Igor Fedotov over 3 years ago

Could you please share another 10000 lines of log preceding those in crash.zip?

Actions #8

Updated by Igor Fedotov over 3 years ago

  • Status changed from New to Need More Info
Actions #9

Updated by Tom Myny over 3 years ago

Here is a dump of our latest crash.

Actions #10

Updated by Igor Fedotov over 3 years ago

Tom Myny wrote:

Here is a dump of our latest crash

@Tom Verdaat, may I have an additional 10000 lines of the log preceding the crash, please?

And how often do you observe such crashes?

Actions #11

Updated by Tom Myny over 3 years ago

Here you go (output from cephadm logs).

This is the first crash in a week.

Actions #12

Updated by Tom Myny over 3 years ago

Extra logs

Actions #13

Updated by Igor Fedotov over 3 years ago

  • Status changed from Need More Info to Fix Under Review
  • Pull request ID set to 39041

@Tom Verdaat - thanks a lot.
I presume the root cause of the bug is an improper (too early) nref decrement in the Onode::put method.
https://github.com/ceph/ceph/pull/39041 casually fixes this.
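
To illustrate why a too-early reference-count decrement can crash, here is a simplified sketch. It is not the BlueStore code and not the change made in the pull request. `Node`, `cache_lock`, `pinned`, `put_hazardous` and the `destroyed` counter are hypothetical names chosen for the example. The hazard is that once a thread has decremented the count it no longer owns a reference, so any later access to the object can race with another thread's final put and touch freed memory. The conservative variant shown here does the decrement and the bookkeeping under a single lock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for an intrusively ref-counted, cache-tracked object.
// None of these names are taken from the Ceph sources.
struct Node {
    std::atomic<int> nref{0};
    bool pinned = false;     // bookkeeping that put() has to update
    std::mutex* cache_lock;  // lock owned by the (hypothetical) cache; outlives the node
    int* destroyed;          // counter bumped by the destructor, for demonstration only

    Node(std::mutex* l, int* d) : cache_lock(l), destroyed(d) {}
    ~Node() { ++*destroyed; }

    void get() {
        std::lock_guard<std::mutex> g(*cache_lock);
        if (++nref >= 2) pinned = true;  // referenced by someone besides the cache
    }

    // Hazardous ordering, shown for contrast only: after --nref this thread
    // no longer owns a reference, so another thread may delete the object
    // before `pinned` is written.
    void put_hazardous() {
        int n = --nref;
        if (n == 1) pinned = false;  // possible use-after-free under concurrency
        if (n == 0) delete this;
    }

    // Conservative ordering: the decrement and the bookkeeping happen under
    // one lock, so no other put() can reach zero while this call still
    // touches the object.
    void put() {
        int n;
        {
            std::lock_guard<std::mutex> g(*cache_lock);
            n = --nref;
            assert(n >= 0);
            if (n == 1) pinned = false;
        }
        if (n == 0) delete this;  // last reference: no other thread can see the object
    }
};
```

Taking a lock on every put is deliberately heavy-handed. It is used here only to make the ordering requirement explicit, and a real fix may use a cheaper scheme.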

Actions #14

Updated by Igor Fedotov over 3 years ago

  • Backport set to pacific, octopus
Actions #15

Updated by Igor Fedotov over 3 years ago

  • Related to Bug #48966: FAILED ceph_assert(o->pinned) in BlueStore::Collection::split_cache(BlueStore::Collection*) added
Actions #16

Updated by Igor Fedotov over 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #17

Updated by Igor Fedotov over 3 years ago

  • Copied to Backport #49099: octopus: crash in BlueStore::Onode::put() added
Actions #18

Updated by Igor Fedotov over 3 years ago

  • Copied to Backport #49100: pacific: crash in BlueStore::Onode::put() added
Actions #19

Updated by Igor Fedotov about 3 years ago

  • Status changed from Pending Backport to Resolved
Actions #20

Updated by Telemetry Bot about 2 years ago

  • Related to Bug #54650: crash: BlueStore::Onode::put() added