Project

General

Profile

Bug #39334

OSD crashed in BitmapAllocator::init_add_free()

Added by Neha Ojha almost 5 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
mimic, luminous, nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-04-16 18:01:39.898 7f7902619080  1 bluestore(/var/lib/ceph/osd/ceph-3/) _open_alloc opening allocation metadata
2019-04-16 18:01:39.901 7f7902619080 -1 *** Caught signal (Aborted) **
 in thread 7f7902619080 thread_name:ceph-osd

 ceph version 14.2.0-142-g2f9c072 (2f9c0720b5aed4c9e25e8b050e71856df0a986ad) nautilus (stable)
 1: (()+0x12d80) [0x7f78ff1abd80]
 2: (gsignal()+0x10f) [0x7f78fde8693f]
 3: (abort()+0x127) [0x7f78fde70c95]
 4: (()+0x65ca48) [0x55c0cc7f8a48]
 5: (BitmapAllocator::init_add_free(unsigned long, unsigned long)+0x857) [0x55c0ccdf6a87]
 6: (BlueStore::_open_alloc()+0x193) [0x55c0ccca0ae3]
 7: (BlueStore::_open_db_and_around(bool)+0xa6) [0x55c0cccc25b6]
 8: (BlueStore::_fsck(bool, bool)+0x587) [0x55c0cccf59d7]
 9: (BlueStore::mkfs()+0x141f) [0x55c0ccd0564f]
 10: (OSD::mkfs(CephContext*, ObjectStore*, uuid_d, int)+0x1ae) [0x55c0cc818d7e]
 11: (main()+0x1bd1) [0x55c0cc7110c1]
 12: (__libc_start_main()+0xf3) [0x7f78fde72813]
 13: (_start()+0x2e) [0x55c0cc7f72fe]

ceph-osd.3.log.failure (102 KB) Neha Ojha, 04/16/2019 07:06 PM

ceph-volume.log View - ceph-volume log (736 KB) Alexander Calhoun, 04/16/2019 08:02 PM

ceph-osd.1.log View - ceph osd1 log (102 KB) Alexander Calhoun, 04/16/2019 08:02 PM

ceph-osd.0.log View - ceph osd0 log (407 KB) Alexander Calhoun, 04/16/2019 08:02 PM

ceph-ansible-output-part4.txt View - portion of ceph-anisble log with error (592 KB) Alexander Calhoun, 04/16/2019 08:09 PM

ceph-volume.log View (119 KB) Alexander Calhoun, 04/17/2019 11:37 AM

ceph-osd.1.log View (127 KB) Alexander Calhoun, 04/17/2019 11:37 AM

ceph-ansible-output-logging-part2.txt View (682 KB) Alexander Calhoun, 04/17/2019 11:37 AM

ceph-ansible-output-logging-new-part3.txt View (45 KB) Alexander Calhoun, 04/18/2019 05:15 PM

ceph-volume.log View (119 KB) Alexander Calhoun, 04/18/2019 05:15 PM

ceph-osd.2.log View (127 KB) Alexander Calhoun, 04/18/2019 05:15 PM


Related issues

Copied to bluestore - Backport #39444: luminous: OSD crashed in BitmapAllocator::init_add_free() Resolved
Copied to bluestore - Backport #39445: mimic: OSD crashed in BitmapAllocator::init_add_free() Resolved
Copied to bluestore - Backport #39446: nautilus: OSD crashed in BitmapAllocator::init_add_free() Resolved

History

#1 Updated by Igor Fedotov almost 5 years ago

  • Status changed from New to Need More Info

Could you please collect the log with debug bluestore = 20

Any other information about the issue:
Is it reproducible?
Any specific test case?

#2 Updated by Alexander Calhoun almost 5 years ago

Issue occurs when install nautilus (ceph version 14.2.0-142-g2f9c072) on RHEL 8 using ceph-ansible. Issue occurs on every attempt.

Attaching ceph-ansible output, osd 0 and 1 logs, and ceph-volume logs

debug bluestore was set to 20

all.yml
---
dummy:
fetch_directory: /ceph-ansible-keys
ceph_origin: distro
ceph_rhcs_version: 4
monitor_interface: ens7f0
journal_size: 5120 # OSD journal size in MB
public_network: 10.1.0.0/16
cluster_network: 10.2.0.0/16
osd_objectstore: bluestore
ceph_conf_overrides:
mon:
mon_allow_pool_delete: true
default:
debug_bluestore: 20

osd.yml
#--------------------------------------------------------------------
  1. use 'ceph-volume lvm batch' mode
    osd_scenario: lvm
    osds_per_device: 2
    devices:
    - /dev/nvme0n1
    - /dev/nvme1n1
    - /dev/nvme2n1
    - /dev/nvme3n1
    - /dev/nvme4n1

generated ceph.conf
[default]
debug_bluestore = 20

  1. Please do not change this file directly since it is managed by Ansible and will be overwritten
    [global]
    cluster network = 10.2.0.0/16
    fsid = b2987ad0-ea8c-4c34-b1cc-28c50fe789cf
    mon host = [v2:10.1.24.17:3300,v1:10.1.24.17:6789],[v2:10.1.24.19:3300,v1:10.1.24.19:6789],[v2:10.1.24.21:3300,v1:10.1.24.21:6789]
    public network = 10.1.0.0/16

[mon]
mon_allow_pool_delete = True

[osd]
osd memory target = 14082511994

#3 Updated by Igor Fedotov almost 5 years ago

@Alexander, sorry but you weren't successful in setting 'debug bluestore = 20'
see the following lines at the end of your logs...
...
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
...

#4 Updated by Alexander Calhoun almost 5 years ago

re-ran with confirmed logging level for bluestore

#5 Updated by Igor Fedotov almost 5 years ago

@Alexander, thanks for the update.

Looks like you have a custom Ceph build.
At github tag v14.2.0 ends with commit 3a54b2b6d167d4a2a19e003a705696d4fe619afc

But in you log I can see
ceph version 14.2.0-142-g2f9c072 (2f9c0720b5aed4c9e25e8b050e71856df0a986ad)
which indicates later commit.

So now the question is whether your build contains the following PR: https://github.com/ceph/ceph/pull/27139

Please add if not or (even better) try against the HEAD nautilus build.

#6 Updated by Sage Weil almost 5 years ago

It does contain 7f9543c11f1d7d81162f93b996a6f95656cc1a01 (the merge commit).

#7 Updated by Alexander Calhoun almost 5 years ago

received updated ceph version (14.2.0-299-g167dbbf (167dbbfbf80acbc7fdcbd3d204d941d24dc6c788) nautilus (stable))

issue still occuring, attaching logs

#8 Updated by Sage Weil almost 5 years ago

(gdb) bt
#0  0x00007fdec18fac1f in raise () from /lib64/libpthread.so.0
#1  0x0000562cd59856c3 in reraise_fatal (signum=6) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/global/signal_handler.cc:326
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  0x00007fdec05d593f in raise () from /lib64/libc.so.6
#5  0x00007fdec05bfc95 in abort () from /lib64/libc.so.6
#6  0x0000562cd5361458 in std::__replacement_assert (__file=__file@entry=0x562cd607b5a8 "/usr/include/c++/8/bits/stl_vector.h", __line=__line@entry=932, 
    __function=__function@entry=0x562cd611d8a0 <std::vector<unsigned long, mempool::pool_allocator<(mempool::pool_index_t)1, unsigned long> >::operator[](unsigned long)::__PRETTY_FUNCTION__> "std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = long unsigned int; _Alloc = mempool::pool_allocator<(mempool::pool_index_t)1, "..., __condition=__condition@entry=0x562cd607ba28 "__builtin_expect(__n < this->size(), true)") at /usr/include/c++/8/x86_64-redhat-linux/bits/c++config.h:2391
#7  0x0000562cd595f487 in std::vector<unsigned long, mempool::pool_allocator<(mempool::pool_index_t)1, unsigned long> >::operator[] (__n=<optimized out>, this=0x562ce0c99140) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/log/Entry.h:35
#8  AllocatorLevel01Loose::_mark_free_l0 (l0_pos_end=195297280, l0_pos_start=1, this=0x562ce0c99138) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/os/bluestore/fastbmap_allocator_impl.h:355
#9  AllocatorLevel01Loose::_mark_free_l1_l0 (l0_pos_end=195297280, l0_pos_start=1, this=0x562ce0c99138) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/os/bluestore/fastbmap_allocator_impl.h:365
#10 AllocatorLevel01Loose::_free_l1 (len=<optimized out>, offs=16384, this=0x562ce0c99138) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/os/bluestore/fastbmap_allocator_impl.h:424
#11 AllocatorLevel02<AllocatorLevel01Loose>::_mark_free (len=<optimized out>, o=16384, this=0x562ce0c99108) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/os/bluestore/fastbmap_allocator_impl.h:762
#12 BitmapAllocator::init_add_free (this=0x562ce0c99100, offset=<optimized out>, length=<optimized out>) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/os/bluestore/BitmapAllocator.cc:67
#13 0x0000562cd5809393 in BlueStore::_open_alloc (this=0x562ce1aee000) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/os/bluestore/BlueStore.cc:4866
#14 0x0000562cd582ae66 in BlueStore::_open_db_and_around (this=0x562ce1aee000, read_only=<optimized out>) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/os/bluestore/BlueStore.cc:5251
#15 0x0000562cd585e297 in BlueStore::_fsck (this=0x562ce1aee000, deep=false, repair=<optimized out>) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/os/bluestore/BlueStore.cc:6947
#16 0x0000562cd586dfbf in BlueStore::fsck (deep=<optimized out>, this=0x562ce1aee000) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/os/bluestore/BlueStore.h:2426
#17 BlueStore::mkfs (this=0x562ce1aee000) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/os/bluestore/BlueStore.cc:6101
#18 0x0000562cd53817ce in OSD::mkfs (cct=0x562ce0fa8000, store=store@entry=0x562ce1aee000, fsid=..., whoami=whoami@entry=1) at /usr/src/debug/ceph-14.2.0-299.g167dbbf.el8cp.x86_64/src/osd/OSD.cc:2040
#19 0x0000562cd5279aa1 in main (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/8/ext/new_allocator.h:79

log:

   -14> 2019-04-18 17:13:48.963 7fdec4d69080 10 bluestore(/var/lib/ceph/osd/ceph-1/) _set_blob_size max_blob_size 0x10000
   -13> 2019-04-18 17:13:48.963 7fdec4d69080  1 freelist init
   -12> 2019-04-18 17:13:48.963 7fdec4d69080 10 freelist init blocks 0xba40000
   -11> 2019-04-18 17:13:48.963 7fdec4d69080 10 freelist init blocks_per_key 0x80
   -10> 2019-04-18 17:13:48.963 7fdec4d69080 10 freelist init bytes_per_block 0x4000
    -9> 2019-04-18 17:13:48.963 7fdec4d69080 10 freelist init size 0x2e900000000
    -8> 2019-04-18 17:13:48.963 7fdec4d69080 10 freelist init size 0x2e900000000 bytes_per_block 0x4000 blocks 0xba40000 blocks_per_key 0x80
    -7> 2019-04-18 17:13:48.963 7fdec4d69080 10 freelist _init_misc bytes_per_key 0x200000, key_mask 0xffffffffffe00000
    -6> 2019-04-18 17:13:48.963 7fdec4d69080 10 bluestore(/var/lib/ceph/osd/ceph-1/) _open_alloc bluefs extents 0x[16599900000~1dccd00000]
    -5> 2019-04-18 17:13:48.963 7fdec4d69080 10 fbmap_alloc 0x562ce0c99100 BitmapAllocator 0x2e900000000/4000
    -4> 2019-04-18 17:13:48.968 7fdec4d69080  1 bluestore(/var/lib/ceph/osd/ceph-1/) _open_alloc opening allocation metadata
    -3> 2019-04-18 17:13:48.968 7fdec4d69080 10 freelist enumerate_next start
    -2> 2019-04-18 17:13:48.968 7fdec4d69080 10 freelist enumerate_next 0x4000~2e8ffffc000
    -1> 2019-04-18 17:13:48.968 7fdec4d69080 10 fbmap_alloc 0x562ce0c99100 init_add_free 0x4000~2e8ffffc000
     0> 2019-04-18 17:13:48.974 7fdec4d69080 -1 *** Caught signal (Aborted) **

#9 Updated by Sage Weil almost 5 years ago

looks like an off-by-one perhaps?

(gdb) p l0_pos_end
$4 = 195297280
(gdb) p l0
$5 = std::vector of length 3051520, capacity 3051520 = {18446744073709551614, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 
  18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615...}
(gdb) p pos_e
$6 = <optimized out>
(gdb) p l0_pos_end / d0
$7 = 3051520

it looks like the code lets pos reach pos_e, or l0_pos_end.. which is not inclusive, right? i..e, [l0_pos_start,l0_pos_end) ? because l0_pos_end / d0 == the vector size, so we're accessing the item just past the end.

#10 Updated by Igor Fedotov almost 5 years ago

  • Status changed from Need More Info to In Progress
  • Assignee set to Igor Fedotov

Sage Weil wrote:

looks like an off-by-one perhaps?
[...]
it looks like the code lets pos reach pos_e, or l0_pos_end.. which is not inclusive, right? i..e, [l0_pos_start,l0_pos_end) ? because l0_pos_end / d0 == the vector size, so we're accessing the item just past the end.

@Sage - yep, you're right. Fixing...

Thanks a lot, this is very helpful.
(Looks like I have different gcc and/or stdc++ which doesn't detect out of vector index access)

#11 Updated by Igor Fedotov almost 5 years ago

  • Backport set to mimic, luminous, nautilus

#12 Updated by Igor Fedotov almost 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 27691

#13 Updated by Sage Weil almost 5 years ago

  • Status changed from Fix Under Review to Pending Backport

#14 Updated by Igor Fedotov almost 5 years ago

  • Copied to Backport #39444: luminous: OSD crashed in BitmapAllocator::init_add_free() added

#15 Updated by Igor Fedotov almost 5 years ago

  • Copied to Backport #39445: mimic: OSD crashed in BitmapAllocator::init_add_free() added

#16 Updated by Igor Fedotov almost 5 years ago

  • Copied to Backport #39446: nautilus: OSD crashed in BitmapAllocator::init_add_free() added

#17 Updated by Igor Fedotov over 4 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF