Bug #18698 (closed)

BlueFS FAILED assert(0 == "allocate failed... wtf")

Added by Jared Watts about 7 years ago. Updated over 6 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
BlueStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are seeing this failed assertion and crash using embedded ceph in the rook project: https://github.com/rook/rook.

Tracking issue in rook that has more details: https://github.com/rook/rook/issues/348

This is the Ceph tree we are reproducing with; it is Kraken 11.2.0 plus some fixes: https://github.com/rook/ceph/commits/b7551f136a37e895a4eb7c185a03c58aa260b2a9

Direct link to the failing assertion: https://github.com/rook/ceph/blob/b7551f136a37e895a4eb7c185a03c58aa260b2a9/src/os/bluestore/BlueFS.cc#L1779
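For context, the failure pattern visible in the journal output below is roughly: the BlueFS allocator cannot satisfy a request at min_alloc_size, BlueFS logs the failure, dumps the allocator state (the "bitalloc" lines), and then asserts. A minimal self-contained sketch of that pattern (not the actual BlueFS source; see the link above for the real code):

// Sketch only: NOT the real BlueFS code. Models "allocation fails -> log ->
// dump allocator -> assert" as seen in the journal output below.
#include <cassert>
#include <cstdint>
#include <iostream>

// Stand-in for the BlueFS block allocator; always fails, to simulate a
// fragmented or exhausted device.
struct FakeAllocator {
  int64_t allocate(uint64_t /*want*/, uint64_t /*min_alloc_size*/) { return -1; }
  void dump() { std::cout << "bitalloc: ... allocator state ...\n"; }
};

void allocate_or_assert(FakeAllocator& alloc, uint64_t want, uint64_t min_alloc_size) {
  if (alloc.allocate(want, min_alloc_size) < 0) {
    std::cerr << "bluefs _allocate allocate failed on 0x" << std::hex << want
              << " min_alloc_size 0x" << min_alloc_size << std::dec << std::endl;
    alloc.dump();                            // the "bitalloc" dump in the journal
    assert(0 == "allocate failed... wtf");   // mirrors the assert in the report
  }
}

int main() {
  FakeAllocator alloc;
  allocate_or_assert(alloc, 0x400000, 0x100000);  // sizes taken from the journal below
}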

Here's the journal output for the first instance of the failed assertion:

Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469075 I | osd3: 2017-01-26 02:23:23.469011 7ff1e4c64700 -1 bluefs _allocate allocate failed on 0x400000 min_alloc_size 0x100000
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469107 I | osd3: 2017-01-26 02:23:23.469019 7ff1e4c64700  0 bitmapalloc:dump instance 0xdf7f130
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469113 I | osd3: 2017-01-26 02:23:23.469022 7ff1e4c64700  0 bitalloc:dump_state zone 0 dump start
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469117 I | osd3: 2017-01-26 02:23:23.469024 7ff1e4c64700  0 bitalloc:0:: 0x87870c3861c3861c
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469121 I | osd3: 2017-01-26 02:23:23.469026 7ff1e4c64700  0 bitalloc:1:: 0x3870e1c3870e1870
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469124 I | osd3: 2017-01-26 02:23:23.469027 7ff1e4c64700  0 bitalloc:2:: 0xe1870e1870c3870e
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469127 I | osd3: 2017-01-26 02:23:23.469029 7ff1e4c64700  0 bitalloc:3:: 0x10c2108610861086
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469131 I | osd3: 2017-01-26 02:23:23.469030 7ff1e4c64700  0 bitalloc:4:: 0x1086108430843084
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469134 I | osd3: 2017-01-26 02:23:23.469031 7ff1e4c64700  0 bitalloc:5:: 0x3084308610861086
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469138 I | osd3: 2017-01-26 02:23:23.469032 7ff1e4c64700  0 bitalloc:6:: 0x1086108610e1c386
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469147 I | osd3: 2017-01-26 02:23:23.469034 7ff1e4c64700  0 bitalloc:7:: 0x1c3870e1870e1c38
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469151 I | osd3: 2017-01-26 02:23:23.469035 7ff1e4c64700  0 bitalloc:8:: 0x70c3870c3870e187
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469155 I | osd3: 2017-01-26 02:23:23.469037 7ff1e4c64700  0 bitalloc:9:: 0xffffffffffffffff
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469158 I | osd3: 2017-01-26 02:23:23.469038 7ff1e4c64700  0 bitalloc:10:: 0xffffffffffffffff
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469161 I | osd3: 2017-01-26 02:23:23.469039 7ff1e4c64700  0 bitalloc:11:: 0xffffffffffffffff
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469164 I | osd3: 2017-01-26 02:23:23.469041 7ff1e4c64700  0 bitalloc:12:: 0xffffffffffffffff
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469168 I | osd3: 2017-01-26 02:23:23.469042 7ff1e4c64700  0 bitalloc:13:: 0xffffffffffffffff
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469171 I | osd3: 2017-01-26 02:23:23.469044 7ff1e4c64700  0 bitalloc:14:: 0xffffffffffffffff
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469174 I | osd3: 2017-01-26 02:23:23.469046 7ff1e4c64700  0 bitalloc:15:: 0xffffffffffffffff
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469177 I | osd3: 2017-01-26 02:23:23.469047 7ff1e4c64700  0 bitalloc:dump_state zone 0 dump end
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469188 I | osd3: 2017-01-26 02:23:23.469141 7ff1e4c64700 -1 /home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, mempool::bluefs::vector<bluefs_extent_t>*)' thread 7ff1e4c64700 time 2017-01-26 02:23:23.469053
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469193 I | osd3: /home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueFS.cc: 1779: FAILED assert(0 == "allocate failed... wtf")
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469196 I | osd3:
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469200 I | osd3:  ceph version 11.0.2-8651-gb7551f1 (b7551f136a37e895a4eb7c185a03c58aa260b2a9)
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469205 I | osd3:  1: [0x102c930]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469209 I | osd3:  2: [0x1a7bcbe]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469213 I | osd3:  3: [0x1a8280b]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469218 I | osd3:  4: [0x1a8494e]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469223 I | osd3:  5: [0x1a70e12]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469226 I | osd3:  6: [0x21141e9]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469229 I | osd3:  7: [0x2115570]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469233 I | osd3:  8: [0x20473e3]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469235 I | osd3:  9: [0x2047faa]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469238 I | osd3:  10: [0x171cee0]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469242 I | osd3:  11: [0x1a30b2c]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469245 I | osd3:  12: [0x1a6aa8d]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469248 I | osd3:  13: [0xc2a015]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469250 I | osd3:  14: [0x235602b]
Jan 25 18:23:23 swag rkt[1698]: 2017-01-26 02:23:23.469254 I | osd3:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Manually recreating the call stack from the above addresses gives us:

> addr2line -e ./rookd 0x102c930 0x1a7bcbe 0x1a8280b 0x1a8494e 0x1a70e12 0x21141e9 0x2115570 0x20473e3 0x2047faa 0x171cee0 0x1a30b2c 0x1a6aa8d 0xc2a015 0x235602b
/usr/include/c++/6/sstream:625 (discriminator 3)
/home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueFS.cc:1756
/home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueFS.cc:1336
/usr/x86_64-linux-gnu/include/boost/smart_ptr/intrusive_ptr.hpp:198
/usr/include/c++/6/bits/std_mutex.h:231
/build/external/build/x86_64-linux-gnu/src/rocksdb/./include/rocksdb/status.h:279
/build/external/build/x86_64-linux-gnu/src/rocksdb/./include/rocksdb/status.h:284
/build/external/build/x86_64-linux-gnu/src/rocksdb/./include/rocksdb/status.h:284
/build/external/build/x86_64-linux-gnu/src/rocksdb/db/db_impl.cc:4575
/home/rook/go/src/github.com/rook/rook/ceph/src/kv/RocksDBStore.cc:487
/home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueStore.cc:6748
/home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueStore.h:1496
/build/glibc-jxM2Ev/glibc-2.24/nptl/pthread_create.c:333
??:?

After the initial crash, restarting the OSD immediately runs into the assertion again on the OSD::init path, with the following call stack:

#0  raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x0000000001866d6c in reraise_fatal (signum=6) at /home/rook/go/src/github.com/rook/rook/ceph/src/global/signal_handler.cc:72
#2  handle_fatal_signal (signum=6) at /home/rook/go/src/github.com/rook/rook/ceph/src/global/signal_handler.cc:134
#3  <signal handler called>
#4  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:58
#5  0x00000000022d5b8a in abort ()
#6  0x000000000102cb1b in ceph::__ceph_assert_fail (assertion=assertion@entry=0x27d25bb "0 == \"allocate failed... wtf\"", 
    file=file@entry=0x27d1c10 "/home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueFS.cc", line=line@entry=1779, 
    func=func@entry=0x27d30a0 <BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, mempool::pool_allocator<(mempool::pool_index_t)8, bluefs_extent_t> >*)::__PRETTY_FUNCTION__> "int BlueFS::_allocate(uint8_t, uint64_t, mempool::bluefs::vector<bluefs_extent_t>*)")
    at /home/rook/go/src/github.com/rook/rook/ceph/src/common/assert.cc:78
#7  0x0000000001a7bcbe in BlueFS::_allocate (this=this@entry=0xcb76000, id=0 '\000', len=<optimized out>, ev=<optimized out>)
    at /home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueFS.cc:1779
#8  0x0000000001a8280b in BlueFS::_flush_and_sync_log (this=this@entry=0xcb76000, l=..., want_seq=want_seq@entry=249204, jump_to=jump_to@entry=0)
    at /home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueFS.cc:1335
#9  0x0000000001a8494e in BlueFS::_fsync (this=this@entry=0xcb76000, h=h@entry=0xc710b40, l=...) at /home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueFS.cc:1715
#10 0x0000000001a70e12 in BlueFS::fsync (h=0xc710b40, this=0xcb76000) at /home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueFS.h:403
#11 BlueRocksWritableFile::Sync (this=<optimized out>) at /home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueRocksEnv.cc:220
#12 0x00000000021141e9 in rocksdb::WritableFileWriter::SyncInternal (this=this@entry=0xc70b410, use_fsync=use_fsync@entry=false) at util/file_reader_writer.cc:242
#13 0x0000000002115570 in rocksdb::WritableFileWriter::Sync (this=0xc70b410, use_fsync=<optimized out>) at util/file_reader_writer.cc:213
#14 0x000000000214fcd3 in rocksdb::BuildTable (dbname=..., env=0xc9a2de0, ioptions=..., mutable_cf_options=..., env_options=..., table_cache=0xca5b580, iter=0x7fff3262d568, 
    range_del_iter=..., meta=0x7fff3262d240, internal_comparator=..., int_tbl_prop_collector_factories=0xca645f0, column_family_id=0, column_family_name=..., snapshots=..., 
    earliest_write_conflict_snapshot=72057594037927935, compression=<incomplete type>, compression_opts=..., paranoid_file_checks=false, internal_stats=0xc8d0f00, 
    reason=rocksdb::TableFileCreationReason::kRecovery, event_logger=0xca60b18, job_id=1, io_priority=rocksdb::Env::IO_HIGH, table_properties=0x0, level=-1) at db/builder.cc:176
#15 0x000000000203fa81 in rocksdb::DBImpl::WriteLevel0TableForRecovery (this=this@entry=0xca60000, job_id=1, cfd=cfd@entry=0xca64580, mem=0xca62400, edit=edit@entry=0xcf722b0)
    at db/db_impl.cc:1779
#16 0x000000000204266f in rocksdb::DBImpl::RecoverLogFiles (this=this@entry=0xca60000, log_numbers=..., next_sequence=next_sequence@entry=0x7fff3262e348, 
    read_only=read_only@entry=false) at db/db_impl.cc:1697
#17 0x0000000002043474 in rocksdb::DBImpl::Recover (this=this@entry=0xca60000, column_families=..., read_only=read_only@entry=false, 
    error_if_log_file_exist=error_if_log_file_exist@entry=false, error_if_data_exists_in_logs=error_if_data_exists_in_logs@entry=false) at db/db_impl.cc:1342
#18 0x000000000204408a in rocksdb::DB::Open (db_options=..., dbname=..., column_families=..., handles=handles@entry=0x7fff3262e980, dbptr=dbptr@entry=0xc704dd8)
    at db/db_impl.cc:5938
#19 0x00000000020452f5 in rocksdb::DB::Open (options=..., dbname=..., dbptr=dbptr@entry=0xc704dd8) at db/db_impl.cc:5883
#20 0x0000000001720ef2 in RocksDBStore::do_open (this=0xc704d80, out=..., create_if_missing=<optimized out>)
    at /home/rook/go/src/github.com/rook/rook/ceph/src/kv/RocksDBStore.cc:297
#21 0x0000000001a26463 in BlueStore::_open_db (this=this@entry=0xc8ec000, create=create@entry=false)
    at /home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueStore.cc:3475
#22 0x0000000001a58a2c in BlueStore::mount (this=0xc8ec000) at /home/rook/go/src/github.com/rook/rook/ceph/src/os/bluestore/BlueStore.cc:4034
#23 0x00000000016dd127 in OSD::init (this=0xc8e0000) at /home/rook/go/src/github.com/rook/rook/ceph/src/osd/OSD.cc:2042
#24 0x00000000014c1ec7 in cephd_osd (argc=<optimized out>, argv=<optimized out>) at /home/rook/go/src/github.com/rook/rook/ceph/src/ceph_osd.cc:611
#25 0x0000000000dfce65 in cephd_run_osd (argc=<optimized out>, argv=<optimized out>) at /home/rook/go/src/github.com/rook/rook/ceph/src/libcephd/libcephd.cc:237
#26 0x0000000000c18a6a in _cgo_49d41b80151a_Cfunc_cephd_run_osd (v=0xc4201dbb28) at /home/rook/go/src/github.com/rook/rook/pkg/cephmgr/cephd/cephd.go:204
#27 0x00000000004f3fe0 in runtime.asmcgocall () at /usr/local/go/src/runtime/asm_amd64.s:590
#28 0x0000000000498d5d in runtime.cgocall (fn=0x7fff32632860, arg=0x3271580 <runtime.g0>, ~r2=845359184) at /usr/local/go/src/runtime/cgocall.go:115
#29 0x0000000003271500 in runtime.work ()
#30 0x00007fff32632860 in ?? ()
#31 0x0000000003271580 in ?? ()
#32 0x00007fff32632850 in ?? ()
#33 0x00000000004c6bd4 in runtime.mstart () at /usr/local/go/src/runtime/proc.go:1096
#34 0x00000000004f2135 in runtime.rt0_go () at /usr/local/go/src/runtime/asm_amd64.s:156
#35 0x000000000000000b in ?? ()
#36 0x00007fff326329b8 in ?? ()
#37 0x000000000000000b in ?? ()
#38 0x00007fff326329b8 in ?? ()
#39 0x0000000000400598 in __rela_iplt_start ()
#40 0x00000000022c4b06 in generic_start_main ()
#41 0x00000000022c4d8e in __libc_start_main ()
#42 0x000000000049351a in _start ()

On the repro machine, the OSDs are configured to use BlueStore, with each block partition on its own HDD and all WAL/DB partitions on a single NVMe device, like so:

$ lsblk -o name,type,size,partlabel,partuuid
NAME         TYPE   SIZE PARTLABEL                            PARTUUID
sde          disk   3.7T                                      
`-sde1       part   3.7T ROOK-OSD7-BLOCK                      916755c5-5c12-45ca-a3b0-7dee933411d2
sdf          disk   3.7T                                      
`-sdf1       part   3.7T ROOK-OSD2-BLOCK                      9228fab8-2165-4eaa-88a3-bc3802c6fdf5
sdg          disk   3.7T                                      
`-sdg1       part   3.7T ROOK-OSD3-BLOCK                      b8f5f7aa-e9ec-41ac-81d8-2659e2f67512
sdh          disk   3.7T                                      
`-sdh1       part   3.7T ROOK-OSD8-BLOCK                      1a819a7c-8678-4fb9-b27a-dacd18e451a7
sdi          disk   3.7T                                      
`-sdi1       part   3.7T ROOK-OSD9-BLOCK                      849e9642-1f4d-4352-9f9c-da565a7acc6e
sdj          disk   3.7T                                      
`-sdj1       part   3.7T ROOK-OSD4-BLOCK                      c11246b2-9193-4314-bc60-5ba94c1f76cc
sdk          disk   3.7T                                      
`-sdk1       part   3.7T ROOK-OSD5-BLOCK                      0362827a-2dc9-4a2f-94a8-e6aedf356e82
sdl          disk   3.7T                                      
`-sdl1       part   3.7T ROOK-OSD6-BLOCK                      fede317b-6f42-4dc4-a50f-cfc8c4b7bcea
sdm          disk   3.7T                                      
`-sdm1       part   3.7T ROOK-OSD0-BLOCK                      cf6594a1-7e43-463c-8241-efa99e14c0e5
sdn          disk   3.7T                                      
`-sdn1       part   3.7T ROOK-OSD1-BLOCK                      ed95ec5c-3ac2-43c1-b0b1-beb839eb376f
sdo          disk   3.7T                                      
`-sdo1       part   3.7T ROOK-OSD10-BLOCK                     7e29beb2-71e4-4af5-86f0-0838431b9285
sdp          disk   3.7T                                      
`-sdp1       part   3.7T ROOK-OSD11-BLOCK                     1b151575-1a52-46f8-a424-333f8da61f0a
loop0        loop 233.4M                                      
nvme0n1      disk 372.6G                                      
|-nvme0n1p1  part   576M ROOK-OSD0-WAL                        ac56ce5f-d880-4334-be35-61938d29a5d3
|-nvme0n1p2  part    30G ROOK-OSD0-DB                         f8425d9e-7e79-4272-9777-8e343053f694
|-nvme0n1p3  part   576M ROOK-OSD1-WAL                        5ed92eac-fae0-42d2-9bf4-fab2a0592eb6
|-nvme0n1p4  part    30G ROOK-OSD1-DB                         95c58666-bacd-4dd8-a15b-897a217018b9
|-nvme0n1p5  part   576M ROOK-OSD2-WAL                        5a9a3c43-865f-44bf-9a7c-4dd30fbb85ac
|-nvme0n1p6  part    30G ROOK-OSD2-DB                         d7d09b47-6051-4b6a-8dc4-20985e19b61c
|-nvme0n1p7  part   576M ROOK-OSD3-WAL                        9d81f282-4b5b-418f-9123-9503bfd0912c
|-nvme0n1p8  part    30G ROOK-OSD3-DB                         9e8eb447-883e-4dc8-b50d-d593b60325df
|-nvme0n1p9  part   576M ROOK-OSD4-WAL                        954896bb-4fe5-46b9-a361-54d165d57a1c
|-nvme0n1p10 part    30G ROOK-OSD4-DB                         7d7aff85-0145-414c-81a4-0aabebcad406
|-nvme0n1p11 part   576M ROOK-OSD5-WAL                        febd37af-6771-4bae-a68e-dddfd0d58edd
|-nvme0n1p12 part    30G ROOK-OSD5-DB                         49f749a5-590c-4b75-8ebd-796ed2818f82
|-nvme0n1p13 part   576M ROOK-OSD6-WAL                        cb40e1b0-efee-4b65-acc2-1141c217d3ee
|-nvme0n1p14 part    30G ROOK-OSD6-DB                         f8226b6c-9ba7-4d39-8748-f61ca50fd2f3
|-nvme0n1p15 part   576M ROOK-OSD7-WAL                        cc365c84-5842-4438-a39e-03c96af294b5
|-nvme0n1p16 part    30G ROOK-OSD7-DB                         8a2aa293-04d5-4516-838e-8200bc314aa5
|-nvme0n1p17 part   576M ROOK-OSD8-WAL                        2ea2676b-58ed-4907-bc9e-3cdf66fdb26a
|-nvme0n1p18 part    30G ROOK-OSD8-DB                         39997dd7-4e41-4a1d-aea7-620d3ead6e38
|-nvme0n1p19 part   576M ROOK-OSD9-WAL                        aad24cf5-652f-4046-8159-66c018cc3865
|-nvme0n1p20 part    30G ROOK-OSD9-DB                         dc8f3373-d9f0-4395-bd35-9078e02b14f9
|-nvme0n1p21 part   576M ROOK-OSD10-WAL                       50ebeb5a-83be-4795-9c5f-1ec53d96039b
|-nvme0n1p22 part    30G ROOK-OSD10-DB                        09a57ea0-5b26-4acf-a698-7ab9d1787a1d
|-nvme0n1p23 part   576M ROOK-OSD11-WAL                       c112ba32-bda7-433b-bda6-668ca2bc7155
`-nvme0n1p24 part    30G ROOK-OSD11-DB                        c15304db-a39c-4adc-9b40-bdad6d144a9c

An example ceph config file for one of the OSDs:

# cat rookcluster.config 
[global]
enable experimental unrecoverable data corrupting features = bluestore rocksdb
fsid                                                       = be1427c9-a0eb-491e-9288-daaf4341e5eb
run dir                                                    = /var/lib/rook/osd0
mon initial members                                        = mon0
log file                                                   = /dev/stdout
mon cluster log file                                       = /dev/stdout
mon keyvaluedb                                             = rocksdb
debug default                                              = 0
debug rados                                                = 0
debug mon                                                  = 0
debug osd                                                  = 0
debug bluestore                                            = 5/5
debug filestore                                            = 0
debug journal                                              = 0
debug leveldb                                              = 0
filestore_omap_backend                                     = rocksdb
osd pg bits                                                = 11
osd pgp bits                                               = 11
osd pool default size                                      = 1
osd pool default min size                                  = 1
osd pool default pg num                                    = 100
osd pool default pgp num                                   = 100
osd objectstore                                            = bluestore
rbd_default_features                                       = 3
crushtool                                                  = 
debug bluefs                                               = 5/5

[osd.0]
keyring                  = /var/lib/rook/osd0/keyring
bluestore block wal path = /dev/disk/by-partuuid/ac56ce5f-d880-4334-be35-61938d29a5d3
bluestore block db path  = /dev/disk/by-partuuid/f8425d9e-7e79-4272-9777-8e343053f694
bluestore block path     = /dev/disk/by-partuuid/cf6594a1-7e43-463c-8241-efa99e14c0e5

[mon.mon0]
name     = mon0
mon addr = 10.20.235.7:6790

Full ceph log

The full journal output (ceph log) can be found at: https://gist.github.com/jbw976/3bf47564f1d1ef71e8c41fa7121b793e


Files

18698.osd.log (394 KB) - Aleksei Gutikov, 12/20/2017 01:05 PM
Actions #1

Updated by Jared Watts about 7 years ago

I also have core files and full symbols, but those are hundreds of MBs. I'd be happy to share them as needed.

Actions #2

Updated by Jared Watts about 7 years ago

Running the repro scenario with `bluefs_allocator = stupid` did not reproduce the issue after running all night, so this does seem to be a bug in the BlueFS bitmap allocator.
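For reference, a minimal ceph.conf sketch of this workaround (the [osd] section placement is an assumption; the report only gives the option name and value):

[osd]
# Assumed placement: switch BlueFS from the bitmap allocator back to the
# "stupid" allocator, which did not reproduce the crash in this test.
bluefs_allocator = stupid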

Actions #3

Updated by François Blondel about 7 years ago

Hi!
I also hit this issue. Adding "bluefs_allocator = stupid" to my /etc/ceph/ceph.conf worked around it for me as well.
The issue appeared when I started using a separate block.wal block device.

Question: is this the expected behaviour?
Please let me know if I can run any tests to help with debugging.

Actions #4

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Can't reproduce

I haven't seen this in any of our QA... is it still happening for you? Which versions?

Actions #5

Updated by François Blondel almost 7 years ago

Hi,

I tested with Kraken v11.2.0 again: I removed the bluefs_allocator = stupid setting and restarted all my OSDs, and the issue was still there for some of them.

I then upgraded from Kraken to Luminous 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e). The same issue happened during the upgrade (when restarting the OSDs after the apt upgrade); I had to put bluefs_allocator = stupid back into ceph.conf for some OSDs to get them to start again. That worked.
I then removed bluefs_allocator = stupid from the config again and restarted the affected OSDs.
For now it is running fine with Luminous 12.0.2.

Actions #6

Updated by François Blondel almost 7 years ago

Hi,
in fact, we are still hitting this issue (after some days) after upgrading to Luminous 12.0.2.
Same errors in the log files. Fixed it the same way, using bluefs_allocator = stupid.

OSDs were created with Kraken 11.2.0 using:
ceph-disk prepare --zap-disk --bluestore --cluster ceph --cluster-uuid 4b7024fa-485d-4aaf-b3e9-4f955335ca98 --block.wal /dev/sda /dev/sdd

Let me know what I should do/test!

Actions #7

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (107)
Actions #8

Updated by Greg Farnum almost 7 years ago

  • Component(RADOS) BlueStore added
Actions #9

Updated by Aleksei Gutikov over 6 years ago

Reproduced with 12.2.2 during deep scrubbing after 7 days of workload.

36 SSDs, 200G each, containing 400G of RBDs and 500G of small files of about 5 KB each.
Almost all SSDs are more than 50% used.

After 7 days of testing, deep scrubbing started.
As I understand it, that leads to BlueStore rebalancing,
and then the crash occurs:

bluestore(/var/lib/ceph/osd/ceph-45) _balance_bluefs_freespace allocate failed on 0x76c00000 min_alloc_size 0x1000
...
... very very long stupidalloc dump 
...
 2: (()+0x11390) [0x7fb052f47390]
 3: (gsignal()+0x38) [0x7fb051ee2428]
 4: (abort()+0x16a) [0x7fb051ee402a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x5641f18d9a2e]
 6: (BlueStore::_balance_bluefs_freespace(std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)+0x1b21) [0x5641f176b5c1]
 7: (BlueStore::_kv_sync_thread()+0x1ac0) [0x5641f176e040]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x5641f17b1f8d]
 9: (()+0x76ba) [0x7fb052f3d6ba]
10: (clone()+0x6d) [0x7fb051fb43dd]

As I understand it, _balance_bluefs_freespace tries to allocate space (~2G) for compacted metadata, but due to fragmentation it can't,
and it does not handle that failure.

Will check if it can be eliminated by setting bluestore_max_alloc_size...
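For reference, a sketch of how that could be tried in ceph.conf (the option exists in BlueStore, but the value and its effect here are assumptions, not something tested in this report):

[osd]
# Hypothetical experiment from the comment above: cap the size of a single
# BlueStore allocation so very large contiguous extents are not required.
bluestore_max_alloc_size = 1048576  # 1 MiB; untested, illustrative value only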

Actions #10

Updated by Igor Fedotov over 6 years ago

There is PR #18494, which addresses an issue with symptoms similar to the ones reported in comment #9 (assert during _balance_bluefs_freespace). Analysis of that issue can be found there as well. Unfortunately it is still pending, since it is not an ideal solution.
Not sure the original issue has the same root cause, though.
