Project

General

Profile

Bug #40080

Bitmap allocator return duplicate entries which cause interval_set assert

Added by Xiaoxi Chen about 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
mimic, luminous, nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

0x941300000~200000 were returned several times by allocator , later when allocate_bluefs_freespace give duplicate entries to interval_set, interval_set assert.

--- STACK---

(gdb) bt
#0 raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00005631fb8551f3 in reraise_fatal (signum=6) at ./src/global/signal_handler.cc:81
#2 handle_fatal_signal (signum=6) at ./src/global/signal_handler.cc:298
#3 <signal handler called>
#4 _GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5 0x00007f86b5416801 in __GI_abort () at abort.c:79
#6 0x00005631fb12b831 in ceph::
_ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>,
func=0x5631fc027de0 <interval_set<unsigned long, std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > > >::insert(unsigned long, unsigned long, unsigned long*, unsigned long*)::__PRETTY_FUNCTION__> "void interval_set<T, Map>::insert(T, T, T*, T*) [with T = long unsigned int; Map = std::map<long unsigned int, long unsigned int>]") at ./src/common/assert.cc:73
#7 0x00005631fb12b9bb in ceph::__ceph_assert_fail (ctx=...) at ./src/common/assert.cc:78
#8 0x00005631fb459301 in interval_set<unsigned long, std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > > >::insert (this=0x5632071af360, start=<optimized out>, len=2097152, pstart=<optimized out>, plen=<optimized out>) at ./src/include/interval_set.h:490
#9 0x00005631fb6d935e in BlueStore::allocate_bluefs_freespace (this=0x5632071af000, size=<optimized out>, extents_out=0x7ffea3e8b0a0)
at ./src/os/bluestore/BlueStore.cc:5594
#10 0x00005631fb7fc299 in BlueFS::_expand_slow_device (this=this@entry=0x563207252000, need=27802992640, need@entry=153092096,
extents=std::vector of length 1160, capacity 2048 = {...}) at ./src/os/bluestore/BlueFS.cc:2377
#11 0x00005631fb7fd03b in BlueFS::_allocate (this=this@entry=0x563207252000, id=2 '\002', len=len@entry=152453959, node=node@entry=0x5632088ace58)
at ./src/os/bluestore/BlueFS.cc:2456
#12 0x00005631fb7fcd02 in BlueFS::_allocate (this=this@entry=0x563207252000, id=<optimized out>, len=len@entry=152453959, node=0x5632088ace58)
at ./src/os/bluestore/BlueFS.cc:2448
#13 0x00005631fb7fd6a2 in BlueFS::_flush_range (this=this@entry=0x563207252000, h=h@entry=0x5632074d5cc0, offset=offset@entry=0, length=length@entry=152453959)
at ./src/os/bluestore/BlueFS.cc:2039
#14 0x00005631fb7ff71c in BlueFS::_flush (this=this@entry=0x563207252000, h=h@entry=0x5632074d5cc0, force=force@entry=false) at ./src/os/bluestore/BlueFS.cc:2255
#15 0x00005631fb81ed6d in BlueFS::flush (h=0x5632074d5cc0, this=0x563207252000) at ./src/os/bluestore/BlueFS.h:483
#16 BlueRocksWritableFile::Flush (this=<optimized out>) at ./src/os/bluestore/BlueRocksEnv.cc:209
#17 0x00005631fbe82ae9 in rocksdb::WritableFileWriter::Flush (this=this@entry=0x56320850ad80) at ./src/rocksdb/util/file_reader_writer.cc:314
#18 0x00005631fbe83cdb in rocksdb::WritableFileWriter::Sync (this=this@entry=0x56320850ad80, use_fsync=<optimized out>) at ./src/rocksdb/util/file_reader_writer.cc:350
#19 0x00005631fbeb67e1 in rocksdb::BuildTable (dbname="db", env=0x5632074ddd60, ioptions=..., mutable_cf_options=..., env_options=..., table_cache=0x5632074aff40,
iter=0x7ffea3e8cbc0, range_del_iter=std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice>> = {...}, meta=0x7ffea3e8c850, internal_comparator=...,
int_tbl_prop_collector_factories=0x563207508770, column_family_id=0, column_family_name="default", snapshots=std::vector of length 0, capacity 0,
earliest_write_conflict_snapshot=72057594037927935, snapshot_checker=0x0, compression=<incomplete type>, compression_opts=..., paranoid_file_checks=false,
internal_stats=0x5632071658c0, reason=rocksdb::TableFileCreationReason::kRecovery, event_logger=0x5632072445c8, job_id=1, io_priority=rocksdb::Env::IO_HIGH,
table_properties=0x0, level=-1, creation_time=1559202550, oldest_key_time=3, write_hint=rocksdb::Env::WLTH_NOT_SET) at ./src/rocksdb/db/builder.cc:193
#20 0x00005631fbd1c897 in rocksdb::DBImpl::WriteLevel0TableForRecovery (this=this@entry=0x563207243600, job_id=1, cfd=cfd@entry=0x563207508700, mem=0x56320db32000,
edit=edit@entry=0x56320637d490) at ./src/rocksdb/db/db_impl_open.cc:995
#21 0x00005631fbd1eb41 in rocksdb::DBImpl::RecoverLogFiles (this=this@entry=0x563207243600, log_numbers=std::vector of length 2, capacity 2 = {...},
next_sequence=next_sequence@entry=0x7ffea3e8d980, read_only=read_only@entry=false) at ./src/rocksdb/db/db_impl_open.cc:854
#22 0x00005631fbd1f817 in rocksdb::DBImpl::Recover (this=this@entry=0x563207243600, column_families=std::vector of length 1, capacity 1 = {...},
read_only=read_only@entry=false, error_if_log_file_exist=error_if_log_file_exist@entry=false, error_if_data_exists_in_logs=error_if_data_exists_in_logs@entry=false)
at ./src/rocksdb/db/db_impl_open.cc:470
#23 0x00005631fbd206e8 in rocksdb::DBImpl::Open (db_options=..., dbname="db", column_families=std::vector of length 1, capacity 1 = {...},
handles=handles@entry=0x7ffea3e8e140, dbptr=dbptr@entry=0x56320636c7c0, seq_per_batch=false, batch_per_txn=true) at ./src/rocksdb/db/db_impl_open.cc:1127
#24 0x00005631fbd21fd4 in rocksdb::DB::Open (db_options=..., dbname="db", column_families=std::vector of length 1, capacity 1 = {...},
handles=handles@entry=0x7ffea3e8e140, dbptr=dbptr@entry=0x56320636c7c0) at ./src/rocksdb/db/db_impl_open.cc:1062
#25 0x00005631fb788d02 in RocksDBStore::do_open (this=0x56320636c700, out=..., create_if_missing=false, open_readonly=<optimized out>, cfs=0x7ffea3e8eaf0)
at ./src/kv/RocksDBStore.cc:569
#26 0x00005631fb6d7520 in BlueStore::_open_db (this=this@entry=0x5632071af000, create=create@entry=false, to_repair_db=to_repair_db@entry=false,
read_only=read_only@entry=false) at ./src/os/bluestore/BlueStore.cc:5520
#27 0x00005631fb6ee675 in BlueStore::_open_db_and_around (this=this@entry=0x5632071af000, read_only=read_only@entry=false) at ./src/os/bluestore/BlueStore.cc:5259
^[[A#28 0x00005631fb73915a in BlueStore::_mount (this=0x5632071af000, kv_only=<optimized out>, open_db=<optimized out>) at ./src/os/bluestore/BlueStore.cc:6553
#29 0x00005631fb289827 in OSD::init (this=0x563207348000) at ./src/osd/OSD.cc:2870
#30 0x00005631fb133185 in main (argc=<optimized out>, argv=<optimized out>) at ./src/ceph_osd.cc:713

------ LOG -----

2019-05-30 01:15:58.167 7f8833480f00 30 freelist no more set bits in 0x941300000
2019-05-30 01:16:01.227 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.239 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.239 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.243 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.243 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.243 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.243 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.255 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.255 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.263 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.263 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.271 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.271 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.279 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.279 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.283 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.283 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.291 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.291 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.295 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.295 7f8833480f00 10 fbmap_alloc 0x55a3813f5200 allocate 0x941300000~200000/100000,0,0
2019-05-30 01:16:01.303 7f8833480f00 1 bluestore(/var/lib/ceph/osd/rno_ceph_cal-87) allocate_bluefs_freespace gifting 0x941300000~200000 to bluefs
2019-05-30 01:16:01.303 7f8833480f00 1 bluestore(/var/lib/ceph/osd/rno_ceph_cal-87) allocate_bluefs_freespace gifting 0x941300000~200000 to bluefs

osd_87_debug_osd_10.tar.gz (334 KB) Xiaoxi Chen, 05/30/2019 10:26 AM


Related issues

Related to bluestore - Bug #37282: rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch code = 2 Need More Info
Copied to bluestore - Backport #40422: luminous: Bitmap allocator return duplicate entries which cause interval_set assert Resolved
Copied to bluestore - Backport #40423: mimic: Bitmap allocator return duplicate entries which cause interval_set assert Resolved
Copied to bluestore - Backport #40424: nautilus: Bitmap allocator return duplicate entries which cause interval_set assert Resolved

History

#1 Updated by Xiaoxi Chen about 1 year ago

Full log with debug_bluestore=30, debug_bluefs=10 are available , but too big to upload.

#2 Updated by Igor Fedotov about 1 year ago

Xiaoxi, this is v14.2.0, right?

14.2.1 & .2 have some bmap allocator's fixes that you might want to apply:
https://github.com/ceph/ceph/pull/27740
https://github.com/ceph/ceph/pull/27139

#3 Updated by Xiaoxi Chen about 1 year ago

@Igor,

I noticed that during debugging, however, 14.2.1 crashed the same way.
-5> 2019-05-30 02:23:05.375 7fd431d6cf00  5 bluestore(/var/lib/ceph/osd/rno_ceph_cal-87) allocate_bluefs_freespace gifting 0x940d00000~100000 to bluefs
-4> 2019-05-30 02:23:05.375 7fd431d6cf00 5 bluestore(/var/lib/ceph/osd/rno_ceph_cal-87) allocate_bluefs_freespace gifting 0x941100000~100000 to bluefs
-3> 2019-05-30 02:23:05.375 7fd431d6cf00 5 bluestore(/var/lib/ceph/osd/rno_ceph_cal-87) allocate_bluefs_freespace gifting 0x941300000~200000 to bluefs
-2> 2019-05-30 02:23:05.375 7fd431d6cf00 5 bluestore(/var/lib/ceph/osd/rno_ceph_cal-87) allocate_bluefs_freespace gifting 0x941300000~200000 to bluefs
-1> 2019-05-30 02:23:05.631 7fd431d6cf00 -1 /build/ceph-14.2.1/src/include/interval_set.h: In function 'void interval_set&lt;T, Map&gt;::insert(T, T, T*, T*) [with T = long unsigned int; Map = std::map&lt;long unsigned int, long unsigned int&gt;]' thread 7fd431d6cf00 time 2019-05-30 02:23:05.382766
/build/ceph-14.2.1/src/include/interval_set.h: 490: FAILED ceph_assert(p->first > start+len)

actually Stupid allocator crashed in another way where it assert for (len > 0 ) and which lead me to the two changes you are pointing, especially the uint32->64.

-3> 2019-05-30 02:22:31.347 7f60d3fbff00 10 stupidalloc 0x0x55e0ae1de3a0 allocate_int want_size 0x79300000 alloc_unit 0x100000 hint 0x0
-2> 2019-05-30 02:22:31.347 7f60d3fbff00 5 bluestore(/var/lib/ceph/osd/rno_ceph_cal-87) allocate_bluefs_freespace gifting 0xdb8700000~0 to bluefs
-1> 2019-05-30 02:22:31.359 7f60d3fbff00 -1 /build/ceph-14.2.1/src/include/interval_set.h: In function 'void interval_set&lt;T, Map&gt;::insert(T, T, T*, T*) [with T = long unsigned int; Map = std::map&lt;long unsigned int, long unsigned int&gt;]' thread 7f60d3fbff00 time 2019-05-30 02:22:31.350750
/build/ceph-14.2.1/src/include/interval_set.h: 447: FAILED ceph_assert(len > 0)

#4 Updated by Igor Fedotov about 1 year ago

  • Project changed from Ceph to bluestore

#6 Updated by Xiaoxi Chen about 1 year ago

I do see some very large directory in our fs…however the rebuilt-osd crashed again which bring down 8 PGs and as well as the filesystem….sucks..

It seems related with a huge allocation from bluefs
2019-05-30 01:16:01.223 7f8833480f00 10 bluefs _expand_slow_device expanding slow device by 0x679300000

and my guess is in somewhere we allocate something more than 4GB(like 8GB) and the higher 32bit was truncated in bluestore_interval_t, which make the len to 0?

Changing bluestore_bluefs_gift_ratio to 0.002 which make the allocation from 25GB to 2.5GB mitigate the issue at the moment

#7 Updated by Igor Fedotov about 1 year ago

  • Status changed from New to In Progress

Managed to reproduce the same by replaying the log. WIP on the fix.

#8 Updated by Igor Fedotov about 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 28496

#9 Updated by Igor Fedotov about 1 year ago

  • Backport set to mimic, luminous, nautilus

#10 Updated by Kefu Chai about 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#11 Updated by Igor Fedotov about 1 year ago

  • Copied to Backport #40422: luminous: Bitmap allocator return duplicate entries which cause interval_set assert added

#12 Updated by Igor Fedotov about 1 year ago

  • Copied to Backport #40423: mimic: Bitmap allocator return duplicate entries which cause interval_set assert added

#13 Updated by Igor Fedotov about 1 year ago

  • Copied to Backport #40424: nautilus: Bitmap allocator return duplicate entries which cause interval_set assert added

#14 Updated by Sage Weil 12 months ago

  • Related to Bug #37282: rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch code = 2 added

#15 Updated by Igor Fedotov 12 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF