Project

General

Profile

Bug #21087

BlueFS Becomes Totally Inaccessible when Failed to Allocate

Added by WANG Guoqin about 2 years ago. Updated almost 2 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Target version:
Start date:
08/23/2017
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

After trying to insert thousands of missing osdmaps to the bluefs partitions, they now appear to be full and not accessible by ANY of the tools available.

Log:

   -34> 2017-08-23 22:09:46.011201 7f2b666a6e00  4 rocksdb: [/build/ceph-12.1.4/src/rocksdb/db/version_set.cc:2395] Creating manifest 3506

   -33> 2017-08-23 22:09:46.011204 7f2b666a6e00 10 bluefs open_for_write db/MANIFEST-003506
   -32> 2017-08-23 22:09:46.011209 7f2b666a6e00 20 bluefs open_for_write mapping db/MANIFEST-003506 to bdev 1
   -31> 2017-08-23 22:09:46.011214 7f2b666a6e00 10 bluefs open_for_write h 0x564cc4553860 on file(ino 5120 size 0x0 mtime 2017-08-23 22:09:46.011209 bdev 1 extents [])
   -30> 2017-08-23 22:09:46.011233 7f2b666a6e00 10 bluefs _flush 0x564cc4553860 ignoring, length 35 < min_flush_size 524288
   -29> 2017-08-23 22:09:46.011266 7f2b666a6e00 10 bluefs _flush 0x564cc4553860 ignoring, length 3433 < min_flush_size 524288
   -28> 2017-08-23 22:09:46.011286 7f2b666a6e00 10 bluefs _flush 0x564cc4553860 ignoring, length 3453 < min_flush_size 524288
   -27> 2017-08-23 22:09:46.011287 7f2b666a6e00 10 bluefs _flush 0x564cc4553860 ignoring, length 3453 < min_flush_size 524288
   -26> 2017-08-23 22:09:46.011289 7f2b666a6e00 10 bluefs _fsync 0x564cc4553860 file(ino 5120 size 0x0 mtime 2017-08-23 22:09:46.011209 bdev 1 extents [])
   -25> 2017-08-23 22:09:46.011294 7f2b666a6e00 10 bluefs _flush 0x564cc4553860 0x0~d7d to file(ino 5120 size 0x0 mtime 2017-08-23 22:09:46.011209 bdev 1 extents [])
   -24> 2017-08-23 22:09:46.011298 7f2b666a6e00 10 bluefs _flush_range 0x564cc4553860 pos 0x0 0x0~d7d to file(ino 5120 size 0x0 mtime 2017-08-23 22:09:46.011209 bdev 1 extents [])
   -23> 2017-08-23 22:09:46.011303 7f2b666a6e00 10 bluefs _allocate len 0xd7d from 1
   -22> 2017-08-23 22:09:46.011305 7f2b666a6e00 10 bitmapalloc:reserve instance 94887706734432 num_used 20377 total 20378
   -21> 2017-08-23 22:09:46.011308 7f2b666a6e00 10 bitmapalloc:allocate instance 94887706734432 want_size 1048576 alloc_unit 1048576 hint 0
   -20> 2017-08-23 22:09:46.011320 7f2b666a6e00 20 bluefs _flush_range dirty_seq = 651752 (was clean)
   -19> 2017-08-23 22:09:46.011323 7f2b666a6e00 20 bluefs _flush_range file now file(ino 5120 size 0xd7d mtime 2017-08-23 22:09:46.011319 bdev 1 extents [1:0x1e2400000+100000])
   -18> 2017-08-23 22:09:46.011327 7f2b666a6e00 20 bluefs _flush_range in 1:0x1e2400000+100000 x_off 0x0
   -17> 2017-08-23 22:09:46.011331 7f2b666a6e00 20 bluefs _flush_range caching tail of 0xd7d and padding block with 0x283
   -16> 2017-08-23 22:09:46.011372 7f2b666a6e00 20 bluefs _flush_range h 0x564cc4553860 pos now 0xd7d
   -15> 2017-08-23 22:09:46.011377 7f2b666a6e00 10 bluefs _claim_completed_aios got 1 aios
   -14> 2017-08-23 22:09:46.011379 7f2b666a6e00 10 bluefs wait_for_aio 0x564cc4553860
   -13> 2017-08-23 22:09:46.011532 7f2b666a6e00 10 bluefs wait_for_aio 0x564cc4553860 done in 0.000151
   -12> 2017-08-23 22:09:46.011538 7f2b666a6e00 20 bluefs flush_bdev
   -11> 2017-08-23 22:09:46.020780 7f2b666a6e00 20 bluefs _fsync file metadata was dirty (651752) on file(ino 5120 size 0xd7d mtime 2017-08-23 22:09:46.011319 bdev 1 extents [1:0x1e2400000+100000]), flushing log
   -10> 2017-08-23 22:09:46.020810 7f2b666a6e00 20 bluefs _flush_and_sync_log 1 dirty_files
    -9> 2017-08-23 22:09:46.020813 7f2b666a6e00 20 bluefs _flush_and_sync_log   op_file_update file(ino 5120 size 0xd7d mtime 2017-08-23 22:09:46.011319 bdev 1 extents [1:0x1e2400000+100000])
    -8> 2017-08-23 22:09:46.020818 7f2b666a6e00 10 bluefs _flush_and_sync_log txn(seq 651752 len 0x5e crc 0x417b62b0)
    -7> 2017-08-23 22:09:46.020822 7f2b666a6e00 10 bluefs _flush_and_sync_log allocating more log runway (0xffffffffff55a000 remaining)
    -6> 2017-08-23 22:09:46.020824 7f2b666a6e00 10 bluefs _allocate len 0x400000 from 0
    -5> 2017-08-23 22:09:46.020826 7f2b666a6e00 10 bluefs _allocate len 0x400000 from 1
    -4> 2017-08-23 22:09:46.020827 7f2b666a6e00 10 bitmapalloc:reserve instance 94887706734432 num_used 20378 total 20378
    -3> 2017-08-23 22:09:46.020829 7f2b666a6e00  1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x0; fallback to bdev 2
    -2> 2017-08-23 22:09:46.020831 7f2b666a6e00 10 bluefs _allocate len 0x400000 from 2
    -1> 2017-08-23 22:09:46.020832 7f2b666a6e00 -1 bluefs _allocate failed to allocate 0x400000 on bdev 2, dne
     0> 2017-08-23 22:09:46.031080 7f2b666a6e00 -1 /build/ceph-12.1.4/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7f2b666a6e00 time 2017-08-23 22:09:46.020835
/build/ceph-12.1.4/src/os/bluestore/BlueFS.cc: 1394: FAILED assert(r == 0)

 ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x564cba859042]
 2: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x183d) [0x564cba7def6d]
 3: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x3a1) [0x564cba7e05d1]
 4: (BlueRocksWritableFile::Sync()+0x63) [0x564cba7f1c03]
 5: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x31c) [0x564cbac2276c]
 6: (rocksdb::WritableFileWriter::Sync(bool)+0x132) [0x564cbac25342]
 7: (rocksdb::SyncManifest(rocksdb::Env*, rocksdb::ImmutableDBOptions const*, rocksdb::WritableFileWriter*)+0x1c3) [0x564cbac28f73]
 8: (rocksdb::VersionSet::LogAndApply(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, rocksdb::autovector<rocksdb::VersionEdit*, 8ul> const&, rocksdb::InstrumentedMutex*, rocksdb::Directory*, bool, rocksdb::ColumnFamilyOptions const*)+0x173e) [0x564cbaba899e]
 9: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool)+0x1c48) [0x564cbab71e28]
 10: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x8c4) [0x564cbab72f34]
 11: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xedc) [0x564cbab742fc]
 12: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x6b1) [0x564cbab75b91]
 13: (RocksDBStore::do_open(std::ostream&, bool)+0x8ff) [0x564cba75d19f]
 14: (BlueStore::_open_db(bool)+0xf73) [0x564cba6ea3f3]
 15: (BlueStore::_mount(bool)+0x38d) [0x564cba71a2cd]
 16: (main()+0x14ab) [0x564cba1a672b]
 17: (__libc_start_main()+0xf1) [0x7f2b63b0c3f1]
 18: (_start()+0x29) [0x564cba233a69]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The problem was, now that almost all tools are refusing to access this partition, I literally have no way to diagnose the problem, nor to modify data to solve the problem (like deleting something to make room).

AFAIK these commands don’t work with the same log shown above,
  • ceph-osd with any param
  • ceph-objectstore-tool with —op info, log, fsck, dup, export, set-osdmap, list-pgs, etc.
  • ceph-bluestore-tool with —command fsck or deep fsck
  • ceph-bluestore-tool with —command bluefs-export or show-label work, but they won’t help

So that my attempt to either remove something to make room, or to export the pgs to a new place, they simply don’t work anyway.

The only thing I could have done was to extend the partition, hopefully to make more room. But the result was even stranger. After extending the partition from 10G to 20G, the line found in the log,

    -4> 2017-08-23 22:09:46.020827 7f2b666a6e00 10 bitmapalloc:reserve instance 94887706734432 num_used 10378 total 10378

simply became
    -4> 2017-08-23 22:09:46.020827 7f2b666a6e00 10 bitmapalloc:reserve instance 94887706734432 num_used 20378 total 20378

with 10000 more blocks in total, and all of them being used.

Well, I’m totally lost and am wondering what I can do next. Hope someone could help me.

BTW this editor killed my chrome dozens of times :(

History

#1 Updated by WANG Guoqin about 2 years ago

Some of my investigations:

  • No space seems allocatable on BlueFS, obviously. BlueFS::_allocate failed with r = alloc[id]->reserve(left)
  • m_used_blocks doesn't seem to be right because it always equals total_blocks
  • Using either bitmap allocator or stupid allocator didn't help
  • Setting different bluestore_block_db_size or bluestore_block_wal_size didn't help
  • --skip-journal-replay, --skip-mount-omap, --force options in ceph-objectstore-tool didn't help

#2 Updated by Josh Durgin about 2 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
  • Priority changed from Normal to High
  • Component(RADOS) BlueStore added

#3 Updated by WANG Guoqin about 2 years ago

More specifically, for almost any of the above ops to run, BlueStore::_mount() must be called in advance, therefore BlueFS::_fsync() and BlueFS::_flush_and_sync_log() will be called, which will definitely result in a failure in allocation.

An external exporting or fscking program, without mounting, could probably fix this problem. Otherwise, maybe we could add options to do extra fixing ops before mounting the filesystem.

#4 Updated by Sage Weil about 2 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil

rocksdb itself is doing some writes during mount that are then failing. first step is probably to make fsck operate open rocksdb in read-only mode...

#5 Updated by Sage Weil about 2 years ago

I've pushed a wip-bluefs-luminous branch (see shaman.ceph.com for builds) that adds 2 new commands to ceph-bluestore-tool:

1. 'ceph-bluestore-tool bluefs-bdev-sizes --path ...' will show you how big each device is and which parts of the device bluefs thinks it owns.
2. 'ceph-bluestore-tool bluefs-bdev-expand --path ...' will expand the bluefs owned portion to include the whole device (for wal and db only) if the block device is bigger than what is being used.

Can you give it a go? Thanks!

#6 Updated by WANG Guoqin about 2 years ago

A quick response of the result, before me taking some time to investigate,

$ sudo ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-13
infering bluefs devices from bluestore path
2017-09-09 01:05:59.466919 7fd444934e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2017-09-09 01:05:59.467106 7fd444934e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2017-09-09 01:05:59.469023 7fd444934e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
action bluefs-bdev-sizes
 slot 1 /var/lib/ceph/osd/ceph-13/block
2017-09-09 01:05:59.469865 7fd444934e00  1 bdev create path /var/lib/ceph/osd/ceph-13/block type kernel
2017-09-09 01:05:59.469937 7fd444934e00  1 bdev(0x5614e342e480 /var/lib/ceph/osd/ceph-13/block) open path /var/lib/ceph/osd/ceph-13/block
2017-09-09 01:05:59.470221 7fd444934e00  1 bdev(0x5614e342e480 /var/lib/ceph/osd/ceph-13/block) open size 21368909824 (0x4f9afb000, 20378 MB) block_size 4096 (4096 B) rotational
2017-09-09 01:05:59.470292 7fd444934e00  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-13/block size 20378 MB
2017-09-09 01:05:59.470355 7fd444934e00  1 bluefs mount
1 : size 0x4f9afb000 : own 0x[2000~400fe000,1388b0000~500000,138df0000~600000,1394c0000~600000,139bc0000~200000,139e80000~100000,13a070000~d00000,13adc0000~600000,13b460000~100000,13b650000~500000,13bbc0000~400000,13c000000~400000,13c510000~100000,13c650000~300000,13ca20000~100000,13cb90000~400000,13d250000~500000,13d770000~1500000,13ef00000~400000,13f390000~400000,13f890000~100000,13faf0000~300000,13fe70000~100000,140000000~400000,14d360000~300000,14d720000~b00000,14e2a0000~800000,14eb70000~400000,14f030000~c00000,14fc50000~300000,150000000~c00000,151680000~100000,151960000~100000,151d50000~100000,151f70000~900000,152970000~1200000,154000000~1000000,1550f0000~1100000,1563b0000~1500000,1578e0000~600000,1581c0000~1500000,18fac0000~500000,190000000~4700000,1dc000000~3300000,1df420000~b00000,1e0000000~2800000]
2017-09-09 01:05:59.485769 7fd444934e00  1 bdev(0x5614e342e480 /var/lib/ceph/osd/ceph-13/block) close
$ sudo ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-13
infering bluefs devices from bluestore path
2017-09-09 01:06:39.948481 7f8c0f353e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2017-09-09 01:06:39.948893 7f8c0f353e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2017-09-09 01:06:39.950769 7f8c0f353e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
action bluefs-bdev-expand
 slot 1 /var/lib/ceph/osd/ceph-13/block
2017-09-09 01:06:39.951710 7f8c0f353e00  1 bdev create path /var/lib/ceph/osd/ceph-13/block type kernel
2017-09-09 01:06:39.951726 7f8c0f353e00  1 bdev(0x558f9b3ca480 /var/lib/ceph/osd/ceph-13/block) open path /var/lib/ceph/osd/ceph-13/block
2017-09-09 01:06:39.951946 7f8c0f353e00  1 bdev(0x558f9b3ca480 /var/lib/ceph/osd/ceph-13/block) open size 21368909824 (0x4f9afb000, 20378 MB) block_size 4096 (4096 B) rotational
2017-09-09 01:06:39.951956 7f8c0f353e00  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-13/block size 20378 MB
2017-09-09 01:06:39.951958 7f8c0f353e00  1 bluefs mount
start:
1 : size 0x4f9afb000 : own 0x[2000~400fe000,1388b0000~500000,138df0000~600000,1394c0000~600000,139bc0000~200000,139e80000~100000,13a070000~d00000,13adc0000~600000,13b460000~100000,13b650000~500000,13bbc0000~400000,13c000000~400000,13c510000~100000,13c650000~300000,13ca20000~100000,13cb90000~400000,13d250000~500000,13d770000~1500000,13ef00000~400000,13f390000~400000,13f890000~100000,13faf0000~300000,13fe70000~100000,140000000~400000,14d360000~300000,14d720000~b00000,14e2a0000~800000,14eb70000~400000,14f030000~c00000,14fc50000~300000,150000000~c00000,151680000~100000,151960000~100000,151d50000~100000,151f70000~900000,152970000~1200000,154000000~1000000,1550f0000~1100000,1563b0000~1500000,1578e0000~600000,1581c0000~1500000,18fac0000~500000,190000000~4700000,1dc000000~3300000,1df420000~b00000,1e0000000~2800000]
/build/ceph-12.2.0-50-g94aff58/src/include/interval_set.h: In function 'T interval_set<T>::range_end() const [with T = long unsigned int]' thread 7f8c0f353e00 time 2017-09-09 01:06:39.967997
/build/ceph-12.2.0-50-g94aff58/src/include/interval_set.h: 325: FAILED assert(!empty())
 ceph version 12.2.0-50-g94aff58 (94aff58a5e9c2e4c70e3efe17c5dfa82b1f537ae) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f8c0601d5c2]
 2: (main()+0x2957) [0x558f9a176ab7]
 3: (__libc_start_main()+0xf1) [0x7f8c047d43f1]
 4: (_start()+0x29) [0x558f9a1f6349]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-09-09 01:06:39.968866 7f8c0f353e00 -1 /build/ceph-12.2.0-50-g94aff58/src/include/interval_set.h: In function 'T interval_set<T>::range_end() const [with T = long unsigned int]' thread 7f8c0f353e00 time 2017-09-09 01:06:39.967997
/build/ceph-12.2.0-50-g94aff58/src/include/interval_set.h: 325: FAILED assert(!empty())

 ceph version 12.2.0-50-g94aff58 (94aff58a5e9c2e4c70e3efe17c5dfa82b1f537ae) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f8c0601d5c2]
 2: (main()+0x2957) [0x558f9a176ab7]
 3: (__libc_start_main()+0xf1) [0x7f8c047d43f1]
 4: (_start()+0x29) [0x558f9a1f6349]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -27> 2017-09-09 01:06:39.943965 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command perfcounters_dump hook 0x558f9b3c8200
   -26> 2017-09-09 01:06:39.943977 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command 1 hook 0x558f9b3c8200
   -25> 2017-09-09 01:06:39.943980 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command perf dump hook 0x558f9b3c8200
   -24> 2017-09-09 01:06:39.943983 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command perfcounters_schema hook 0x558f9b3c8200
   -23> 2017-09-09 01:06:39.943985 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command perf histogram dump hook 0x558f9b3c8200
   -22> 2017-09-09 01:06:39.943987 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command 2 hook 0x558f9b3c8200
   -21> 2017-09-09 01:06:39.943989 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command perf schema hook 0x558f9b3c8200
   -20> 2017-09-09 01:06:39.943991 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command perf histogram schema hook 0x558f9b3c8200
   -19> 2017-09-09 01:06:39.943994 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command perf reset hook 0x558f9b3c8200
   -18> 2017-09-09 01:06:39.943996 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command config show hook 0x558f9b3c8200
   -17> 2017-09-09 01:06:39.943999 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command config help hook 0x558f9b3c8200
   -16> 2017-09-09 01:06:39.944001 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command config set hook 0x558f9b3c8200
   -15> 2017-09-09 01:06:39.944004 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command config get hook 0x558f9b3c8200
   -14> 2017-09-09 01:06:39.944006 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command config diff hook 0x558f9b3c8200
   -13> 2017-09-09 01:06:39.944008 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command config diff get hook 0x558f9b3c8200
   -12> 2017-09-09 01:06:39.944011 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command log flush hook 0x558f9b3c8200
   -11> 2017-09-09 01:06:39.944012 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command log dump hook 0x558f9b3c8200
   -10> 2017-09-09 01:06:39.944015 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command log reopen hook 0x558f9b3c8200
    -9> 2017-09-09 01:06:39.944024 7f8c0f353e00  5 asok(0x558f9b40d2c0) register_command dump_mempools hook 0x558f9b6a1268
    -8> 2017-09-09 01:06:39.948481 7f8c0f353e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
    -7> 2017-09-09 01:06:39.948893 7f8c0f353e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
    -6> 2017-09-09 01:06:39.950769 7f8c0f353e00 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
    -5> 2017-09-09 01:06:39.951710 7f8c0f353e00  1 bdev create path /var/lib/ceph/osd/ceph-13/block type kernel
    -4> 2017-09-09 01:06:39.951726 7f8c0f353e00  1 bdev(0x558f9b3ca480 /var/lib/ceph/osd/ceph-13/block) open path /var/lib/ceph/osd/ceph-13/block
    -3> 2017-09-09 01:06:39.951946 7f8c0f353e00  1 bdev(0x558f9b3ca480 /var/lib/ceph/osd/ceph-13/block) open size 21368909824 (0x4f9afb000, 20378 MB) block_size 4096 (4096 B) rotational
    -2> 2017-09-09 01:06:39.951956 7f8c0f353e00  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-13/block size 20378 MB
    -1> 2017-09-09 01:06:39.951958 7f8c0f353e00  1 bluefs mount
     0> 2017-09-09 01:06:39.968866 7f8c0f353e00 -1 /build/ceph-12.2.0-50-g94aff58/src/include/interval_set.h: In function 'T interval_set<T>::range_end() const [with T = long unsigned int]' thread 7f8c0f353e00 time 2017-09-09 01:06:39.967997
/build/ceph-12.2.0-50-g94aff58/src/include/interval_set.h: 325: FAILED assert(!empty())

 ceph version 12.2.0-50-g94aff58 (94aff58a5e9c2e4c70e3efe17c5dfa82b1f537ae) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f8c0601d5c2]
 2: (main()+0x2957) [0x558f9a176ab7]
 3: (__libc_start_main()+0xf1) [0x7f8c047d43f1]
 4: (_start()+0x29) [0x558f9a1f6349]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent       500
  max_new         1000
  log_file 
--- end dump of recent events ---
*** Caught signal (Aborted) **
 in thread 7f8c0f353e00 thread_name:ceph-bluestore-
 ceph version 12.2.0-50-g94aff58 (94aff58a5e9c2e4c70e3efe17c5dfa82b1f537ae) luminous (stable)
 1: (()+0x414c64) [0x558f9a452c64]
 2: (()+0x11670) [0x7f8c05840670]
 3: (gsignal()+0x9f) [0x7f8c047e977f]
 4: (abort()+0x16a) [0x7f8c047eb37a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x7f8c0601d74e]
 6: (main()+0x2957) [0x558f9a176ab7]
 7: (__libc_start_main()+0xf1) [0x7f8c047d43f1]
 8: (_start()+0x29) [0x558f9a1f6349]
2017-09-09 01:06:39.972532 7f8c0f353e00 -1 *** Caught signal (Aborted) **
 in thread 7f8c0f353e00 thread_name:ceph-bluestore-

 ceph version 12.2.0-50-g94aff58 (94aff58a5e9c2e4c70e3efe17c5dfa82b1f537ae) luminous (stable)
 1: (()+0x414c64) [0x558f9a452c64]
 2: (()+0x11670) [0x7f8c05840670]
 3: (gsignal()+0x9f) [0x7f8c047e977f]
 4: (abort()+0x16a) [0x7f8c047eb37a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x7f8c0601d74e]
 6: (main()+0x2957) [0x558f9a176ab7]
 7: (__libc_start_main()+0xf1) [0x7f8c047d43f1]
 8: (_start()+0x29) [0x558f9a1f6349]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2017-09-09 01:06:39.972532 7f8c0f353e00 -1 *** Caught signal (Aborted) **
 in thread 7f8c0f353e00 thread_name:ceph-bluestore-

 ceph version 12.2.0-50-g94aff58 (94aff58a5e9c2e4c70e3efe17c5dfa82b1f537ae) luminous (stable)
 1: (()+0x414c64) [0x558f9a452c64]
 2: (()+0x11670) [0x7f8c05840670]
 3: (gsignal()+0x9f) [0x7f8c047e977f]
 4: (abort()+0x16a) [0x7f8c047eb37a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x7f8c0601d74e]
 6: (main()+0x2957) [0x558f9a176ab7]
 7: (__libc_start_main()+0xf1) [0x7f8c047d43f1]
 8: (_start()+0x29) [0x558f9a1f6349]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent       500
  max_new         1000
  log_file 
--- end dump of recent events ---
Aborted

#7 Updated by WANG Guoqin about 2 years ago

Seems that I don't have a WAL bdev, do I? Maybe I created the bluefs using a too-early release?

#8 Updated by Eric Nelson almost 2 years ago

We're seeing this bug with our cachetier SSDs that have DBs located on 14G nvme partitions. Let me know if you'd like me to generate any debug info that would prove helpful. We're on 12.2.1 as well.

#9 Updated by WANG Guoqin almost 2 years ago

Eric Nelson wrote:

We're seeing this bug with our cachetier SSDs that have DBs located on 14G nvme partitions. Let me know if you'd like me to generate any debug info that would prove helpful. We're on 12.2.1 as well.

With the emergency patch made by Sage (and other efforts I did), only one OSD, among five of them, were able to be up and in, for a short time. Restarting killed it, like the other four. And then I decided to purge them. Thankfully only a few pgs were missing, and I was able to rescue most of the files. The cluster was then running normally, but some other error made me decide to recreate the cluster, and that was another story.

#10 Updated by Greg Farnum almost 2 years ago

  • Project changed from RADOS to bluestore

#11 Updated by Sage Weil almost 2 years ago

  • Status changed from In Progress to Can't reproduce

Hmm, yeah I misunderstood the original problem--I thought bluefs was on a dedicated device.

There have been several fixes to the freespace balancing since 12.2.0. If you see anything remotely like this with 12.2.1 or later please let us know!

Also available in: Atom PDF