Bug #44774

ceph-bluestore-tool --command bluefs-bdev-new-wal may damage bluefs

Added by Honggang Yang about 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: nautilus, octopus
Regression: No
Severity: 1 - critical
Reviewed: -
Description

bluefs-bdev-new-wal

# ceph-bluestore-tool --command bluefs-bdev-new-wal --path dev/osd0 --dev-target dev/osd0/mywal

The process of this command is as follows:

1. mount bluefs

r = _mount_for_bluefs();

2. init the NEWWAL/NEWDB device so it can service I/O requests

r = bluefs->add_block_device(...);

3. set label for the new device

r = _check_or_set_bdev_label(..., "bluefs wal", ...);

4. remount bluefs

bluefs->umount();
bluefs->mount();

5. add the new device's space to bluefs

bluefs->add_block_extent(id, reserved, bluefs->get_block_device_size(id) - reserved);

This adds an 'init_add_free' (op_alloc_add) record for the new device to the old journal and persists it to disk; see the sketch after this list.

6. compact the journal and write the new journal to the new device

_compact_log_dump_metadata(&t, flags);
...
encode(t, bl);
...
log_file->fnode.size = bl.length();
...
log_writer = _create_writer(log_file);
log_writer->append(bl);
r = _flush(log_writer, true);
...
flush_bdev();

7. update super block and persist to disk

...
++super.version;
_write_super(super_dev);
flush_bdev();
...
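
To make the failure window concrete, here is a minimal, self-contained model (illustrative names, not Ceph source) of what step 5 effectively does: the grant of the new device's space is journaled as an op_alloc_add record keyed by the temporary BDEV_NEWWAL id, and that record is persisted into the old journal.

// Minimal model (illustrative, not Ceph source) of step 5: granting space
// to the new device appends an op_alloc_add record, keyed by the temporary
// BDEV_NEWWAL id, to the *old* journal, which is then persisted to disk.
#include <cstdint>
#include <cstdio>
#include <vector>

struct JournalOp { unsigned id; uint64_t off, len; };   // op_alloc_add payload

struct OldJournal {
    std::vector<JournalOp> ops;
    void op_alloc_add(unsigned id, uint64_t off, uint64_t len) {
        ops.push_back({id, off, len});                  // step 5 appends here
    }
};

int main() {
    const unsigned BDEV_NEWWAL = 3;                     // temporary device id
    OldJournal log_t;
    // add_block_extent(BDEV_NEWWAL, ...): the grant is journaled under the
    // temporary id (sizes here are arbitrary examples).
    log_t.op_alloc_add(BDEV_NEWWAL, 4096, 1ull << 30);
    // After a crash before step 7, replay still finds this record while the
    // device is attached as BDEV_WAL, so slot 3 is empty: the bug below.
    std::printf("op_alloc_add(id=%u: 0x%llx-0x%llx)\n", log_t.ops[0].id,
                (unsigned long long)log_t.ops[0].off,
                (unsigned long long)log_t.ops[0].len);
    return 0;
}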

Problem

If this process is interrupted after step 5 and before step 7, bluefs cannot be mounted again.

Step 5 added an 'init_add_free' (op_alloc_add) record for the new device to the old journal, which crashes the OSD during the bluefs replay stage.
During bluefs replay, the new device is now attached as BDEV_WAL (not BDEV_NEWWAL), but the stale record's content is as follows:

op_alloc_add(BDEV_NEWWAL:SSS-LLL)

block[BDEV_NEWWAL] is null at this point, so this crashes the OSD:

 alloc[id]->init_add_free(offset, length); /// id is BDEV_NEWWAL

On the latest Ceph master branch, it causes a divide-by-zero error instead:

Program terminated with signal SIGFPE, Arithmetic exception.
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f733ff06d80 (LWP 1162174))]
(gdb)
(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00005616f7ee1d22 in reraise_fatal (signum=8) at /home/new/ceph/src/global/signal_handler.cc:87
#2  0x00005616f7ee3160 in handle_fatal_signal (signum=8) at /home/new/ceph/src/global/signal_handler.cc:332
#3  <signal handler called>
#4  0x00005616f7b51478 in round_up_to<unsigned long, unsigned long> (n=0, d=0) at /home/new/ceph/src/include/intarith.h:28
#5  0x00005616f7e7de13 in apply_for_bitset_range<boost::dynamic_bitset<long unsigned int>, BlueFS::_replay(bool, bool)::<lambda(uint64_t, boost::dynamic_bitset<long unsigned int>&)> >(uint64_t, uint64_t, uint64_t, boost::dynamic_bitset<unsigned long, std::allocator<unsigned long> > &, BlueFS::<lambda(uint64_t, boost::dynamic_bitset<long unsigned int, std::allocator<long unsigned int> >&)>) (off=0,
    len=0, granularity=0, bitset=..., f=...) at /home/new/ceph/src/os/bluestore/bluestore_common.h:27
#6  0x00005616f7e61352 in BlueFS::_replay (this=0x5617048ea000, noop=false, to_stdout=false) at /home/new/ceph/src/os/bluestore/BlueFS.cc:1190
#7  0x00005616f7e5b139 in BlueFS::mount (this=0x5617048ea000) at /home/new/ceph/src/os/bluestore/BlueFS.cc:661
#8  0x00005616f7cd91f9 in BlueStore::_open_bluefs (this=0x561703c95000, create=false) at /home/new/ceph/src/os/bluestore/BlueStore.cc:5474
#9  0x00005616f7cdaf50 in BlueStore::_open_db (this=0x561703c95000, create=false, to_repair_db=false, read_only=true) at /home/new/ceph/src/os/bluestore/BlueStore.cc:5682
#10 0x00005616f7cd99e7 in BlueStore::_open_db_and_around (this=0x561703c95000, read_only=true) at /home/new/ceph/src/os/bluestore/BlueStore.cc:5530
#11 0x00005616f7cf3976 in BlueStore::_fsck (this=0x561703c95000, depth=BlueStore::FSCK_REGULAR, repair=false) at /home/new/ceph/src/os/bluestore/BlueStore.cc:8156
#12 0x00005616f7d5eeca in BlueStore::fsck (this=0x561703c95000, deep=false) at /home/new/ceph/src/os/bluestore/BlueStore.h:2512
#13 0x00005616f7ce9ffd in BlueStore::_mount (this=0x561703c95000, kv_only=false, open_db=true) at /home/new/ceph/src/os/bluestore/BlueStore.cc:6948
#14 0x00005616f7d5ee92 in BlueStore::mount (this=0x561703c95000) at /home/new/ceph/src/os/bluestore/BlueStore.h:2493
#15 0x00005616f74b40ed in OSD::init (this=0x5617049fc000) at /home/new/ceph/src/osd/OSD.cc:3287
#16 0x00005616f7479ef1 in main (argc=5, argv=0x7ffe44f36fa8) at /home/new/ceph/src/ceph_osd.cc:703
(gdb) f 5
#5  0x00005616f7e7de13 in apply_for_bitset_range<boost::dynamic_bitset<long unsigned int>, BlueFS::_replay(bool, bool)::<lambda(uint64_t, boost::dynamic_bitset<long unsigned int>&)> >(uint64_t, uint64_t, uint64_t, boost::dynamic_bitset<unsigned long, std::allocator<unsigned long> > &, BlueFS::<lambda(uint64_t, boost::dynamic_bitset<long unsigned int, std::allocator<long unsigned int> >&)>) (off=0,
    len=0, granularity=0, bitset=..., f=...) at /home/new/ceph/src/os/bluestore/bluestore_common.h:27
27        auto end = round_up_to(off + len, granularity) / granularity;
(gdb) l
22      void apply_for_bitset_range(uint64_t off,
23        uint64_t len,
24        uint64_t granularity,
25        Bitset &bitset,
26        Func f) {
27        auto end = round_up_to(off + len, granularity) / granularity;
28        ceph_assert(end <= bitset.size());
29        uint64_t pos = off / granularity;
30        while (pos < end) {
31          f(pos, bitset);
(gdb) p granularity
$1 = 0
(gdb) f 6
#6  0x00005616f7e61352 in BlueFS::_replay (this=0x5617048ea000, noop=false, to_stdout=false) at /home/new/ceph/src/os/bluestore/BlueFS.cc:1190
1190                  apply_for_bitset_range(offset, length, alloc_size[id], owned_blocks[id],
(gdb) l
1185                  alloc[id]->init_add_free(offset, length);
1186                }
1187
1188                if (cct->_conf->bluefs_log_replay_check_allocations) {
1189                  bool fail = false;
1190                  apply_for_bitset_range(offset, length, alloc_size[id], owned_blocks[id],
1191                    [&](uint64_t pos, boost::dynamic_bitset<uint64_t> &bs) {
1192                      if (bs.test(pos)) {
1193                        fail = true;
1194                      } else {
(gdb) p alloc
$2 = std::vector of length 5, capacity 5 = {0x561703c103c0, 0x561703c11e00, 0x0, 0x0, 0x0}
(gdb) p id
$3 = 3 '\003'
(gdb) p offset
$4 = 0
(gdb) p length
$5 = 0
(gdb)
$6 = 0
(gdb) p alloc_size
$7 = std::vector of length 5, capacity 5 = {1048576, 65536, 0, 0, 0}
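
The state gdb shows above is easy to model in isolation. The following self-contained sketch (not Ceph source; round_up_to is simplified) reproduces the arithmetic: only slots 0 and 1 of alloc_size are populated after the remount, yet the stale record targets id 3, so the granularity passed to apply_for_bitset_range is 0 and the division faults.

// Self-contained model (not Ceph source) of the replay failure shown above:
// the stale op_alloc_add targets device id 3, but after remount only slots
// 0 and 1 are populated, so alloc_size[3] == 0 and round_up_to divides by 0.
#include <cstdint>
#include <vector>

uint64_t round_up_to(uint64_t n, uint64_t d) {
    return ((n + d - 1) / d) * d;        // raises SIGFPE when d == 0
}

int main() {
    std::vector<uint64_t> alloc_size = {1048576, 65536, 0, 0, 0};
    unsigned id = 3;                     // BDEV_NEWWAL, from the old journal
    uint64_t offset = 0, length = 0;
    // Mirrors apply_for_bitset_range(): granularity = alloc_size[id] = 0.
    uint64_t end = round_up_to(offset + length, alloc_size[id]) / alloc_size[id];
    return (int)end;                     // never reached; SIGFPE above
}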

Proposal

Don't write the 'init_add_free' record for the new device to the old journal.

With that change, if the process is interrupted just before step 7, all goes well; of course, you have to delete the wal link before starting the OSD. A sketch of this approach follows.
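
This is a hedged sketch of the proposed behavior (illustrative names and signatures, not the actual Ceph patch): grant the new device's extent in memory only and skip the journal record, so an interruption before step 7 leaves the old journal free of any BDEV_NEWWAL reference. The compacted journal written in step 6 records the final ownership instead.

// Sketch of the proposal (illustrative names, not the actual Ceph patch):
// grant the new device's extent in memory only and skip the op_alloc_add
// journal record, so the old journal never references BDEV_NEWWAL.
#include <cstdint>
#include <utility>
#include <vector>

struct JournalOp { unsigned id; uint64_t off, len; };

struct BlueFSModel {
    std::vector<std::pair<uint64_t, uint64_t>> block_all[5]; // per-device extents
    std::vector<JournalOp> log_t;                            // pending journal txn

    void add_block_extent(unsigned id, uint64_t off, uint64_t len,
                          bool journal_it) {
        block_all[id].push_back({off, len});   // in-memory grant, always
        if (journal_it)
            log_t.push_back({id, off, len});   // skipped for NEWWAL/NEWDB
    }
};

int main() {
    const unsigned BDEV_NEWWAL = 3;
    BlueFSModel fs;
    // Step 5, patched: no op_alloc_add(BDEV_NEWWAL:...) lands in the old
    // journal; the compacted journal from step 6 records ownership instead.
    fs.add_block_extent(BDEV_NEWWAL, 4096, 1ull << 30, /*journal_it=*/false);
    return fs.log_t.empty() ? 0 : 1;           // old journal stays clean
}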


Related issues (2: 0 open, 2 closed)

Copied to bluestore - Backport #45044: octopus: ceph-bluestore-tool --command bluefs-bdev-new-wal may damage bluefs (Resolved, Nathan Cutler)
Copied to bluestore - Backport #45045: nautilus: ceph-bluestore-tool --command bluefs-bdev-new-wal may damage bluefs (Resolved, Nathan Cutler)