Bug #44774

ceph-bluestore-tool --command bluefs-bdev-new-wal may damage bluefs

Added by Honggang Yang 6 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport: nautilus, octopus
Regression: No
Severity: 1 - critical

Description

bluefs-bdev-new-wal

# ceph-bluestore-tool --command bluefs-bdev-new-wal --path dev/osd0 --dev-target dev/osd0/mywal

The process of this command is as follows:

1. mount bluefs

r = _mount_for_bluefs();

2. init NEWWAL/NEWDB devices so that they can service io requests

r = bluefs->add_block_device(...);

3. set label for the new device

r = _check_or_set_bdev_label(..., "bluefs wal", ...);

4. remount bluefs

bluefs->umount();
bluefs->mount();

5. add new device's space to bluefs

bluefs->add_block_extent(id, reserved, bluefs->get_block_device_size(id) - reserved);

This adds an 'init_add_free' record to the old journal and persists it to disk.

6. compact journal and write new journal to new device

_compact_log_dump_metadata(&t, flags);
...
encode(t, bl);
...
log_file->fnode.size = bl.length();
...
log_writer = _create_writer(log_file);
log_writer->append(bl);
r = _flush(log_writer, true);
...
flush_bdev();

7. update super block and persist to disk

...
++super.version;
_write_super(super_dev);
flush_bdev();
...

Problem

If this process is interrupted after step 5 and before step 7, bluefs cannot boot up again.

Step 5 adds an 'init_add_free' record for the new device to the old journal, which causes the osd to core during the bluefs replay stage.
During bluefs replay, the new device is BDEV_WAL (not BDEV_NEWWAL), but the 'init_add_free' record's content is as follows:

op_alloc_add(BDEV_NEWWAL:SSS-LLL)

block[BDEV_NEWWAL] is null at this point, so applying the record causes the osd to core.

 alloc[id]->init_add_free(offset, length); /// id is BDEV_NEWWAL

On the latest ceph master branch, this causes a divide-by-zero error.

Program terminated with signal SIGFPE, Arithmetic exception.
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f733ff06d80 (LWP 1162174))]
(gdb)
(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00005616f7ee1d22 in reraise_fatal (signum=8) at /home/new/ceph/src/global/signal_handler.cc:87
#2  0x00005616f7ee3160 in handle_fatal_signal (signum=8) at /home/new/ceph/src/global/signal_handler.cc:332
#3  <signal handler called>
#4  0x00005616f7b51478 in round_up_to<unsigned long, unsigned long> (n=0, d=0) at /home/new/ceph/src/include/intarith.h:28
#5  0x00005616f7e7de13 in apply_for_bitset_range<boost::dynamic_bitset<long unsigned int>, BlueFS::_replay(bool, bool)::<lambda(uint64_t, boost::dynamic_bitset<long unsigned int>&)> >(uint64_t, uint64_t, uint64_t, boost::dynamic_bitset<unsigned long, std::allocator<unsigned long> > &, BlueFS::<lambda(uint64_t, boost::dynamic_bitset<long unsigned int, std::allocator<long unsigned int> >&)>) (off=0,
    len=0, granularity=0, bitset=..., f=...) at /home/new/ceph/src/os/bluestore/bluestore_common.h:27
#6  0x00005616f7e61352 in BlueFS::_replay (this=0x5617048ea000, noop=false, to_stdout=false) at /home/new/ceph/src/os/bluestore/BlueFS.cc:1190
#7  0x00005616f7e5b139 in BlueFS::mount (this=0x5617048ea000) at /home/new/ceph/src/os/bluestore/BlueFS.cc:661
#8  0x00005616f7cd91f9 in BlueStore::_open_bluefs (this=0x561703c95000, create=false) at /home/new/ceph/src/os/bluestore/BlueStore.cc:5474
#9  0x00005616f7cdaf50 in BlueStore::_open_db (this=0x561703c95000, create=false, to_repair_db=false, read_only=true) at /home/new/ceph/src/os/bluestore/BlueStore.cc:5682
#10 0x00005616f7cd99e7 in BlueStore::_open_db_and_around (this=0x561703c95000, read_only=true) at /home/new/ceph/src/os/bluestore/BlueStore.cc:5530
#11 0x00005616f7cf3976 in BlueStore::_fsck (this=0x561703c95000, depth=BlueStore::FSCK_REGULAR, repair=false) at /home/new/ceph/src/os/bluestore/BlueStore.cc:8156
#12 0x00005616f7d5eeca in BlueStore::fsck (this=0x561703c95000, deep=false) at /home/new/ceph/src/os/bluestore/BlueStore.h:2512
#13 0x00005616f7ce9ffd in BlueStore::_mount (this=0x561703c95000, kv_only=false, open_db=true) at /home/new/ceph/src/os/bluestore/BlueStore.cc:6948
#14 0x00005616f7d5ee92 in BlueStore::mount (this=0x561703c95000) at /home/new/ceph/src/os/bluestore/BlueStore.h:2493
#15 0x00005616f74b40ed in OSD::init (this=0x5617049fc000) at /home/new/ceph/src/osd/OSD.cc:3287
#16 0x00005616f7479ef1 in main (argc=5, argv=0x7ffe44f36fa8) at /home/new/ceph/src/ceph_osd.cc:703
(gdb) f 5
#5  0x00005616f7e7de13 in apply_for_bitset_range<boost::dynamic_bitset<long unsigned int>, BlueFS::_replay(bool, bool)::<lambda(uint64_t, boost::dynamic_bitset<long unsigned int>&)> >(uint64_t, uint64_t, uint64_t, boost::dynamic_bitset<unsigned long, std::allocator<unsigned long> > &, BlueFS::<lambda(uint64_t, boost::dynamic_bitset<long unsigned int, std::allocator<long unsigned int> >&)>) (off=0,
    len=0, granularity=0, bitset=..., f=...) at /home/new/ceph/src/os/bluestore/bluestore_common.h:27
27        auto end = round_up_to(off + len, granularity) / granularity;
(gdb) l
22      void apply_for_bitset_range(uint64_t off,
23        uint64_t len,
24        uint64_t granularity,
25        Bitset &bitset,
26        Func f) {
27        auto end = round_up_to(off + len, granularity) / granularity;
28        ceph_assert(end <= bitset.size());
29        uint64_t pos = off / granularity;
30        while (pos < end) {
31          f(pos, bitset);
(gdb) p granularity
$1 = 0
(gdb) f 6
#6  0x00005616f7e61352 in BlueFS::_replay (this=0x5617048ea000, noop=false, to_stdout=false) at /home/new/ceph/src/os/bluestore/BlueFS.cc:1190
1190                  apply_for_bitset_range(offset, length, alloc_size[id], owned_blocks[id],
(gdb) l
1185                  alloc[id]->init_add_free(offset, length);
1186                }
1187
1188                if (cct->_conf->bluefs_log_replay_check_allocations) {
1189                  bool fail = false;
1190                  apply_for_bitset_range(offset, length, alloc_size[id], owned_blocks[id],
1191                    [&](uint64_t pos, boost::dynamic_bitset<uint64_t> &bs) {
1192                      if (bs.test(pos)) {
1193                        fail = true;
1194                      } else {
(gdb) p alloc
$2 = std::vector of length 5, capacity 5 = {0x561703c103c0, 0x561703c11e00, 0x0, 0x0, 0x0}
(gdb) p id
$3 = 3 '\003'
(gdb) p offset
$4 = 0
(gdb) p length
$5 = 0
(gdb)
$6 = 0
(gdb) p alloc_size
$7 = std::vector of length 5, capacity 5 = {1048576, 65536, 0, 0, 0}
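
The session above shows alloc[3] == 0x0 and alloc_size[3] == 0, so the replay both dereferences a missing allocator and divides by zero. A minimal sketch of a defensive check follows. It is not the actual BlueFS code or the fix from the pull request: DeviceSlot, round_up_to_nonzero and checked_end_index are hypothetical names introduced only for illustration, and the rounding helper is assumed to behave like round_up_to for a non-zero divisor.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for the per-device state consulted during replay.
struct DeviceSlot {
  bool has_allocator;   // models alloc[id] != nullptr
  uint64_t alloc_size;  // models alloc_size[id]
};

// Rounds n up to a multiple of d. Only valid for d != 0:
// n % d with d == 0 is the operation that raises SIGFPE.
inline uint64_t round_up_to_nonzero(uint64_t n, uint64_t d) {
  return (n % d) ? (n + d - n % d) : n;
}

// Returns the end block index for [offset, offset + length), or
// std::nullopt when the record names a device that cannot be replayed
// (id out of range, no allocator, or a zero allocation unit).
std::optional<uint64_t> checked_end_index(const std::vector<DeviceSlot>& devs,
                                          unsigned id,
                                          uint64_t offset,
                                          uint64_t length) {
  if (id >= devs.size()) {
    return std::nullopt;
  }
  const DeviceSlot& slot = devs[id];
  if (!slot.has_allocator || slot.alloc_size == 0) {
    return std::nullopt;
  }
  return round_up_to_nonzero(offset + length, slot.alloc_size) / slot.alloc_size;
}
```

With the values printed by gdb (five slots, only the first two populated, id = 3), checked_end_index returns std::nullopt instead of reaching the division. Such a check would only turn the crash into a reportable error; the stale record would still be in the journal.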

Proposal

Don't write the 'init_add_free' record for the new device to the old journal.

With this change, if the process is interrupted just before step 7, all goes well. Of course, you have to delete the wal link before starting the osd.
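
The effect of the proposal on the crash window can be sketched with a toy model. This is an illustration under stated assumptions, not the code from the pull request: Disk, CrashPoint, attach_new_wal and replay_ok are hypothetical names, journals are reduced to the list of device ids their records reference, and the device id values are assumed to match the BlueFS constants (id 3 for BDEV_NEWWAL agrees with the gdb output). The model also assumes that the compacted journal refers to the new device as BDEV_WAL, and it ignores the block.wal link.

```cpp
#include <cstdint>
#include <vector>

// Device ids, assumed to mirror the BlueFS constants.
constexpr uint8_t BDEV_WAL = 0;
constexpr uint8_t BDEV_DB = 1;
constexpr uint8_t MAX_BDEV = 3;     // ids >= MAX_BDEV have no allocator after a normal mount
constexpr uint8_t BDEV_NEWWAL = 3;

// Toy on-disk state: each journal is the list of device ids its records reference.
struct Disk {
  std::vector<uint8_t> old_journal;
  std::vector<uint8_t> new_journal;
  bool super_points_to_new = false;  // flipped by step 7
};

enum class CrashPoint { None, BeforeSuperWrite };

// Models steps 5 to 7. log_extent_to_old_journal == true is the buggy
// behaviour; false is the proposal.
Disk attach_new_wal(bool log_extent_to_old_journal, CrashPoint crash) {
  Disk d;
  d.old_journal = {BDEV_DB};
  if (log_extent_to_old_journal) {
    d.old_journal.push_back(BDEV_NEWWAL);  // step 5: record persisted to old journal
  }
  d.new_journal = {BDEV_DB, BDEV_WAL};     // step 6: compacted journal on the new device
  if (crash == CrashPoint::BeforeSuperWrite) {
    return d;                              // interrupted before step 7
  }
  d.super_points_to_new = true;            // step 7: super block now selects the new journal
  return d;
}

// Replay succeeds only if every record names a device that has an allocator.
bool replay_ok(const Disk& d) {
  const std::vector<uint8_t>& journal =
      d.super_points_to_new ? d.new_journal : d.old_journal;
  for (uint8_t id : journal) {
    if (id >= MAX_BDEV) {
      return false;
    }
  }
  return true;
}
```

In this model, replay_ok(attach_new_wal(true, CrashPoint::BeforeSuperWrite)) is false, while the same interruption with log_extent_to_old_journal set to false replays cleanly because the old journal was never touched.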


Related issues

Copied to bluestore - Backport #45044: octopus: ceph-bluestore-tool --command bluefs-bdev-new-wal may damage bluefs Resolved
Copied to bluestore - Backport #45045: nautilus: ceph-bluestore-tool --command bluefs-bdev-new-wal may damage bluefs Resolved

History

#2 Updated by Honggang Yang 6 months ago

How to reproduce this problem

1. modify your vstart.sh

  1. git diff ../src/vstart.sh
    diff --git a/src/vstart.sh b/src/vstart.sh
    index c81974dd2c..85c4121320 100755
    --- a/src/vstart.sh
    +++ b/src/vstart.sh
    @@ -657,12 +657,12 @@ EOF
    bluestore_block_wal_create = false
    bluestore_spdk_mem = 2048"
    else
    - BLUESTORE_OPTS=" bluestore block db path = $CEPH_DEV_DIR/osd\$id/block.db.file
    - bluestore block db size = 1073741824
    - bluestore block db create = true
    - bluestore block wal path = $CEPH_DEV_DIR/osd\$id/block.wal.file
    - bluestore block wal size = 1048576000
    - bluestore block wal create = true"
    + BLUESTORE_OPTS=" bluestore block db path = \"\"
    + bluestore block db size = 0
    + bluestore block db create = false
    + bluestore block wal path = \"\"
    + bluestore block wal size = 0
    + bluestore block wal create = false"
    fi
    fi

2. setup a cluster with only one bluestore osd

  1. MON=1 OSD=1 RGW=0 MDS=0 ../src/vstart.sh --without-dashboard -n
  1. ceph-bluestore-tool --command show-label --path dev/osd0/
    inferring bluefs devices from bluestore path
    {
        "dev/osd0/block": {
            "osd_uuid": "fa5e0b34-5df5-4d57-a55b-b1f2c0eef34a",
            "size": 107374182400,
            "btime": "2020-03-31T01:21:56.758548+0000",
            "description": "main",
            "bluefs": "1",
            "ceph_fsid": "253468e7-a38f-4c06-b7c5-77cb183b3f87",
            "kv_backend": "rocksdb",
            "magic": "ceph osd volume v026",
            "mkfs_done": "yes",
            "osd_key": "AQA0m4JevaEwHhAAn5Lx8hOvDnG7e6TOy1ENTg==",
            "ready": "ready",
            "require_osd_release": "15",
            "whoami": "0"
        }
    }

3. create a new wal device file

  1. truncate dev/osd0/mywal --size 5G

4. attach new wal device to bluestore and interrupt before _write_super() is executed

  1. ../src/stop.sh
  2. gdb ceph-bluestore-tool
    GNU gdb (Ubuntu 8.2-0ubuntu1~18.04) 8.2
    Copyright (C) 2018 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ceph-bluestore-tool...done.
(gdb) set args --command bluefs-bdev-new-wal --path dev/osd0/ --dev-target dev/osd0/mywal
(gdb) b _write_super
Breakpoint 1 at 0x9ca1fb: file /home/yhg/ceph/src/os/bluestore/BlueFS.cc, line 737.
(gdb) run
Starting program: /home/yhg/ceph/build/bin/ceph-bluestore-tool --command bluefs-bdev-new-wal --path dev/osd0/ --dev-target dev/osd0/mywal
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
inferring bluefs devices from bluestore path
[New Thread 0x7fffe96a9700 (LWP 2903708)]
[New Thread 0x7fffe8ea8700 (LWP 2903709)]
[New Thread 0x7fffe86a7700 (LWP 2903710)]
[New Thread 0x7fffe7ea6700 (LWP 2903711)]
[New Thread 0x7fffe76a5700 (LWP 2903712)]
[New Thread 0x7fffe6ea4700 (LWP 2903713)]

Thread 1 "ceph-bluestore-" hit Breakpoint 1, BlueFS::_write_super (this=0x555557576c00, dev=1) at /home/yhg/ceph/src/os/bluestore/BlueFS.cc:737
737 {
(gdb) quit
A debugging session is active.

Inferior 1 [process 2903131] will be killed.

Quit anyway? (y or n) y

5. check devices

  1. ceph-bluestore-tool --command show-label --path dev/osd0/
    inferring bluefs devices from bluestore path
    {
        "dev/osd0/block": {
            "osd_uuid": "ee20b980-ecd8-45bc-80f9-5967ed0e1786",
            "size": 107374182400,
            "btime": "2020-03-31T01:39:06.680805+0000",
            "description": "main",
            "bluefs": "1",
            "ceph_fsid": "948cad3c-ebfc-4fba-bb0a-9efe04893023",
            "kv_backend": "rocksdb",
            "magic": "ceph osd volume v026",
            "mkfs_done": "yes",
            "osd_key": "AQA6n4JeDfkABRAAXklvHPkIiVgKjKogtoHdMg==",
            "ready": "ready",
            "require_osd_release": "15",
            "whoami": "0"
        },
        "dev/osd0/block.wal": {
            "osd_uuid": "ee20b980-ecd8-45bc-80f9-5967ed0e1786",
            "size": 5368709120,
            "btime": "2020-03-31T01:43:08.619388+0000",
            "description": "bluefs wal"
        }
    }

You can never boot up this osd now.

6. even after deleting block.wal, you still cannot boot up this osd

  1. rm dev/osd0/block.wal

#3 Updated by Kefu Chai 6 months ago

  • Assignee set to Honggang Yang
  • Pull request ID set to 34219

#4 Updated by Kefu Chai 6 months ago

  • Status changed from New to Pending Backport
  • Backport set to nautilus, octopus

#5 Updated by Nathan Cutler 6 months ago

  • Copied to Backport #45044: octopus: ceph-bluestore-tool --command bluefs-bdev-new-wal may damage bluefs added

#6 Updated by Nathan Cutler 6 months ago

  • Copied to Backport #45045: nautilus: ceph-bluestore-tool --command bluefs-bdev-new-wal may damage bluefs added

#7 Updated by Nathan Cutler 4 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
