Bug #49110 (closed)
BlueFS.cc: 1542: FAILED assert(r == 0)
Status:
Won't Fix
Priority:
Low
Assignee:
-
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
All of the SSD-based OSDs in my Ceph cluster crashed.
The initial error was:
2021-01-31 05:15:21.130636 7f81c7bd6700  1 bluefs add_block_extent bdev 1 0xe8db800000~300000
2021-01-31 05:15:34.251287 7f81c7bd6700  1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x3c0000; fallback to bdev 2
2021-01-31 05:15:34.251309 7f81c7bd6700 -1 bluefs _allocate failed to allocate 0x on bdev 2, dne
2021-01-31 05:15:34.256487 7f81c7bd6700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7f81c7bd6700 time 2021-01-31 05:15:34.251323
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/os/bluestore/BlueFS.cc: 1542: FAILED assert(r == 0)
 ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x562989083ae0]
 2: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x1704) [0x562989002804]
 3: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x2a1) [0x5629890032b1]
 4: (BlueRocksWritableFile::Sync()+0x63) [0x56298901d073]
 5: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x149) [0x562989409159]
 6: (rocksdb::WritableFileWriter::Sync(bool)+0xe8) [0x562989409e28]
 7: (rocksdb::DBImpl::WriteToWAL(rocksdb::autovector<rocksdb::WriteThread::Writer*, 8ul> const&, rocksdb::log::Writer*, bool, bool, unsigned long)+0x41a) [0x56298945598a]
 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x94b) [0x56298945656b]
 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x562989457537]
 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0xcf) [0x562988f9253f]
 11: (BlueStore::_kv_sync_thread()+0x1c6f) [0x562988f267df]
 12: (BlueStore::KVSyncThread::entry()+0xd) [0x562988f6fa6d]
 13: (()+0x7ea5) [0x7f81d8ca6ea5]
 14: (clone()+0x6d) [0x7f81d7d968dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
A fuller log is in initial-osd.108.cut.txt.
Any recovery attempt fails with:
2021-02-02 12:53:18.400571 7f5da959dec0  1 bdev(0x55670aad6600 /var/lib/ceph/osd/ceph-108/block) open path /var/lib/ceph/osd/ceph-108/block
2021-02-02 12:53:18.400716 7f5da959dec0  1 bdev(0x55670aad6600 /var/lib/ceph/osd/ceph-108/block) open size 1000203091968 (0xe8e0c00000, 932GiB) block_size 4096 (4KiB) non-rotational
2021-02-02 12:53:18.400731 7f5da959dec0  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-108/block size 932GiB
2021-02-02 12:53:18.400758 7f5da959dec0  1 bluefs mount
2021-02-02 12:53:18.401032 7f5da959dec0  1 bluefs _init_alloc id 1 alloc_size 0x10000 size 0xe8e0c00000
2021-02-02 13:21:09.304110 7f5da959dec0 -1 *** Caught signal (Segmentation fault) **
 in thread 7f5da959dec0 thread_name:ceph-osd
 ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)
 1: (()+0xa76e51) [0x5566ff53ce51]
 2: (()+0xf630) [0x7f5da6a40630]
 3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x452) [0x5566ff4efcd2]
 4: (BlueFS::_replay(bool)+0x48d) [0x5566ff50378d]
 5: (BlueFS::mount()+0x1d4) [0x5566ff507374]
 6: (BlueStore::_open_db(bool)+0x1857) [0x5566ff415827]
 7: (BlueStore::_mount(bool)+0x2f6) [0x5566ff44acf6]
 8: (OSD::init()+0x472) [0x5566fefe9762]
 9: (main()+0x2f49) [0x5566feeea719]
 10: (__libc_start_main()+0xf5) [0x7f5da5a4c555]
 11: (()+0x4c4383) [0x5566fef8a383]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
A fuller log is in no-recovery-osd.108.txt.
The OS is CentOS 7.8.2003:
# grep baseurl /etc/yum.repos.d/ceph.repo | head -1
baseurl=http://download.ceph.com/rpm-luminous/el7/$basearch
# uname -a
Linux dch2-ceph-data-013.ceph.intern.net.de 3.10.0-1127.13.1.el7.x86_64 #1 SMP Tue Jun 23 15:46:38 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
An attempt to fix one of the offending OSDs results in:
# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-109
*** Caught signal (Segmentation fault) **
 in thread 7f7258eb0ec0 thread_name:ceph-bluestore-
 ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)
 1: (()+0x4025e1) [0x557c65a7e5e1]
 2: (()+0xf630) [0x7f724dcb0630]
 3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x452) [0x557c65851992]
 4: (BlueFS::_replay(bool)+0x48d) [0x557c6586544d]
 5: (BlueFS::mount()+0x1d4) [0x557c65869034]
 6: (BlueStore::_open_db(bool)+0x1857) [0x557c658c06e7]
 7: (BlueStore::_fsck(bool, bool)+0x3c7) [0x557c658f65c7]
 8: (main()+0xf24) [0x557c657a8c24]
 9: (__libc_start_main()+0xf5) [0x7f724c681555]
 10: (()+0x1c444f) [0x557c6584044f]
2021-02-02 17:17:19.152141 7f7258eb0ec0 -1 *** Caught signal (Segmentation fault) **
# gdb ceph-bluestore-tool
(gdb) run fsck --path /var/lib/ceph/osd/ceph-109
...
Program received signal SIGSEGV, Segmentation fault.
0x0000555555729992 in BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*) ()
(gdb) bt
#0  0x0000555555729992 in BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*) ()
#1  0x000055555573d44d in BlueFS::_replay(bool) ()
#2  0x0000555555741034 in BlueFS::mount() ()
#3  0x00005555557986e7 in BlueStore::_open_db(bool) ()
#4  0x00005555557ce5c7 in BlueStore::_fsck(bool, bool) ()
#5  0x0000555555680c24 in main ()
(gdb) info locals
No symbol table info available.
(gdb)
# ceph osd tree
...
-37        10.91638 host dch2-ceph-data-012
102   sbs   1.81940     osd.102    up  1.00000 1.00000
103   sbs   1.81940     osd.103    up  1.00000 1.00000
104   sbs   1.81940     osd.104    up  1.00000 1.00000
105   sbs   1.81940     osd.105    up  1.00000 1.00000
106   sbs   1.81940     osd.106    up  1.00000 1.00000
107   sbs   1.81940     osd.107    up  1.00000 1.00000
-40         5.45819 host dch2-ceph-data-013
108 gpssd   0.90970     osd.108  down        0 1.00000
109 gpssd   0.90970     osd.109  down        0 1.00000
110 gpssd   0.90970     osd.110  down        0 1.00000
111 gpssd   0.90970     osd.111  down        0 1.00000
112 gpssd   0.90970     osd.112  down        0 1.00000
...
I will disconnect the affected servers from my production cluster tomorrow. I can then isolate them and, if necessary, reconfigure the network to move them into a staging cluster.