Bug #49110 (closed)

BlueFS.cc: 1542: FAILED assert(r == 0)

Added by Andreas Buschmann over 3 years ago. Updated about 1 year ago.

Status:
Won't Fix
Priority:
Low
Assignee:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

All of the SSD-based OSDs in my Ceph cluster crashed.

The initial error was:

2021-01-31 05:15:21.130636 7f81c7bd6700  1 bluefs add_block_extent bdev 1 0xe8db800000~300000
2021-01-31 05:15:34.251287 7f81c7bd6700  1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x3c0000; fallback to bdev 2
2021-01-31 05:15:34.251309 7f81c7bd6700 -1 bluefs _allocate failed to allocate 0x on bdev 2, dne
2021-01-31 05:15:34.256487 7f81c7bd6700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2
.13/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7f81c7bd6700 time 2021-01-31 05:15:34.251323
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/os/bluestore/BlueFS.cc: 1542: FAILED assert(r == 0)

 ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x562989083ae0]
 2: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x1704) [0x562989002804]
 3: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x2a1) [0x5629890032b1]
 4: (BlueRocksWritableFile::Sync()+0x63) [0x56298901d073]
 5: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x149) [0x562989409159]
 6: (rocksdb::WritableFileWriter::Sync(bool)+0xe8) [0x562989409e28]
 7: (rocksdb::DBImpl::WriteToWAL(rocksdb::autovector<rocksdb::WriteThread::Writer*, 8ul> const&, rocksdb::log::Writer*, bool, bool, unsigned long)+0x41a) [0x56298945598a]
 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x94b) [0x56298945656b]
 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x562989457537]
 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0xcf) [0x562988f9253f]
 11: (BlueStore::_kv_sync_thread()+0x1c6f) [0x562988f267df]
 12: (BlueStore::KVSyncThread::entry()+0xd) [0x562988f6fa6d]
 13: (()+0x7ea5) [0x7f81d8ca6ea5]
 14: (clone()+0x6d) [0x7f81d7d968dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

A fuller log is attached as initial-osd.108.cut.txt.
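The assert fires after BlueFS fails to allocate 0x400000 bytes on bdev 1 (only 0x3c0000 were free) and bdev 2 does not exist ("dne"), so there is nothing to fall back to. For reference, BlueFS space usage can be checked on a still-running OSD through the admin socket; a minimal sketch, assuming the default admin socket setup and that this build exposes the bluefs perf counters (db_total_bytes, db_used_bytes):

# ceph daemon osd.108 perf dump bluefs        # bluefs counters only, if the logger argument is supported on this build
# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-108   # offline equivalent, if the subcommand exists in 12.2.13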

Any attempt at recovery fails with:

2021-02-02 12:53:18.400571 7f5da959dec0  1 bdev(0x55670aad6600 /var/lib/ceph/osd/ceph-108/block) open path /var/lib/ceph/osd/ceph-108/block
2021-02-02 12:53:18.400716 7f5da959dec0  1 bdev(0x55670aad6600 /var/lib/ceph/osd/ceph-108/block) open size 1000203091968 (0xe8e0c00000, 932GiB) block_size 4096 (4KiB) non-rotational
2021-02-02 12:53:18.400731 7f5da959dec0  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-108/block size 932GiB
2021-02-02 12:53:18.400758 7f5da959dec0  1 bluefs mount
2021-02-02 12:53:18.401032 7f5da959dec0  1 bluefs _init_alloc id 1 alloc_size 0x10000 size 0xe8e0c00000
2021-02-02 13:21:09.304110 7f5da959dec0 -1 *** Caught signal (Segmentation fault) **
 in thread 7f5da959dec0 thread_name:ceph-osd

 ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)
 1: (()+0xa76e51) [0x5566ff53ce51]
 2: (()+0xf630) [0x7f5da6a40630]
 3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x452) [0x5566ff4efcd2]
 4: (BlueFS::_replay(bool)+0x48d) [0x5566ff50378d]
 5: (BlueFS::mount()+0x1d4) [0x5566ff507374]
 6: (BlueStore::_open_db(bool)+0x1857) [0x5566ff415827]
 7: (BlueStore::_mount(bool)+0x2f6) [0x5566ff44acf6]
 8: (OSD::init()+0x472) [0x5566fefe9762]
 9: (main()+0x2f49) [0x5566feeea719]
 10: (__libc_start_main()+0xf5) [0x7f5da5a4c555]
 11: (()+0x4c4383) [0x5566fef8a383]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

A fuller log is attached as no-recovery-osd.108.txt.
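To see exactly where the replay dies, the OSD can be run in the foreground with verbose BlueFS logging; a sketch, where the debug levels and the output path are assumptions rather than anything from this report:

# ceph-osd -i 108 -f --debug-bluefs 20/20 --debug-bdev 20/20 2>&1 | tee /tmp/ceph-osd.108.replay.log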

The OS is CentOS 7.8.2003:

# grep baseurl /etc/yum.repos.d/ceph.repo | head -1
baseurl=http://download.ceph.com/rpm-luminous/el7/$basearch
# uname -a
Linux dch2-ceph-data-013.ceph.intern.net.de 3.10.0-1127.13.1.el7.x86_64 #1 SMP Tue Jun 23 15:46:38 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

An attempt to fix one of the offending OSDs results in:

# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-109
*** Caught signal (Segmentation fault) **
 in thread 7f7258eb0ec0 thread_name:ceph-bluestore-
 ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)
 1: (()+0x4025e1) [0x557c65a7e5e1]
 2: (()+0xf630) [0x7f724dcb0630]
 3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x452) [0x557c65851992]
 4: (BlueFS::_replay(bool)+0x48d) [0x557c6586544d]
 5: (BlueFS::mount()+0x1d4) [0x557c65869034]
 6: (BlueStore::_open_db(bool)+0x1857) [0x557c658c06e7]
 7: (BlueStore::_fsck(bool, bool)+0x3c7) [0x557c658f65c7]
 8: (main()+0xf24) [0x557c657a8c24]
 9: (__libc_start_main()+0xf5) [0x7f724c681555]
 10: (()+0x1c444f) [0x557c6584044f]
2021-02-02 17:17:19.152141 7f7258eb0ec0 -1 *** Caught signal (Segmentation fault) **

# gdb ceph-bluestore-tool
(gdb) run fsck --path /var/lib/ceph/osd/ceph-109
...
Program received signal SIGSEGV, Segmentation fault.
0x0000555555729992 in BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*) ()
(gdb) bt
#0  0x0000555555729992 in BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*) ()
#1  0x000055555573d44d in BlueFS::_replay(bool) ()
#2  0x0000555555741034 in BlueFS::mount() ()
#3  0x00005555557986e7 in BlueStore::_open_db(bool) ()
#4  0x00005555557ce5c7 in BlueStore::_fsck(bool, bool) ()
#5  0x0000555555680c24 in main ()
(gdb) info locals
No symbol table info available.
(gdb) 
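The backtrace is unsymbolized ("No symbol table info available"), so the exact crash site inside BlueFS::_read() cannot be narrowed down further. With matching debug symbols installed, gdb can show line numbers and locals; a sketch, assuming the ceph-debuginfo RPM for 12.2.13 is available from the configured repository:

# yum install ceph-debuginfo-12.2.13
# gdb ceph-bluestore-tool
(gdb) run fsck --path /var/lib/ceph/osd/ceph-109
(gdb) bt full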

# ceph osd tree
...
-37        10.91638     host dch2-ceph-data-012                         
102   sbs   1.81940         osd.102                 up  1.00000 1.00000 
103   sbs   1.81940         osd.103                 up  1.00000 1.00000 
104   sbs   1.81940         osd.104                 up  1.00000 1.00000 
105   sbs   1.81940         osd.105                 up  1.00000 1.00000 
106   sbs   1.81940         osd.106                 up  1.00000 1.00000 
107   sbs   1.81940         osd.107                 up  1.00000 1.00000 
-40         5.45819     host dch2-ceph-data-013                         
108 gpssd   0.90970         osd.108               down        0 1.00000 
109 gpssd   0.90970         osd.109               down        0 1.00000 
110 gpssd   0.90970         osd.110               down        0 1.00000 
111 gpssd   0.90970         osd.111               down        0 1.00000 
112 gpssd   0.90970         osd.112               down        0 1.00000 
...

I will disconnect the affected servers from my production cluster tomorrow;
I can then isolate the servers and, if necessary,
reconfigure the network to move them into a staging cluster.


Files

initial-osd.108.cut.txt (482 KB) initial-osd.108.cut.txt initial problem Andreas Buschmann, 02/02/2021 04:46 PM
no-recovery-osd.108.txt (14 KB) no-recovery-osd.108.txt a try to recover fails Andreas Buschmann, 02/02/2021 04:56 PM
ceph.conf (890 Bytes) ceph.conf Andreas Buschmann, 02/02/2021 05:28 PM
ceph-osd.108.log.00.xz (686 KB) ceph-osd.108.log.00.xz 01 of 11 Andreas Buschmann, 02/07/2021 09:20 PM
ceph-osd.108.log.01.xz (688 KB) ceph-osd.108.log.01.xz 02 of 11 Andreas Buschmann, 02/07/2021 09:20 PM
ceph-osd.108.log.02.xz (691 KB) ceph-osd.108.log.02.xz 03 of 11 Andreas Buschmann, 02/07/2021 09:20 PM
ceph-osd.108.log.03.xz (700 KB) ceph-osd.108.log.03.xz 04 of 11 Andreas Buschmann, 02/07/2021 09:20 PM
ceph-osd.108.log.04.xz (693 KB) ceph-osd.108.log.04.xz 05 of 11 Andreas Buschmann, 02/07/2021 09:21 PM
ceph-osd.108.log.05.xz (697 KB) ceph-osd.108.log.05.xz 06 of 11 Andreas Buschmann, 02/07/2021 09:21 PM
ceph-osd.108.log.06.xz (688 KB) ceph-osd.108.log.06.xz 07 of 11 Andreas Buschmann, 02/07/2021 09:21 PM
ceph-osd.108.log.07.xz (693 KB) ceph-osd.108.log.07.xz 08 of 11 Andreas Buschmann, 02/07/2021 09:22 PM
ceph-osd.108.log.08.xz (696 KB) ceph-osd.108.log.08.xz 09 of 11 Andreas Buschmann, 02/07/2021 09:22 PM
ceph-osd.108.log.09.xz (702 KB) ceph-osd.108.log.09.xz 10 of 11 Andreas Buschmann, 02/07/2021 09:23 PM
ceph-osd.108.log.10.xz (154 KB) ceph-osd.108.log.10.xz 11 of 11 Andreas Buschmann, 02/07/2021 09:26 PM