Bug #42605
KernelDevice.cc: 688: FAILED assert(off % block_size == 0)
Status: Closed
Description
OSD failed to start after a server power-down
ceph version: v12.2.9
stack info:
/clove/vm/zstor/ceph/rpmbuild/BUILD/ceph-12.2.9/src/os/bluestore/KernelDevice.cc: In function 'virtual int KernelDevice::read(uint64_t, uint64_t, ceph::bufferlist*, IOContext*, bool)' thread 7f10ebd5ce40 time 2019-10-19 09:24:23.121071
/clove/vm/zstor/ceph/rpmbuild/BUILD/ceph-12.2.9/src/os/bluestore/KernelDevice.cc: 688: FAILED assert(off % block_size == 0)
ceph version 12.2.9-2-34-g8d920dc (8d920dcaaa949f3a08659d9db6e560ccb1896736) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f10e2612080]
2: (KernelDevice::read(unsigned long, unsigned long, ceph::buffer::list*, IOContext*, bool)+0x6af) [0x5602e540914f]
3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x646) [0x5602e520bad6]
4: (BlueFS::_replay(bool)+0x6cf) [0x5602e522326f]
5: (BlueFS::mount()+0x1e4) [0x5602e52273d4]
6: (open_bluefs(CephContext*, std::string const&, std::vector<std::string, std::allocator<std::string> > const&)+0x3c2) [0x5602e51fad42]
7: (main()+0x1f50) [0x5602e5164410]
8: (__libc_start_main()+0xf5) [0x7f10dfb10c05]
9: (()+0x1c231f) [0x5602e51fa31f]
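For context, the assert that aborts the OSD is an alignment precondition: KernelDevice uses direct I/O, so any read must start at a block-aligned offset with a block-aligned length. A minimal sketch of that kind of check (the constant and function name here are illustrative assumptions, not the actual Ceph code):

```cpp
#include <cstdint>

// Illustrative sketch, not the real KernelDevice::read. Direct I/O
// requires block-aligned offsets/lengths; the failed assert at
// KernelDevice.cc:688 is this style of precondition. A 4096-byte
// block size is assumed here.
constexpr uint64_t kBlockSize = 4096;

bool read_preconditions_hold(uint64_t off, uint64_t len) {
  // Both the starting offset and the length must be multiples
  // of the device block size, or the assert fires.
  return (off % kBlockSize == 0) && (len % kBlockSize == 0);
}
```

In this crash the offset handed down to the device read is bogus in the first place; the gdb session that follows traces where it comes from.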
Breakpoint 2, BlueFS::_read (this=this@entry=0x555556684300, h=h@entry=0x5555566ca680, buf=buf@entry=0x5555566ca688, off=off@entry=3883008,
len=len@entry=1077248, outbl=outbl@entry=0x7fffffffb9a0, out=out@entry=0x0) at /usr/src/debug/ceph-12.2.9/src/os/bluestore/BlueFS.cc:935
935 {
(gdb)
Continuing.
Breakpoint 2, BlueFS::_read (this=this@entry=0x555556684300, h=h@entry=0x5555566ca680, buf=buf@entry=0x5555566ca688, off=off@entry=4960256,
len=4096, outbl=outbl@entry=0x7fffffffb950, out=out@entry=0x0) at /usr/src/debug/ceph-12.2.9/src/os/bluestore/BlueFS.cc:935
935 {
(gdb)
Continuing.
Breakpoint 2, BlueFS::_read (this=this@entry=0x555556684300, h=h@entry=0x5555566ca680, buf=buf@entry=0x5555566ca688, off=off@entry=4964352, //4964352 + 720896 = 5685248
len=len@entry=720896, outbl=outbl@entry=0x7fffffffb9a0, out=out@entry=0x0) at /usr/src/debug/ceph-12.2.9/src/os/bluestore/BlueFS.cc:935
935 {
(gdb)
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x0000555555727ac3 in BlueFS::_read (this=this@entry=0x555556684300, h=h@entry=0x5555566ca680, buf=buf@entry=0x5555566ca688, off=5324800,
off@entry=4964352, len=360448, len@entry=720896, outbl=outbl@entry=0x7fffffffb9a0, out=out@entry=0x0)
at /usr/src/debug/ceph-12.2.9/src/os/bluestore/BlueFS.cc:975
975 cct->_conf->bluefs_buffered_io);
$3 = std::vector of length 2, capacity 2 = {{<AllocExtent> = {offset = 1618051072, length = 1130496}, bdev = 1 '\001'}, {<AllocExtent> = {
offset = 1409875968, length = 4194304}, bdev = 0 '\000'}} // total size: 1130496 + 4194304 = 5324800 < 5685248
(gdb) n
90 if ((int64_t) offset >= p->length) { //5324800 > 1130496
(gdb) p offset
$4 = 5324800
(gdb) p p->length
Attempt to take address of value not located in memory.
(gdb) n
91 offset -= p->length;
(gdb) p 5324800 - 1130496
$5 = 4194304
(gdb) n
89 while (p != extents.end()) {
(gdb) n
92 ++p;
(gdb) n
89 while (p != extents.end()) {
(gdb) n
91 offset -= p->length;
(gdb) n
89 while (p != extents.end()) {
(gdb) n
92 ++p;
(gdb) n
89 while (p != extents.end()) {
(gdb) p p
$6 = {<AllocExtent> = {offset = 93825012845280, length = 1449961104}, bdev = 85 'U'} // garbage values: the iterator has stepped past the bounds of the extents vector
(gdb) n
97 *x_off = offset;
(gdb) p offset
$7 = 0
(gdb) n
99 }
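The session above shows the failure mode: the extent-seek loop translates a logical file offset into (extent, offset-within-extent), but when the requested offset reaches the file's total allocated length (5324800 bytes here, while the replay read extends to 5685248), the iterator walks off the end of the list and the resulting mapping is meaningless. A simplified sketch of that logic, using illustrative types rather than the exact bluefs_fnode_t definitions:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Simplified stand-in for Ceph's extent records.
struct Extent {
  uint64_t offset;  // physical offset on the device
  uint32_t length;  // extent length in bytes
};

// Translate a logical file offset into (extent index, offset within
// that extent), mirroring the loop stepped through in gdb above.
// Returns false when `offset` lies at or beyond the total allocated
// length: the loop then exits with idx == extents.size() and the
// caller must not dereference that "extent".
bool seek(const std::vector<Extent>& extents, uint64_t offset,
          std::size_t* idx, uint64_t* x_off) {
  std::size_t i = 0;
  while (i != extents.size()) {
    if (offset >= extents[i].length) {
      offset -= extents[i].length;  // the "offset -= p->length" step
      ++i;
    } else {
      break;
    }
  }
  *idx = i;
  *x_off = offset;
  return i != extents.size();
}
```

With the two extents dumped above (lengths 1130496 and 4194304, total 5324800) and the faulting logical offset 5324800, the loop lands exactly at the end of the list with x_off == 0, matching the `$7 = 0` printed in the session; reading through that end iterator is what produces the garbage offset and, eventually, the unaligned-offset assert or SIGSEGV.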
Files
Updated by 黄 维 over 4 years ago
- File ceph-osd.13.log.7z ceph-osd.13.log.7z added
Updated by Igor Fedotov over 4 years ago
Looks like bluefs replay tries to read an out-of-bound extent (#3, while just 2 are present) for the log file (aka ino 1) in an attempt to locate the log tail. Most probably this is an error in the replay logic.
Would you be able to attach binary dumps of the following regions, so we can inspect the bluefs log content and have a repro to verify a fix:
DB device:
0x60718000+114000
WAL device:
0x54090000+400000
Updated by 黄 维 over 4 years ago
Igor Fedotov wrote:
Looks like bluefs replay tries to read an out-of-bound extent (#3, while just 2 are present) for the log file (aka ino 1) in an attempt to locate the log tail. Most probably this is an error in the replay logic.
Would you be able to attach binary dumps of the following regions, so we can inspect the bluefs log content and have a repro to verify a fix:
DB device:
0x60718000+114000
WAL device:
0x54090000+400000
Sorry, I can't. The OSD has been accidentally destroyed.