Bug #42605

closed

KernelDevice.cc: 688: FAILED assert(off % block_size == 0)

Added by 黄 维 over 4 years ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The OSD failed to start after a server power loss.

ceph version: v12.2.9

Stack trace:
/clove/vm/zstor/ceph/rpmbuild/BUILD/ceph-12.2.9/src/os/bluestore/KernelDevice.cc: In function 'virtual int KernelDevice::read(uint64_t, uint64_t, ceph::bufferlist*, IOContext*, bool)' thread 7f10ebd5ce40 time 2019-10-19 09:24:23.121071
/clove/vm/zstor/ceph/rpmbuild/BUILD/ceph-12.2.9/src/os/bluestore/KernelDevice.cc: 688: FAILED assert(off % block_size == 0)
ceph version 12.2.9-2-34-g8d920dc (8d920dcaaa949f3a08659d9db6e560ccb1896736) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f10e2612080]
2: (KernelDevice::read(unsigned long, unsigned long, ceph::buffer::list*, IOContext*, bool)+0x6af) [0x5602e540914f]
3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x646) [0x5602e520bad6]
4: (BlueFS::_replay(bool)+0x6cf) [0x5602e522326f]
5: (BlueFS::mount()+0x1e4) [0x5602e52273d4]
6: (open_bluefs(CephContext*, std::string const&, std::vector<std::string, std::allocator<std::string> > const&)+0x3c2) [0x5602e51fad42]
7: (main()+0x1f50) [0x5602e5164410]
8: (__libc_start_main()+0xf5) [0x7f10dfb10c05]
9: (()+0x1c231f) [0x5602e51fa31f]
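
For context, the assertion that fired is an I/O-alignment guard near the top of KernelDevice::read(). The standalone sketch below paraphrases those luminous-era checks (the function name check_read_args and the sample arguments are illustrative, not from the ticket):

#include <cassert>
#include <cstdint>
#include <cstdio>

// Sketch of the invariants KernelDevice::read() asserts in luminous
// (paraphrased, not a verbatim copy of the source): raw-device reads must
// be block-aligned in both offset and length, and must stay in bounds.
// block_size is typically 4096, so the failed assert at KernelDevice.cc:688
// means BlueFS replay computed a read offset that is not 4 KiB-aligned.
static void check_read_args(uint64_t off, uint64_t len,
                            uint64_t block_size, uint64_t dev_size) {
  assert(off % block_size == 0);   // KernelDevice.cc:688, the assert that fired
  assert(len % block_size == 0);
  assert(len > 0);
  assert(off + len <= dev_size);
}

int main() {
  check_read_args(4960256, 4096, 4096, 1ull << 30);  // aligned: passes
  printf("aligned read ok\n");
  // An unaligned offset, as produced by the broken replay state, would
  // abort here:
  // check_read_args(5324801, 4096, 4096, 1ull << 30);
  return 0;
}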

Breakpoint 2, BlueFS::_read (this=this@entry=0x555556684300, h=h@entry=0x5555566ca680, buf=buf@entry=0x5555566ca688, off=off@entry=3883008,
len=len@entry=1077248, outbl=outbl@entry=0x7fffffffb9a0, out=out@entry=0x0) at /usr/src/debug/ceph-12.2.9/src/os/bluestore/BlueFS.cc:935
935 {
(gdb)
Continuing.

Breakpoint 2, BlueFS::_read (this=this@entry=0x555556684300, h=h@entry=0x5555566ca680, buf=buf@entry=0x5555566ca688, off=off@entry=4960256,
len=4096, outbl=outbl@entry=0x7fffffffb950, out=out@entry=0x0) at /usr/src/debug/ceph-12.2.9/src/os/bluestore/BlueFS.cc:935
935 {
(gdb)
Continuing.

Breakpoint 2, BlueFS::_read (this=this@entry=0x555556684300, h=h@entry=0x5555566ca680, buf=buf@entry=0x5555566ca688, off=off@entry=4964352, //4964352 + 720896 = 5685248
len=len@entry=720896, outbl=outbl@entry=0x7fffffffb9a0, out=out@entry=0x0) at /usr/src/debug/ceph-12.2.9/src/os/bluestore/BlueFS.cc:935
935 {
(gdb)
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000555555727ac3 in BlueFS::_read (this=this@entry=0x555556684300, h=h@entry=0x5555566ca680, buf=buf@entry=0x5555566ca688, off=5324800,
off@entry=4964352, len=360448, len@entry=720896, outbl=outbl@entry=0x7fffffffb9a0, out=out@entry=0x0)
at /usr/src/debug/ceph-12.2.9/src/os/bluestore/BlueFS.cc:975
975 cct->_conf->bluefs_buffered_io);

$3 = std::vector of length 2, capacity 2 = {{<AllocExtent> = {offset = 1618051072, length = 1130496}, bdev = 1 '\001'}, {<AllocExtent> = {
offset = 1409875968, length = 4194304}, bdev = 0 '\000'}} // total allocated: 1130496 + 4194304 = 5324800 < 5685248 (the off + len being replayed)
(gdb) n
90 if ((int64_t) offset >= p->length) { //5324800 > 1130496
(gdb) p offset
$4 = 5324800
(gdb) p p->length
Attempt to take address of value not located in memory.
(gdb) n
91 offset -= p->length;
(gdb) p 5324800 - 1130496
$5 = 4194304
(gdb) n
89 while (p != extents.end()) {
(gdb) n
92 ++p;
(gdb) n
89 while (p != extents.end()) {
(gdb) n
91 offset -= p->length;
(gdb) n
89 while (p != extents.end()) {
(gdb) n
92 ++p;
(gdb) n
89 while (p != extents.end()) {
(gdb) p p
$6 = {<AllocExtent> = {offset = 93825012845280, length = 1449961104}, bdev = 85 'U'} // garbage values (offset ~85 TiB): the iterator has stepped past the end of the extents vector
(gdb) n
97 *x_off = offset;
(gdb) p offset
$7 = 0
(gdb) n
99 }
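
To make the failure mode concrete, here is a minimal, self-contained sketch paraphrasing the luminous-era bluefs_fnode_t::seek() loop stepped through above (the Extent struct is simplified and the code is an illustration, not a verbatim copy of the Ceph source; the extent values come from the $3 dump):

#include <cstdint>
#include <cstdio>
#include <vector>

struct Extent {
  uint64_t offset;  // physical offset on the device
  uint64_t length;  // extent length in bytes
  uint8_t bdev;     // block device index
};

// Paraphrase of bluefs_fnode_t::seek(): map a logical file offset to the
// extent containing it, returning the iterator plus the offset within it.
std::vector<Extent>::iterator seek(std::vector<Extent> &extents,
                                   uint64_t offset, uint64_t *x_off) {
  auto p = extents.begin();
  while (p != extents.end()) {
    if ((int64_t)offset >= (int64_t)p->length) {
      offset -= p->length;
      ++p;
    } else {
      break;
    }
  }
  *x_off = offset;
  return p;  // if offset >= total allocated size, this is extents.end()
}

int main() {
  // The two extents gdb printed for ino 1 ($3 above): 1130496 + 4194304
  // bytes, i.e. 5324800 bytes allocated in total.
  std::vector<Extent> extents = {{1618051072, 1130496, 1},
                                 {1409875968, 4194304, 0}};
  uint64_t x_off = 0;
  // Replay advanced the read offset to exactly 5324800 (the crash frame
  // above), so the loop consumes both extents and returns end() with
  // x_off == 0, matching $7 = 0 in the gdb session. The caller in
  // BlueFS::_read then dereferences the iterator (p->bdev, p->offset)
  // around BlueFS.cc:975, which is the SIGSEGV.
  auto p = seek(extents, 5324800, &x_off);
  if (p == extents.end())
    printf("seek() returned end(); dereferencing it is the crash\n");
  return 0;
}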


Files

ceph-osd.13.log.7z (156 KB), 黄 维, 11/04/2019 02:49 AM
#2

Updated by Igor Fedotov over 4 years ago

Looks like BlueFS replay tries to read an out-of-bounds extent (extent #3, while just 2 are present for the log file, aka ino 1) in an attempt to locate the log tail. Most probably this is an error in the replay logic.

Would you be able to attach binary dumps of the following regions, so we can inspect the bluefs log content and have a repro to verify a fix:

DB device:
0x60718000+114000
WAL device:
0x54090000+400000
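
For anyone with access to a similarly failed OSD, a minimal sketch of how such region dumps could be produced with pread(2); the device path and output filename below are hypothetical, and the numbers are read as hex (0x60718000+0x114000 matches the first log extent in the $3 dump above):

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  const char *dev = "/dev/sdX2";     // hypothetical DB device path
  const uint64_t off = 0x60718000;   // region offset requested above
  const uint64_t len = 0x114000;     // region length (assuming hex)
  int fd = open(dev, O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }
  std::vector<char> buf(len);
  if (pread(fd, buf.data(), len, off) != (ssize_t)len) {
    perror("pread");
    close(fd);
    return 1;
  }
  FILE *out = fopen("db_region.dump", "wb");  // hypothetical output name
  if (!out) { perror("fopen"); close(fd); return 1; }
  fwrite(buf.data(), 1, len, out);
  fclose(out);
  close(fd);
  return 0;
}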

#3

Updated by 黄 维 over 4 years ago

Igor Fedotov wrote:

Looks like BlueFS replay tries to read an out-of-bounds extent (extent #3, while just 2 are present for the log file, aka ino 1) in an attempt to locate the log tail. Most probably this is an error in the replay logic.

Would you be able to attach binary dumps of the following regions, so we can inspect the bluefs log content and have a repro to verify a fix:

DB device:
0x60718000+114000
WAL device:
0x54090000+400000

Sorry, I can't. The OSD has been accidentally destroyed.

#5

Updated by Igor Fedotov about 1 year ago

  • Status changed from New to Closed

Outdated
