Bug #46800

Bug #47751: Hybrid allocator might segfault when fallback allocator is present

Octopus OSD died and fails to start with FAILED ceph_assert(is_valid_io(off, len))

Added by Vitaliy Filippov over 3 years ago. Updated about 3 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi

One of my OSDs just died trying to write beyond the end of the device. Now it fails to start with the same assertion during _deferred_replay().

The LVM volume size is 0x37400000000, and the Bluestore device size is also 0x37400000000 according to `ceph-bluestore-tool bluefs-bdev-sizes -h --path /var/lib/ceph/osd/ceph-2/`. But when I attached to the OSD with gdb and looked at the `off` (offset) parameter in BlockDevice::is_valid_io, it was 0x3740003e000.

So the OSD was indeed trying to write beyond the device end.
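To make the arithmetic explicit, here is a minimal Python sketch of a bounds check in the spirit of is_valid_io. It is not the Ceph source. The device size and offset are the values from this report; the 4096-byte block size and the write length are assumptions, since the report does not state `len`.

```python
# Values taken from the report
DEVICE_SIZE = 0x37400000000   # LVM volume / Bluestore device size
OFFSET = 0x3740003e000        # `off` seen in gdb

# Assumptions for illustration only (not stated in the report)
BLOCK_SIZE = 4096             # assumed device block size
LENGTH = 4096                 # hypothetical write length


def is_valid_io(off, length, size=DEVICE_SIZE, block_size=BLOCK_SIZE):
    """Sketch of a bounds check: aligned, non-empty, and fully inside the device."""
    return (
        off % block_size == 0
        and length % block_size == 0
        and length > 0
        and off < size
        and off + length <= size
    )


# How far past the end of the device the write starts
overshoot = OFFSET - DEVICE_SIZE
print(hex(overshoot), overshoot // BLOCK_SIZE)   # 0x3e000 62
print(is_valid_io(OFFSET, LENGTH))               # False
```

Under these assumptions the offset is block-aligned but starts 0x3e000 bytes (62 blocks of 4 KiB) past the end of the device, so the check fails whatever the length is.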

You can find an excerpt from the OSD log in the attachment. It starts with the initial stack trace from the crash, followed by a number of repeated startup errors.

The assertion message looks like this:

/build/ceph-15.2.4/src/os/bluestore/KernelDevice.cc: 892: FAILED ceph_assert(is_valid_io(off, len))

I've backed up BlueFS of this OSD to a directory with bluefs-export for possible future reference... Now I'll probably recreate the OSD and pray that others don't die during backfill because I don't want a Cloudmouse here... which is especially important because of my EC 2+1. :-)

ceph-osd.2-excerpt.log (452 KB) Vitaliy Filippov, 07/31/2020 01:53 PM

fsck.log (18 KB) Vitaliy Filippov, 07/31/2020 03:32 PM


Related issues

Related to bluestore - Bug #48276: OSD Crash with ceph_assert(is_valid_io(off, len)) Duplicate

History

#1 Updated by Igor Fedotov over 3 years ago

Before redeploying the OSD, could you please set debug-bluestore to 20, restart the OSD, and collect the OSD log...

#2 Updated by Vitaliy Filippov over 3 years ago

Sorry, too late... :)

#3 Updated by Vitaliy Filippov over 3 years ago

Oh, I have one more thing to add. I tried to run fsck and it gave a slightly different error message.

#4 Updated by Igor Fedotov over 3 years ago

:(
The failure in fsck.log is IMO just another effect of the same issue.

One more question: are you using the default setting, bluestore allocator = hybrid? Or something custom?

#5 Updated by Vitaliy Filippov over 3 years ago

I had bluestore_min_alloc_size_ssd = 4096 before it became the default, but it seems that's all.

I copied RocksDB from this OSD before destroying it, so if you're interested in something specific I can try to dig for it.

#6 Updated by Igor Fedotov over 3 years ago

  • Related to Bug #48276: OSD Crash with ceph_assert(is_valid_io(off, len)) added

#7 Updated by Igor Fedotov about 3 years ago

  • Status changed from New to Duplicate
  • Parent task set to #47751
