Bug #46800: Octopus OSD died and fails to start with FAILED ceph_assert(is_valid_io(off, len)) - bluestore - Ceph

Bug #46800

closed

Bug #47751: Hybrid allocator might segfault when fallback allocator is present

Octopus OSD died and fails to start with FAILED ceph_assert(is_valid_io(off, len))

Added by Vitaliy Filippov over 3 years ago. Updated over 3 years ago.

Status: Duplicate

Priority: Normal

Severity: 1 - critical

Affected Versions: Ceph - v15.2.4

Description

One of my OSDs just died while trying to write beyond the end of the device. It now fails to start, hitting the same assertion during _deferred_replay().

The LVM volume size is 0x37400000000, and the BlueStore device size is also 0x37400000000 according to `ceph-bluestore-tool bluefs-bdev-sizes -h --path /var/lib/ceph/osd/ceph-2/`. However, when I attached to the OSD with gdb and looked at the `off` (offset) parameter in BlockDevice::is_valid_io, it was 0x3740003e000.

So the OSD was indeed trying to write beyond the device end: the offset is 0x3e000 bytes (248 KiB) past it.
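
The arithmetic can be checked with a short sketch. This is a simplified model of a block-device bounds check, not the actual Ceph code: the function body, the 4 KiB block size, and the 4 KiB I/O length are assumptions made for illustration. Only the device size and the failing offset come from the report.

```python
# Simplified, hypothetical model of a block-device bounds check.
# It is NOT copied from Ceph; it only illustrates why the reported
# offset cannot pass any reasonable "is this I/O inside the device" test.

DEVICE_SIZE = 0x37400000000      # from the report (LVM volume / BlueStore device size)
FAILING_OFFSET = 0x3740003e000   # from the report (`off` seen in gdb)
ASSUMED_BLOCK_SIZE = 0x1000      # assumption: 4 KiB blocks
ASSUMED_LENGTH = 0x1000          # assumption: the real `len` was not reported


def is_valid_io(off, length, size=DEVICE_SIZE, block_size=ASSUMED_BLOCK_SIZE):
    """Return True if [off, off + length) is block-aligned and inside the device."""
    return (
        off % block_size == 0
        and length % block_size == 0
        and length > 0
        and off < size
        and off + length <= size
    )


# How far past the end of the device the write starts.
overshoot = FAILING_OFFSET - DEVICE_SIZE

print(hex(overshoot), overshoot // 1024, "KiB")      # 0x3e000 248 KiB
print(is_valid_io(FAILING_OFFSET, ASSUMED_LENGTH))   # False
```

The offset itself is 4 KiB-aligned, so under these assumptions the check fails purely because the write starts past the end of the device, whatever the real length was.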

You can find an excerpt from the OSD log in the attachment. It starts with the initial stack trace from the crash, followed by a number of repeated startup errors.

The assertion message looks like this:

/build/ceph-15.2.4/src/os/bluestore/KernelDevice.cc: 892: FAILED ceph_assert(is_valid_io(off, len))

I've backed up the BlueFS of this OSD to a directory with bluefs-export for possible future reference. Now I'll probably recreate the OSD and pray that the others don't die during backfill, because I don't want a Cloudmouse here, which is especially important because of my EC 2+1. :-)

Files

ceph-osd.2-excerpt.log (452 KB), Vitaliy Filippov, 07/31/2020 01:53 PM
fsck.log (18 KB), Vitaliy Filippov, 07/31/2020 03:32 PM