Bug #46800
Bug #47751 (closed): Hybrid allocator might segfault when fallback allocator is present
Octopus OSD died and fails to start with FAILED ceph_assert(is_valid_io(off, len))
Description
Hi
One of my OSDs just died trying to write beyond the end of the device. Now it just fails to start with the same assertion during _deferred_replay().
The LVM volume size is 0x37400000000, and the BlueStore device size is also 0x37400000000 according to `ceph-bluestore-tool bluefs-bdev-sizes -h --path /var/lib/ceph/osd/ceph-2/`. However, when I attached to the OSD with gdb and looked at the `off` (offset) parameter in BlockDevice::is_valid_io, it was 0x3740003e000.
So the OSD was indeed trying to write beyond the device end.
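To make the arithmetic explicit, here is a minimal Python sketch of the kind of bounds check that an assertion like is_valid_io(off, len) implies. It is not the actual Ceph implementation. The device size and offset are the values quoted above. The 4 KiB block size and the I/O length are assumptions for illustration only, since the real length is not given in this report.

```python
# Values taken from the report.
DEVICE_SIZE = 0x37400000000      # LVM volume / BlueStore device size
FAILING_OFFSET = 0x3740003e000   # `off` observed in gdb

# Assumptions, not from the report: block size and I/O length.
ASSUMED_BLOCK_SIZE = 0x1000
ASSUMED_LEN = 0x1000


def is_valid_io(off, length, size=DEVICE_SIZE, block_size=ASSUMED_BLOCK_SIZE):
    """Simplified stand-in for a block-device I/O range check.

    The I/O must be block-aligned, non-empty, and lie entirely
    inside the device.
    """
    return (
        off % block_size == 0
        and length % block_size == 0
        and length > 0
        and off < size
        and off + length <= size
    )


# How far past the end of the device the offset lies.
overshoot = FAILING_OFFSET - DEVICE_SIZE

if __name__ == "__main__":
    print(f"offset is {hex(overshoot)} bytes ({overshoot // 1024} KiB) past the device end")
    print("valid I/O:", is_valid_io(FAILING_OFFSET, ASSUMED_LEN))
```

With these numbers the offset lies 0x3e000 bytes (248 KiB) past the end of the device, so the check fails for any length, whatever the real length was.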
You can find an excerpt from the OSD log in the attachment. It starts with the initial stack trace from the crash, followed by a number of repeated startup errors.
The assertion message looks like this:
/build/ceph-15.2.4/src/os/bluestore/KernelDevice.cc: 892: FAILED ceph_assert(is_valid_io(off, len))
I've backed up BlueFS of this OSD to a directory with bluefs-export for possible future reference... Now I'll probably recreate the OSD and pray that others don't die during backfill because I don't want a Cloudmouse here... which is especially important because of my EC 2+1. :-)
Files