Bug #52502

closed

src/os/bluestore/BlueStore.cc: FAILED ceph_assert(collection_ref)

Added by Dimitri Savineau over 2 years ago. Updated about 2 years ago.

Status: Can't reproduce
Priority: High
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We're seeing some strange behaviour on the OSDs after a node reboot.

This doesn't affect every OSD on a cluster or node, but each time we reboot a cluster, some OSDs remain down afterwards (the systemd unit is up and running, but the OSD status is down).

Sometimes just restarting the systemd unit fixes the issue, but sometimes restarting the OSD triggers an OSD crash.
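The symptom described above (unit running, OSD reported down) can be spotted from the cluster side. The following is a minimal sketch, assuming the JSON produced by `ceph osd tree --format json` has a `nodes` list whose OSD entries carry `name`, `type` and `status` fields. The sample data and the helper name `down_osds` are hypothetical and are not taken from this report.

```python
from typing import List

# Hypothetical sample shaped like `ceph osd tree --format json` output.
# These values are illustrative only; they do not come from the affected cluster.
SAMPLE_TREE = {
    "nodes": [
        {"id": -1, "name": "default", "type": "root"},
        {"id": -3, "name": "osd0", "type": "host"},
        {"id": 0, "name": "osd.0", "type": "osd", "status": "down"},
        {"id": 1, "name": "osd.1", "type": "osd", "status": "up"},
        {"id": 2, "name": "osd.2", "type": "osd", "status": "down"},
    ],
    "stray": [],
}


def down_osds(osd_tree: dict) -> List[str]:
    """Return the names of all OSD nodes whose status is 'down'."""
    return sorted(
        node["name"]
        for node in osd_tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("status") == "down"
    )


if __name__ == "__main__":
    print(down_osds(SAMPLE_TREE))
```

Comparing this list against the systemd unit state on each node would show which OSDs are in the "unit running, OSD down" state.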

We currently see this behaviour in the ceph-ansible and ceph-volume CI. I don't know whether the cephadm CI tests a reboot scenario.

The issue is present in both containerized and non-containerized deployments.

# ceph --version
ceph version 17.0.0-7404-g9c213a0d (9c213a0d08176f61a08275603cca2e8dcd86881e) quincy (dev)

We know for sure that the issue wasn't present on ceph@master up to and including commit 6436cc5e13174b8b206301b0a073b8a776eea490, which is one month old.

We don't see the issue (yet) on the stable releases (pacific, octopus and nautilus).

# ceph crash info 2021-09-01T18:56:10.627063Z_49021609-253b-4fb9-a00d-f0880bf1b04b
{
    "assert_condition": "collection_ref",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc",
    "assert_func": "int BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)",
    "assert_line": 17679,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)' thread 7fb3adc930c0 time 2021-09-01T18:56:10.611315+0000\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc: 17679: FAILED ceph_assert(collection_ref)\n",
    "assert_thread_name": "ceph-osd",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7fb3abc39b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x556b66ff205c]",
        "/usr/bin/ceph-osd(+0x5d021f) [0x556b66ff221f]",
        "(BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)+0x71e) [0x556b6763f69e]",
        "(BlueStore::reconstruct_allocations(Allocator*, BlueStore::read_alloc_stats_t&)+0x22c) [0x556b67640adc]",
        "(BlueStore::read_allocation_from_drive_on_startup()+0x192) [0x556b67641232]",
        "(BlueStore::_init_alloc()+0xa01) [0x556b67642301]",
        "(BlueStore::_open_db_and_around(bool, bool)+0x2f4) [0x556b6768a6f4]",
        "(BlueStore::_mount()+0x1ae) [0x556b6768d46e]",
        "(OSD::init()+0x3ba) [0x556b6712d5ba]",
        "main()",
        "__libc_start_main()",
        "_start()" 
    ],
    "ceph_version": "17.0.0-7404-g9c213a0d",
    "crash_id": "2021-09-01T18:56:10.627063Z_49021609-253b-4fb9-a00d-f0880bf1b04b",
    "entity_name": "osd.0",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "ff4a164b9a3a2ff220b1330f6d5650bcd8e041893a55e3ae6cb0b0184be0b555",
    "timestamp": "2021-09-01T18:56:10.627063Z",
    "utsname_hostname": "osd0",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.3.1.el8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Jun 1 16:14:33 UTC 2021" 
}
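For triage, the crash report above can be reduced to the failed assertion and the BlueStore call chain that led to it. This is a minimal sketch that parses an abbreviated copy of the JSON shown above (long paths and unrelated fields are left out for brevity); the helper name `summarize_crash` is hypothetical.

```python
import json

# Abbreviated copy of the `ceph crash info` output above.
# Only the fields used below are kept.
CRASH_JSON = """
{
    "assert_condition": "collection_ref",
    "assert_func": "int BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)",
    "assert_line": 17679,
    "backtrace": [
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x556b66ff205c]",
        "(BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)+0x71e) [0x556b6763f69e]",
        "(BlueStore::reconstruct_allocations(Allocator*, BlueStore::read_alloc_stats_t&)+0x22c) [0x556b67640adc]",
        "(BlueStore::read_allocation_from_drive_on_startup()+0x192) [0x556b67641232]",
        "(BlueStore::_init_alloc()+0xa01) [0x556b67642301]",
        "(BlueStore::_open_db_and_around(bool, bool)+0x2f4) [0x556b6768a6f4]",
        "(BlueStore::_mount()+0x1ae) [0x556b6768d46e]",
        "(OSD::init()+0x3ba) [0x556b6712d5ba]",
        "main()"
    ],
    "ceph_version": "17.0.0-7404-g9c213a0d",
    "entity_name": "osd.0"
}
"""


def summarize_crash(report: dict, component: str = "BlueStore::") -> dict:
    """Extract the assertion and the frames belonging to one component."""
    frames = [
        # Drop the leading "(" and everything from the argument list onwards,
        # leaving only the qualified function name.
        frame.lstrip("(").split("(", 1)[0]
        for frame in report.get("backtrace", [])
        if component in frame
    ]
    return {
        "entity": report.get("entity_name"),
        "version": report.get("ceph_version"),
        "assert": report.get("assert_condition"),
        "line": report.get("assert_line"),
        "frames": frames,
    }


if __name__ == "__main__":
    summary = summarize_crash(json.loads(CRASH_JSON))
    print(f'{summary["entity"]} ({summary["version"]}): '
          f'ceph_assert({summary["assert"]}) at line {summary["line"]}')
    # Frames are listed innermost first, as in the backtrace.
    for name in summary["frames"]:
        print("  " + name)
```

Read from the bottom up, the frame list shows that the assert fires during OSD mount, while the allocation map is being rebuilt from onodes at startup.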

See the full OSD crash log in the attachment.

A ceph-ansible CI job failure can be found at https://2.jenkins.ceph.com/job/ceph-ansible-prs-centos-non_container-all_daemons/3152/artifact/logs/


Files

osd_crash.log (171 KB), Dimitri Savineau, 09/03/2021 03:29 PM

Related issues 1 (0 open, 1 closed)

Related to bluestore - Bug #52138: os/bluestore/BlueStore.cc: FAILED ceph_assert(lcl_extnt_map[offset] == length) (Resolved, Gabriel BenHanokh)
