Bug #52502 (closed): src/os/bluestore/BlueStore.cc: FAILED ceph_assert(collection_ref)

Added by Dimitri Savineau over 2 years ago. Updated about 2 years ago.

Status: Can't reproduce
Priority: High
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We're seeing some strange behaviour on the OSDs after a node reboot.

This doesn't affect all the OSDs on a cluster or node, but each time we reboot a cluster, some OSDs remain down afterwards (systemd unit up and running, but OSD status down).

Sometimes just restarting the systemd unit fixes the issue, but sometimes restarting the OSD triggers an OSD crash.

We're currently seeing this behaviour in the ceph-ansible and ceph-volume CI. I don't know whether the cephadm CI tests a reboot scenario.

The issue is present in both containerized and non-containerized deployments.

# ceph --version
ceph version 17.0.0-7404-g9c213a0d (9c213a0d08176f61a08275603cca2e8dcd86881e) quincy (dev)

We know for sure that the issue wasn't present on ceph@master up to and including commit 6436cc5e13174b8b206301b0a073b8a776eea490, which is one month old.

We don't see the issue (yet) on the stable releases (pacific, octopus and nautilus).

# ceph crash info 2021-09-01T18:56:10.627063Z_49021609-253b-4fb9-a00d-f0880bf1b04b
{
    "assert_condition": "collection_ref",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc",
    "assert_func": "int BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)",
    "assert_line": 17679,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)' thread 7fb3adc930c0 time 2021-09-01T18:56:10.611315+0000\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc: 17679: FAILED ceph_assert(collection_ref)\n",
    "assert_thread_name": "ceph-osd",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7fb3abc39b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x556b66ff205c]",
        "/usr/bin/ceph-osd(+0x5d021f) [0x556b66ff221f]",
        "(BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)+0x71e) [0x556b6763f69e]",
        "(BlueStore::reconstruct_allocations(Allocator*, BlueStore::read_alloc_stats_t&)+0x22c) [0x556b67640adc]",
        "(BlueStore::read_allocation_from_drive_on_startup()+0x192) [0x556b67641232]",
        "(BlueStore::_init_alloc()+0xa01) [0x556b67642301]",
        "(BlueStore::_open_db_and_around(bool, bool)+0x2f4) [0x556b6768a6f4]",
        "(BlueStore::_mount()+0x1ae) [0x556b6768d46e]",
        "(OSD::init()+0x3ba) [0x556b6712d5ba]",
        "main()",
        "__libc_start_main()",
        "_start()" 
    ],
    "ceph_version": "17.0.0-7404-g9c213a0d",
    "crash_id": "2021-09-01T18:56:10.627063Z_49021609-253b-4fb9-a00d-f0880bf1b04b",
    "entity_name": "osd.0",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "ff4a164b9a3a2ff220b1330f6d5650bcd8e041893a55e3ae6cb0b0184be0b555",
    "timestamp": "2021-09-01T18:56:10.627063Z",
    "utsname_hostname": "osd0",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.3.1.el8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Jun 1 16:14:33 UTC 2021" 
}

See the full osd crash log in attachment.

A ceph-ansible CI job failure can be found at https://2.jenkins.ceph.com/job/ceph-ansible-prs-centos-non_container-all_daemons/3152/artifact/logs/


Files

osd_crash.log (171 KB) - Dimitri Savineau, 09/03/2021 03:29 PM

Related issues 1 (0 open, 1 closed)

Related to bluestore - Bug #52138: os/bluestore/BlueStore.cc: FAILED ceph_assert(lcl_extnt_map[offset] == length) - Resolved - Gabriel BenHanokh

Actions #1

Updated by Neha Ojha over 2 years ago

  • Project changed from RADOS to bluestore
  • Assignee set to Gabriel BenHanokh
    -7> 2021-09-01T18:56:10.603+0000 7fb3adc930c0  5 asok(0x556b697fc000) register_command bluestore allocator fragmentation recovery hook 0x556b6a624600
    -6> 2021-09-01T18:56:10.605+0000 7fb3adc930c0  5 bluestore::NCB::reconstruct_allocations::memory_target=4294967296, bdev_size=26839351296
    -5> 2021-09-01T18:56:10.605+0000 7fb3adc930c0  5 bluestore::NCB::reconstruct_allocations::init_add_free(0, 26839351296)
    -4> 2021-09-01T18:56:10.606+0000 7fb3adc930c0  5 bluestore::NCB::reconstruct_allocations::init_rm_free(0, 8192)
    -3> 2021-09-01T18:56:10.606+0000 7fb3adc930c0  5 bluestore::NCB::reconstruct_allocations::calling read_allocation_from_onodes()
    -2> 2021-09-01T18:56:10.609+0000 7fb3adc930c0 -1 bluestore::NCB::read_allocation_from_onodes::stray object #-1:3034e826:::osdmap.57:0# not owned by any collection
    -1> 2021-09-01T18:56:10.618+0000 7fb3adc930c0 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)' thread 7fb3adc930c0 time 2021-09-01T18:56:10.611315+0000
/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc: 17679: FAILED ceph_assert(collection_ref)

 ceph version 17.0.0-7404-g9c213a0d (9c213a0d08176f61a08275603cca2e8dcd86881e) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x556b66ff1ffe]
 2: /usr/bin/ceph-osd(+0x5d021f) [0x556b66ff221f]
 3: (BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)+0x71e) [0x556b6763f69e]
 4: (BlueStore::reconstruct_allocations(Allocator*, BlueStore::read_alloc_stats_t&)+0x22c) [0x556b67640adc]
 5: (BlueStore::read_allocation_from_drive_on_startup()+0x192) [0x556b67641232]
 6: (BlueStore::_init_alloc()+0xa01) [0x556b67642301]
 7: (BlueStore::_open_db_and_around(bool, bool)+0x2f4) [0x556b6768a6f4]
 8: (BlueStore::_mount()+0x1ae) [0x556b6768d46e]
 9: (OSD::init()+0x3ba) [0x556b6712d5ba]
 10: main()
 11: __libc_start_main()
 12: _start()
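
To make the failure mode concrete: the log above shows the NCB recovery path rebuilding allocator state from on-disk onodes and then aborting because an onode (the "stray object ... not owned by any collection") resolves to no collection. Below is a minimal, self-contained sketch of that pattern; every type and name here is a simplified stand-in for illustration, not the actual BlueStore code.

// Illustrative sketch only: simplified stand-ins, not the real BlueStore types or API.
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Collection {};                               // stand-in for a BlueStore collection
using CollectionRef = std::shared_ptr<Collection>;

struct Onode {                                      // stand-in for an on-disk object record
    std::string coll_id;                            // id of the collection that should own it
    uint64_t offset = 0;                            // extent start
    uint64_t length = 0;                            // extent length
};

// Walk every onode and record its extents as allocated, resolving the owning
// collection first. An onode whose collection no longer exists (a "stray
// object ... not owned by any collection") yields a null reference, which is
// the condition reported by the FAILED ceph_assert(collection_ref) above.
void rebuild_allocations(const std::vector<Onode>& onodes,
                         const std::map<std::string, CollectionRef>& collections,
                         std::vector<std::pair<uint64_t, uint64_t>>& used_extents)
{
    for (const auto& onode : onodes) {
        auto it = collections.find(onode.coll_id);
        CollectionRef collection_ref = (it == collections.end()) ? CollectionRef() : it->second;
        assert(collection_ref);                     // aborts the OSD when the object is stray
        used_extents.emplace_back(onode.offset, onode.length);
    }
}
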
Actions #2

Updated by Neha Ojha over 2 years ago

  • Subject changed from some OSDs don't come back online after a reboot or segfault on restart to src/os/bluestore/BlueStore.cc: FAILED ceph_assert(collection_ref)
Actions #3

Updated by Guillaume Abrioux over 2 years ago

  • Priority changed from Normal to High

Increasing the priority since we hit this issue quite a lot in the ceph-volume CI.

Actions #4

Updated by Sebastian Wagner over 2 years ago

For SEO: this manifests itself as ceph-volume Jenkins failures:

https://jenkins.ceph.com/job/ceph-volume-prs-lvm-centos8-bluestore-create/258/consoleFull#-245458114c19247c4-fcb7-4c61-9a5d-7e2b9731c678

_____________ TestOSDs.test_all_osds_are_up_and_in[ansible://osd0] _____________
[gw1] linux -- Python 3.6.8 /tmp/tox.YMG46ctet9/centos8-bluestore-create/bin/python

self = <tests.osd.test_osds.TestOSDs object at 0x7f9f97c93278>
node = {'address': '192.168.3.100', 'cluster_address': '192.168.4.200', 'cluster_name': 'test', 'conf_path': '/etc/ceph/test.conf', ...}
host = <testinfra.host.Host object at 0x7f9f97d4d0b8>

    def test_all_osds_are_up_and_in(self, node, host):
        cmd = "sudo ceph --cluster={cluster} --connect-timeout 5 --keyring /var/lib/ceph/bootstrap-osd/{cluster}.keyring -n client.bootstrap-osd osd tree -f json".format(  # noqa E501
            cluster=node["cluster_name"])
        output = json.loads(host.check_output(cmd))
>       assert node["num_osds"] == self._get_nb_up_osds_from_ids(node, output)
E       assert 3 == 2
E         -3
E         +2
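
The assertion output above shows only 2 of the node's 3 expected OSDs reporting up after the reboot, which is the same failure mode described in this issue.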
Actions #5

Updated by Sebastian Wagner over 2 years ago

  • Related to Bug #52138: os/bluestore/BlueStore.cc: FAILED ceph_assert(lcl_extnt_map[offset] == length) added
Actions #6

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info

Dimitri, are you able to reproduce this issue? We have merged several fixes in the area recently.

Actions #7

Updated by Adam Kupczyk about 2 years ago

  • Status changed from Need More Info to Can't reproduce