Bug #52502
src/os/bluestore/BlueStore.cc: FAILED ceph_assert(collection_ref)
Status: Closed
Description
We're seeing some strange behaviour on the OSDs after a node reboot.
This doesn't affect all the OSDs on a cluster or node, but each time we reboot a cluster, some OSDs remain down afterwards (systemd unit up and running, but OSD status down).
Sometimes just restarting the systemd unit fixes the issue, but sometimes restarting the OSD triggers a crash.
We have this behaviour at the moment in the ceph-ansible and ceph-volume CI. I don't know if the cephadm CI is testing a reboot scenario.
The issue is present in both containerized and non-containerized deployments.
# ceph --version
ceph version 17.0.0-7404-g9c213a0d (9c213a0d08176f61a08275603cca2e8dcd86881e) quincy (dev)
We know for sure that the issue wasn't present on ceph@master up to and including commit 6436cc5e13174b8b206301b0a073b8a776eea490, which is one month old.
We don't see the issue (yet) on the stable releases (pacific, octopus, and nautilus).
# ceph crash info 2021-09-01T18:56:10.627063Z_49021609-253b-4fb9-a00d-f0880bf1b04b
{
    "assert_condition": "collection_ref",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc",
    "assert_func": "int BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)",
    "assert_line": 17679,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)' thread 7fb3adc930c0 time 2021-09-01T18:56:10.611315+0000\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc: 17679: FAILED ceph_assert(collection_ref)\n",
    "assert_thread_name": "ceph-osd",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7fb3abc39b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x556b66ff205c]",
        "/usr/bin/ceph-osd(+0x5d021f) [0x556b66ff221f]",
        "(BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)+0x71e) [0x556b6763f69e]",
        "(BlueStore::reconstruct_allocations(Allocator*, BlueStore::read_alloc_stats_t&)+0x22c) [0x556b67640adc]",
        "(BlueStore::read_allocation_from_drive_on_startup()+0x192) [0x556b67641232]",
        "(BlueStore::_init_alloc()+0xa01) [0x556b67642301]",
        "(BlueStore::_open_db_and_around(bool, bool)+0x2f4) [0x556b6768a6f4]",
        "(BlueStore::_mount()+0x1ae) [0x556b6768d46e]",
        "(OSD::init()+0x3ba) [0x556b6712d5ba]",
        "main()",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.0.0-7404-g9c213a0d",
    "crash_id": "2021-09-01T18:56:10.627063Z_49021609-253b-4fb9-a00d-f0880bf1b04b",
    "entity_name": "osd.0",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "ff4a164b9a3a2ff220b1330f6d5650bcd8e041893a55e3ae6cb0b0184be0b555",
    "timestamp": "2021-09-01T18:56:10.627063Z",
    "utsname_hostname": "osd0",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.3.1.el8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Jun 1 16:14:33 UTC 2021"
}
See the full osd crash log in attachment.
A ceph-ansible CI job failure can be found at https://2.jenkins.ceph.com/job/ceph-ansible-prs-centos-non_container-all_daemons/3152/artifact/logs/
Updated by Neha Ojha over 2 years ago
- Project changed from RADOS to bluestore
- Assignee set to Gabriel BenHanokh
-7> 2021-09-01T18:56:10.603+0000 7fb3adc930c0  5 asok(0x556b697fc000) register_command bluestore allocator fragmentation recovery hook 0x556b6a624600
-6> 2021-09-01T18:56:10.605+0000 7fb3adc930c0  5 bluestore::NCB::reconstruct_allocations::memory_target=4294967296, bdev_size=26839351296
-5> 2021-09-01T18:56:10.605+0000 7fb3adc930c0  5 bluestore::NCB::reconstruct_allocations::init_add_free(0, 26839351296)
-4> 2021-09-01T18:56:10.606+0000 7fb3adc930c0  5 bluestore::NCB::reconstruct_allocations::init_rm_free(0, 8192)
-3> 2021-09-01T18:56:10.606+0000 7fb3adc930c0  5 bluestore::NCB::reconstruct_allocations::calling read_allocation_from_onodes()
-2> 2021-09-01T18:56:10.609+0000 7fb3adc930c0 -1 bluestore::NCB::read_allocation_from_onodes::stray object #-1:3034e826:::osdmap.57:0# not owned by any collection
-1> 2021-09-01T18:56:10.618+0000 7fb3adc930c0 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)' thread 7fb3adc930c0 time 2021-09-01T18:56:10.611315+0000
/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7404-g9c213a0d/rpm/el8/BUILD/ceph-17.0.0-7404-g9c213a0d/src/os/bluestore/BlueStore.cc: 17679: FAILED ceph_assert(collection_ref)
 ceph version 17.0.0-7404-g9c213a0d (9c213a0d08176f61a08275603cca2e8dcd86881e) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x556b66ff1ffe]
 2: /usr/bin/ceph-osd(+0x5d021f) [0x556b66ff221f]
 3: (BlueStore::read_allocation_from_onodes(Allocator*, BlueStore::read_alloc_stats_t&)+0x71e) [0x556b6763f69e]
 4: (BlueStore::reconstruct_allocations(Allocator*, BlueStore::read_alloc_stats_t&)+0x22c) [0x556b67640adc]
 5: (BlueStore::read_allocation_from_drive_on_startup()+0x192) [0x556b67641232]
 6: (BlueStore::_init_alloc()+0xa01) [0x556b67642301]
 7: (BlueStore::_open_db_and_around(bool, bool)+0x2f4) [0x556b6768a6f4]
 8: (BlueStore::_mount()+0x1ae) [0x556b6768d46e]
 9: (OSD::init()+0x3ba) [0x556b6712d5ba]
 10: main()
 11: __libc_start_main()
 12: _start()
Updated by Neha Ojha over 2 years ago
- Subject changed from some OSDs don't come back online after a reboot or segfault on restart to src/os/bluestore/BlueStore.cc: FAILED ceph_assert(collection_ref)
Updated by Guillaume Abrioux over 2 years ago
- Priority changed from Normal to High
Increasing the priority since we hit this issue quite a lot in the ceph-volume CI.
Updated by Sebastian Wagner over 2 years ago
For SEO: this manifests itself as ceph-volume Jenkins failures:
_____________ TestOSDs.test_all_osds_are_up_and_in[ansible://osd0] _____________
[gw1] linux -- Python 3.6.8 /tmp/tox.YMG46ctet9/centos8-bluestore-create/bin/python

self = <tests.osd.test_osds.TestOSDs object at 0x7f9f97c93278>
node = {'address': '192.168.3.100', 'cluster_address': '192.168.4.200', 'cluster_name': 'test', 'conf_path': '/etc/ceph/test.conf', ...}
host = <testinfra.host.Host object at 0x7f9f97d4d0b8>

    def test_all_osds_are_up_and_in(self, node, host):
        cmd = "sudo ceph --cluster={cluster} --connect-timeout 5 --keyring /var/lib/ceph/bootstrap-osd/{cluster}.keyring -n client.bootstrap-osd osd tree -f json".format(  # noqa E501
            cluster=node["cluster_name"])
        output = json.loads(host.check_output(cmd))
>       assert node["num_osds"] == self._get_nb_up_osds_from_ids(node, output)
E       assert 3 == 2
E         -3
E         +2
Updated by Sebastian Wagner over 2 years ago
- Related to Bug #52138: os/bluestore/BlueStore.cc: FAILED ceph_assert(lcl_extnt_map[offset] == length) added
Updated by Neha Ojha over 2 years ago
- Status changed from New to Need More Info
Dimitri, are you able to reproduce this issue? We have merged several fixes in the area recently.
Updated by Adam Kupczyk about 2 years ago
- Status changed from Need More Info to Can't reproduce