Bug #52464

FAILED ceph_assert(current_shard->second->valid())

Added by Jeff Layton over 1 year ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've got a cephadm cluster I use for testing, and this morning one of the OSDs crashed down in bluestore code:

Aug 31 09:51:40 cephadm2 ceph-osd[20497]: get compressor snappy = 0x55b3c18b1b90
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: bluestore(/var/lib/ceph/osd/ceph-0) _open_fm::NCB::freelist_type=null
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: freelist init
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: freelist _read_cfg
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: asok(0x55b3c09f0000) register_command bluestore allocator dump block hook 0x55b3c18b1ef0
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: asok(0x55b3c09f0000) register_command bluestore allocator score block hook 0x55b3c18b1ef0
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: asok(0x55b3c09f0000) register_command bluestore allocator fragmentation block hook 0x55b3c18b1ef0
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: bluestore::NCB::restore_allocator::file_size=0,sizeof(extent_t)=16
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: bluestore::NCB::restore_allocator::No Valid allocation info on disk (empty file)
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: bluestore(/var/lib/ceph/osd/ceph-0) _init_alloc::NCB::restore_allocator() failed!
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: bluestore(/var/lib/ceph/osd/ceph-0) _init_alloc::NCB::Run Full Recovery from ONodes (might take a while) ...
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: bluestore::NCB::read_allocation_from_drive_on_startup::Start Allocation Recovery from ONodes ...
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7195-g7e7326c4/rpm/el8/BUILD/ceph-17.0.0-7195-g7e7326c4/src/kv/RocksDBStore.cc: In function 'bool WholeMergeIteratorImpl::is_main_smaller()' thread 7f2d60f480c0 time 2021-08-31T13:51:40.899594+0000
                                          /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7195-g7e7326c4/rpm/el8/BUILD/ceph-17.0.0-7195-g7e7326c4/src/kv/RocksDBStore.cc: 2288: FAILED ceph_assert(current_shard->second->valid())

                                           ceph version 17.0.0-7195-g7e7326c4 (7e7326c4231f614aff0f7bef4d72beadce6a9c75) quincy (dev)
                                           1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55b3bdcb0b50]
                                           2: /usr/bin/ceph-osd(+0x5ced71) [0x55b3bdcb0d71]
                                           3: (WholeMergeIteratorImpl::is_main_smaller()+0x13b) [0x55b3be8f93db]
                                           4: (WholeMergeIteratorImpl::next()+0x2c) [0x55b3be8f942c]
                                           5: (BlueStore::_open_collections()+0x660) [0x55b3be2e67f0]
                                           6: (BlueStore::read_allocation_from_drive_on_startup()+0x127) [0x55b3be2ffa97]
                                           7: (BlueStore::_init_alloc()+0xa01) [0x55b3be300bd1]
                                           8: (BlueStore::_open_db_and_around(bool, bool)+0x2f4) [0x55b3be3487e4]
                                           9: (BlueStore::_mount()+0x1ae) [0x55b3be34b55e]
                                           10: (OSD::init()+0x3ba) [0x55b3bddec0ba]
                                           11: main()
                                           12: __libc_start_main()
                                           13: _start()
Aug 31 09:51:40 cephadm2 ceph-osd[20497]: *** Caught signal (Aborted) **
                                           in thread 7f2d60f480c0 thread_name:ceph-osd

                                           ceph version 17.0.0-7195-g7e7326c4 (7e7326c4231f614aff0f7bef4d72beadce6a9c75) quincy (dev)
                                           1: /lib64/libpthread.so.0(+0x12b20) [0x7f2d5eeeeb20]
                                           2: gsignal()
                                           3: abort()
                                           4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x55b3bdcb0bae]
                                           5: /usr/bin/ceph-osd(+0x5ced71) [0x55b3bdcb0d71]
                                           6: (WholeMergeIteratorImpl::is_main_smaller()+0x13b) [0x55b3be8f93db]
                                           7: (WholeMergeIteratorImpl::next()+0x2c) [0x55b3be8f942c]
                                           8: (BlueStore::_open_collections()+0x660) [0x55b3be2e67f0]
                                           9: (BlueStore::read_allocation_from_drive_on_startup()+0x127) [0x55b3be2ffa97]
                                           10: (BlueStore::_init_alloc()+0xa01) [0x55b3be300bd1]
                                           11: (BlueStore::_open_db_and_around(bool, bool)+0x2f4) [0x55b3be3487e4]
                                           12: (BlueStore::_mount()+0x1ae) [0x55b3be34b55e]
                                           13: (OSD::init()+0x3ba) [0x55b3bddec0ba]
                                           14: main()
                                           15: __libc_start_main()
                                           16: _start()
                                           NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aug 31 09:51:40 cephadm2 conmon[20474]:     -5> 2021-08-31T13:51:40.897+0000 7f2d60f480c0 -1 bluestore::NCB::restore_allocator::No Valid allocation info on disk (empty file)
Aug 31 09:51:40 cephadm2 conmon[20474]:     -1> 2021-08-31T13:51:40.903+0000 7f2d60f480c0 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7195-g7e7326c4/rpm/el8/BUILD/ceph-17.0.0-7195-g7e7326c4/src/kv/RocksDBStore.cc: In function 'bool WholeMergeIteratorImpl::is_main_smaller()' thread 7f2d60f480c0 time 2021-08-31T13:51:40.899594+0000
Aug 31 09:51:40 cephadm2 conmon[20474]: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7195-g7e7326c4/rpm/el8/BUILD/ceph-17.0.0-7195-g7e7326c4/src/kv/RocksDBStore.cc: 2288: FAILED ceph_assert(current_shard->second->valid())
Aug 31 09:51:40 cephadm2 conmon[20474]: 
Aug 31 09:51:40 cephadm2 conmon[20474]:  ceph version 17.0.0-7195-g7e7326c4 (7e7326c4231f614aff0f7bef4d72beadce6a9c75) quincy (dev)
Aug 31 09:51:40 cephadm2 conmon[20474]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55b3bdcb0b50]
Aug 31 09:51:40 cephadm2 conmon[20474]:  2: /usr/bin/ceph-osd(+0x5ced71) [0x55b3bdcb0d71]
Aug 31 09:51:40 cephadm2 conmon[20474]:  3: (WholeMergeIteratorImpl::is_main_smaller()+0x13b) [0x55b3be8f93db]
Aug 31 09:51:40 cephadm2 conmon[20474]:  4: (WholeMergeIteratorImpl::next()+0x2c) [0x55b3be8f942c]
Aug 31 09:51:40 cephadm2 conmon[20474]:  5: (BlueStore::_open_collections()+0x660) [0x55b3be2e67f0]
Aug 31 09:51:40 cephadm2 conmon[20474]:  6: (BlueStore::read_allocation_from_drive_on_startup()+0x127) [0x55b3be2ffa97]
Aug 31 09:51:40 cephadm2 conmon[20474]:  7: (BlueStore::_init_alloc()+0xa01) [0x55b3be300bd1]
Aug 31 09:51:40 cephadm2 conmon[20474]:  8: (BlueStore::_open_db_and_around(bool, bool)+0x2f4) [0x55b3be3487e4]
Aug 31 09:51:40 cephadm2 conmon[20474]:  9: (BlueStore::_mount()+0x1ae) [0x55b3be34b55e]
Aug 31 09:51:40 cephadm2 conmon[20474]:  10: (OSD::init()+0x3ba) [0x55b3bddec0ba]
Aug 31 09:51:40 cephadm2 conmon[20474]:  11: main()
Aug 31 09:51:40 cephadm2 conmon[20474]:  12: __libc_start_main()
Aug 31 09:51:40 cephadm2 conmon[20474]:  13: _start()
Aug 31 09:51:40 cephadm2 conmon[20474]: 
Aug 31 09:51:40 cephadm2 conmon[20474]:      0> 2021-08-31T13:51:40.907+0000 7f2d60f480c0 -1 *** Caught signal (Aborted) **
Aug 31 09:51:40 cephadm2 conmon[20474]:  in thread 7f2d60f480c0 thread_name:ceph-osd
Aug 31 09:51:40 cephadm2 conmon[20474]: 
Aug 31 09:51:40 cephadm2 conmon[20474]:  ceph version 17.0.0-7195-g7e7326c4 (7e7326c4231f614aff0f7bef4d72beadce6a9c75) quincy (dev)
Aug 31 09:51:40 cephadm2 conmon[20474]:  1: /lib64/libpthread.so.0(+0x12b20) [0x7f2d5eeeeb20]
Aug 31 09:51:40 cephadm2 conmon[20474]:  2: gsignal()
Aug 31 09:51:40 cephadm2 conmon[20474]:  3: abort()
Aug 31 09:51:40 cephadm2 conmon[20474]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x55b3bdcb0bae]
Aug 31 09:51:40 cephadm2 conmon[20474]:  5: /usr/bin/ceph-osd(+0x5ced71) [0x55b3bdcb0d71]
Aug 31 09:51:40 cephadm2 conmon[20474]:  6: (WholeMergeIteratorImpl::is_main_smaller()+0x13b) [0x55b3be8f93db]
Aug 31 09:51:40 cephadm2 conmon[20474]:  7: (WholeMergeIteratorImpl::next()+0x2c) [0x55b3be8f942c]
Aug 31 09:51:40 cephadm2 conmon[20474]:  8: (BlueStore::_open_collections()+0x660) [0x55b3be2e67f0]
Aug 31 09:51:40 cephadm2 conmon[20474]:  9: (BlueStore::read_allocation_from_drive_on_startup()+0x127) [0x55b3be2ffa97]
Aug 31 09:51:40 cephadm2 conmon[20474]:  10: (BlueStore::_init_alloc()+0xa01) [0x55b3be300bd1]
Aug 31 09:51:40 cephadm2 conmon[20474]:  11: (BlueStore::_open_db_and_around(bool, bool)+0x2f4) [0x55b3be3487e4]
Aug 31 09:51:40 cephadm2 conmon[20474]:  12: (BlueStore::_mount()+0x1ae) [0x55b3be34b55e]
Aug 31 09:51:40 cephadm2 conmon[20474]:  13: (OSD::init()+0x3ba) [0x55b3bddec0ba]
Aug 31 09:51:40 cephadm2 conmon[20474]:  14: main()
Aug 31 09:51:40 cephadm2 conmon[20474]:  15: __libc_start_main()
Aug 31 09:51:40 cephadm2 conmon[20474]:  16: _start()
Aug 31 09:51:40 cephadm2 conmon[20474]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aug 31 09:51:40 cephadm2 conmon[20474]: 
Aug 31 09:51:41 cephadm2 systemd-coredump[20743]: Process 20497 (ceph-osd) of user 167 dumped core.
Aug 31 09:51:41 cephadm2 systemd[1]: ceph-1d11c63a-09ac-11ec-83e1-52540031ba78@osd.0.service: Main process exited, code=exited, status=134/n/a
Aug 31 09:51:42 cephadm2 systemd[1]: ceph-1d11c63a-09ac-11ec-83e1-52540031ba78@osd.0.service: Failed with result 'exit-code'.
Aug 31 09:51:52 cephadm2 systemd[1]: ceph-1d11c63a-09ac-11ec-83e1-52540031ba78@osd.0.service: Service RestartSec=10s expired, scheduling restart.
Aug 31 09:51:52 cephadm2 systemd[1]: ceph-1d11c63a-09ac-11ec-83e1-52540031ba78@osd.0.service: Scheduled restart job, restart counter is at 6.
Aug 31 09:51:52 cephadm2 systemd[1]: Stopped Ceph osd.0 for 1d11c63a-09ac-11ec-83e1-52540031ba78.
Aug 31 09:51:52 cephadm2 systemd[1]: ceph-1d11c63a-09ac-11ec-83e1-52540031ba78@osd.0.service: Start request repeated too quickly.
Aug 31 09:51:52 cephadm2 systemd[1]: ceph-1d11c63a-09ac-11ec-83e1-52540031ba78@osd.0.service: Failed with result 'exit-code'.
Aug 31 09:51:52 cephadm2 systemd[1]: Failed to start Ceph osd.0 for 1d11c63a-09ac-11ec-83e1-52540031ba78.

The build I'm using is based on commit a49f10e760b4, with some MDS patches on top (nothing that should affect OSD).
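
To make the backtrace easier to follow, here is a minimal, purely illustrative C++ sketch of the startup path the log shows: restore_allocator() finds an empty allocation file, so _init_alloc() falls back to a full allocation recovery from ONodes, which is where _open_collections() starts iterating the KV store and the assertion fires. Only the names are taken from the trace; the bodies are assumptions for illustration, not the actual BlueStore code.

// Illustrative only: models the fallback path visible in the log above.
// Names mirror the backtrace; the bodies are NOT the real BlueStore code.
#include <iostream>
#include <optional>

std::optional<int> restore_allocator() {
    // "No Valid allocation info on disk (empty file)" -> nothing to restore.
    return std::nullopt;
}

void open_collections() {
    // In the real code this walks RocksDB through a whole-merge iterator;
    // the ceph_assert in WholeMergeIteratorImpl::is_main_smaller() fires here.
    std::cout << "scanning collections / ONodes ...\n";
}

void read_allocation_from_drive_on_startup() {
    std::cout << "Start Allocation Recovery from ONodes ...\n";
    open_collections();
}

void init_alloc() {
    if (!restore_allocator()) {
        std::cout << "restore_allocator() failed! Run Full Recovery from ONodes ...\n";
        read_allocation_from_drive_on_startup();
    }
}

int main() {
    init_alloc();   // reached from _mount() -> _open_db_and_around() in the trace
    return 0;
}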

History

#1 Updated by Jeff Layton over 1 year ago

  • Subject changed from osd crash in bluestore code to FAILED ceph_assert(current_shard->second->valid())

#2 Updated by Neha Ojha over 1 year ago

  • Assignee set to Gabriel BenHanokh

Gabi, I am assigning it to you for now, since this looks related to NCB.

#3 Updated by Neha Ojha about 1 year ago

  • Status changed from New to Can't reproduce

#4 Updated by Radoslaw Zarzynski 4 months ago

  • Status changed from Can't reproduce to New

#5 Updated by Yaarit Hatuka 4 months ago

Also reported via telemetry:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?var-sig_v2=245d51f4792af37e9d6d2e5ae03b75476234c20337d0320971b789032bd76e30&orgId=1

Please note that the earliest version in these reports is 16.2.0 and the most recent one is 16.2.9.
Some of the reporting clusters upgraded to 17.2.0, 17.2.1, and 17.2.3, but there are no crashes from these versions yet.

#6 Updated by Igor Fedotov 3 months ago

Neha Ojha wrote:

Gabi, I am assigning it to you for now, since this looks related to NCB.

No, apparently this is unrelated to NCB - one can see the same issue in 16.2.10 (as pointed out by Radek in comment #4), which lacks the NCB code...

#7 Updated by Igor Fedotov 3 months ago

  • Assignee changed from Gabriel BenHanokh to Adam Kupczyk

IMO this is rather related to the DB sharding support introduced by https://github.com/ceph/ceph/pull/34006.
Hence reassigning to Adam.
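
For context, here is a rough, self-contained sketch of how a merged iteration over a main iterator plus sharded column-family iterators can trip this kind of valid() assertion. The names loosely follow the backtrace, but the logic is an illustrative assumption, not the actual RocksDBStore implementation.

// Illustrative sketch of a "whole merge" iterator over a main iterator plus
// per-shard iterators; NOT the actual RocksDBStore code.
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct SimpleIter {
    std::vector<std::pair<std::string, std::string>> kv;
    std::size_t pos = 0;
    bool valid() const { return pos < kv.size(); }
    const std::string& key() const { return kv[pos].first; }
    void next() { ++pos; }
};

struct WholeMergeIter {
    SimpleIter main;                                 // "main" (default) column family
    std::map<std::string, SimpleIter*> shards;       // shard name -> shard iterator
    std::map<std::string, SimpleIter*>::iterator current_shard;

    bool is_main_smaller() {
        // Invariant behind the failed assert: current_shard must only ever
        // point at a shard iterator that is still valid().
        assert(current_shard->second->valid());
        return main.key() < current_shard->second->key();
    }

    void next() {
        // If current_shard is left pointing at an exhausted iterator
        // (by a bug or unexpected on-disk state), the next call asserts.
        if (is_main_smaller())
            main.next();
        else
            current_shard->second->next();
    }
};

int main() {
    SimpleIter shard;
    shard.kv = {{"b", "2"}};
    WholeMergeIter it;
    it.main.kv = {{"a", "1"}, {"c", "3"}};
    it.shards["p-0"] = &shard;
    it.current_shard = it.shards.begin();
    it.next();   // "a" < "b": advances the main iterator
    it.next();   // "c" >= "b": consumes the only shard entry
    // it.next(); // would now hit the assert: the shard iterator is exhausted
    return 0;
}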
