Bug #46490

open

osds crashing during deep-scrub

Added by Lawrence Smith almost 4 years ago. Updated about 3 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source: Community (user)
Tags: bluestore
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During deep-scrubbing, OSDs in our 8+3 EC pool crash seemingly at random with the following backtrace:

Jul 11 14:52:59 kaa-32 ceph-osd[43587]:      0> 2020-07-11 14:52:58.923 7f0c47ce2700 -1 *** Caught signal (Aborted) **
 in thread 7f0c47ce2700 thread_name:tp_osd_tp

 ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
 1: (()+0x13cc0) [0x7f0c95fb6cc0]
 2: (gsignal()+0x10b) [0x7f0c95a1cd00]
 3: (abort()+0x148) [0x7f0c95a1e388]
 4: (()+0xa0739) [0x7f0c95dac739]
 5: (()+0xcf0d1) [0x7f0c95ddb0d1]
 6: (()+0xcf108) [0x7f0c95ddb108]
 7: (__cxa_rethrow()+0) [0x7f0c95ddb2dc]
 8: (RocksDBBlueFSVolumeSelector::select_prefer_bdev(void*)+0) [0x559410f16f42]
 9: (BlueStore::_decompress(ceph::buffer::v14_2_0::list&, ceph::buffer::v14_2_0::list*)+0x5d0) [0x559410f5715e]
 10: (BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x1cb9) [0x559410f83485]
 11: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x200) [0x559410f91678]
 12: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap&, ScrubMapBuilder&, ScrubMap::object&)+0x1ed) [0x559410e2d38f]
 13: (PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&)+0x385) [0x559410d0c225]
 14: (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x79) [0x559410bace25]
 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1686) [0x559410bdbec8]
 16: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0xaf) [0x559410bdced5]
 17: (PGScrub::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1a) [0x559410d8ec40]
 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x5ea) [0x559410b0cf0c]
 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4b6) [0x5594110e19ba]
 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5594110e4a32]
 21: (()+0x82f3) [0x7f0c95fab2f3]
 22: (clone()+0x3f) [0x7f0c95ae452f]

We have also been seeing frequent inconsistent objects for some time now. Since the update to 13.2.6 these have occurred at a reduced rate (as also noted in https://tracker.ceph.com/issues/22464#note-71). At the same time, some OSDs keep crashing during deep-scrubbing.
A recent update to 14.2.10 has not improved the situation: deep-scrubbing still turns up inconsistent objects, and OSDs still crash.

In 13.2.6 the OSDs crashed by hitting an assert; this is no longer the case in Nautilus.
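In the backtrace, the abort appears to follow an exception rethrown (`__cxa_rethrow`) beneath `BlueStore::_decompress` while deep-scrub reads an object. The related Bug #47475 (compressed blobs lacking checksums) describes a condition under which corrupted compressed data could plausibly reach the decompressor without being detected first. The following is a minimal Python sketch of that general failure mode only: it is not BlueStore code and not a confirmed root cause for this bug. Raw deflate via `zlib` stands in for the pool's compression algorithm, CRC-32 stands in for a per-blob checksum, and `store_blob`, `read_blob`, `read_blob_unchecked` and `CorruptBlobError` are hypothetical names.

```python
import zlib
from typing import Tuple


class CorruptBlobError(Exception):
    """Hypothetical error raised when a stored blob fails its checksum."""


def _deflate_raw(data: bytes) -> bytes:
    # Raw deflate (wbits=-15) carries no built-in checksum, standing in for
    # a compressed blob that is stored without one.
    comp = zlib.compressobj(wbits=-15)
    return comp.compress(data) + comp.flush()


def store_blob(data: bytes) -> Tuple[bytes, int]:
    # Compress the payload and record a CRC-32 over the compressed bytes,
    # so corruption can be caught before decompression is attempted.
    compressed = _deflate_raw(data)
    return compressed, zlib.crc32(compressed)


def read_blob(compressed: bytes, crc: int) -> bytes:
    # Checked read: a mismatch becomes an explicit, reportable error
    # instead of feeding damaged bytes to the decompressor.
    if zlib.crc32(compressed) != crc:
        raise CorruptBlobError("checksum mismatch on compressed blob")
    return zlib.decompress(compressed, wbits=-15)


def read_blob_unchecked(compressed: bytes) -> bytes:
    # Unchecked read: corrupted input goes straight to the decompressor,
    # which may raise zlib.error or silently return wrong bytes.
    return zlib.decompress(compressed, wbits=-15)


if __name__ == "__main__":
    payload = b"object data " * 100
    blob, crc = store_blob(payload)
    assert read_blob(blob, crc) == payload

    damaged = bytearray(blob)
    damaged[len(damaged) // 2] ^= 0xFF  # simulate one flipped byte on disk
    try:
        read_blob(bytes(damaged), crc)
    except CorruptBlobError as err:
        print("detected:", err)
```

With the checked read, a single flipped byte is always reported as `CorruptBlobError` (CRC-32 detects any error confined to one byte), whereas the unchecked path leaves detection entirely to the decompressor, which may throw or may not notice at all.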


Files

ceph-osd.log_nautilus.gz (859 KB) - Current crash log from 14.2.10 - Lawrence Smith, 07/11/2020 02:54 PM
ceph-osd.log_mimic.gz (357 KB) - Old log from 13.2.6 - Lawrence Smith, 07/11/2020 03:05 PM
ceph-osd.log_nautilus2.gz (189 KB) - Lawrence Smith, 07/15/2020 07:55 AM
osd-164-fsck.out.gz (1.28 KB) - Lawrence Smith, 07/15/2020 07:56 AM
osd-164-perf.dump.gz (4.5 KB) - Lawrence Smith, 07/15/2020 07:59 AM
grep_fsck_verify.gz (15.4 KB) - Lawrence Smith, 07/17/2020 12:43 PM
fsck_nondeep.out.gz (289 Bytes) - 1) - Lawrence Smith, 09/12/2020 02:07 PM
fsck_deep.out.gz (755 Bytes) - 2) - Lawrence Smith, 09/12/2020 02:07 PM
fsck_deep.log.gz (81.7 KB) - 3) - Lawrence Smith, 09/12/2020 02:07 PM
ceph-osd.log_301.gz (360 KB) - 4) - Lawrence Smith, 09/12/2020 02:14 PM
fsck_deep_178.out.gz (1.26 KB) - Lawrence Smith, 10/28/2020 03:15 PM

Related issues: 1 (0 open, 1 closed)

Related to bluestore - Bug #47475: Compressed blobs lack checksums (Resolved, Igor Fedotov)
