Bug #46490

osds crashing during deep-scrub

Added by Lawrence Smith almost 4 years ago. Updated about 3 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
bluestore
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During scrubbing, OSDs from our 8+3 EC pool seem to be randomly crashing with the following backtrace:

Jul 11 14:52:59 kaa-32 ceph-osd[43587]:      0> 2020-07-11 14:52:58.923 7f0c47ce2700 -1 *** Caught signal (Aborted) **
 in thread 7f0c47ce2700 thread_name:tp_osd_tp

 ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
 1: (()+0x13cc0) [0x7f0c95fb6cc0]
 2: (gsignal()+0x10b) [0x7f0c95a1cd00]
 3: (abort()+0x148) [0x7f0c95a1e388]
 4: (()+0xa0739) [0x7f0c95dac739]
 5: (()+0xcf0d1) [0x7f0c95ddb0d1]
 6: (()+0xcf108) [0x7f0c95ddb108]
 7: (__cxa_rethrow()+0) [0x7f0c95ddb2dc]
 8: (RocksDBBlueFSVolumeSelector::select_prefer_bdev(void*)+0) [0x559410f16f42]
 9: (BlueStore::_decompress(ceph::buffer::v14_2_0::list&, ceph::buffer::v14_2_0::list*)+0x5d0) [0x559410f5715e]
 10: (BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x1cb9) [0x559410f83485]
 11: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x200) [0x559410f91678]
 12: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap&, ScrubMapBuilder&, ScrubMap::object&)+0x1ed) [0x559410e2d38f]
 13: (PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&)+0x385) [0x559410d0c225]
 14: (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x79) [0x559410bace25]
 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1686) [0x559410bdbec8]
 16: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0xaf) [0x559410bdced5]
 17: (PGScrub::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1a) [0x559410d8ec40]
 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x5ea) [0x559410b0cf0c]
 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4b6) [0x5594110e19ba]
 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5594110e4a32]
 21: (()+0x82f3) [0x7f0c95fab2f3]
 22: (clone()+0x3f) [0x7f0c95ae452f]

Meanwhile, we have been experiencing frequent inconsistent objects for some time now. After the update to 13.2.6 these have been occurring at a reduced rate (as also noted in https://tracker.ceph.com/issues/22464#note-71). At the same time, some OSDs keep crashing during deep-scrubbing.
A recent update to 14.2.10 has not improved the situation: deep-scrubbing still turns up inconsistent objects and crashing OSDs.

In 13.2.6 the osds hit an assert, which is no longer the case in nautilus.


Files

ceph-osd.log_nautilus.gz (859 KB) - Current crash log from 14.2.10 - Lawrence Smith, 07/11/2020 02:54 PM
ceph-osd.log_mimic.gz (357 KB) - Old log from 13.2.6 - Lawrence Smith, 07/11/2020 03:05 PM
ceph-osd.log_nautilus2.gz (189 KB) - Lawrence Smith, 07/15/2020 07:55 AM
osd-164-fsck.out.gz (1.28 KB) - Lawrence Smith, 07/15/2020 07:56 AM
osd-164-perf.dump.gz (4.5 KB) - Lawrence Smith, 07/15/2020 07:59 AM
grep_fsck_verify.gz (15.4 KB) - Lawrence Smith, 07/17/2020 12:43 PM
fsck_nondeep.out.gz (289 Bytes) - 1) - Lawrence Smith, 09/12/2020 02:07 PM
fsck_deep.out.gz (755 Bytes) - 2) - Lawrence Smith, 09/12/2020 02:07 PM
fsck_deep.log.gz (81.7 KB) - 3) - Lawrence Smith, 09/12/2020 02:07 PM
ceph-osd.log_301.gz (360 KB) - 4) - Lawrence Smith, 09/12/2020 02:14 PM
fsck_deep_178.out.gz (1.26 KB) - Lawrence Smith, 10/28/2020 03:15 PM

Related issues (1 total: 0 open, 1 closed)

Related to bluestore - Bug #47475: Compressed blobs lack checksums (Resolved, Igor Fedotov)

Actions #1

Updated by Igor Fedotov almost 4 years ago

First of all, IMO the Nautilus backtrace isn't valid - there are no RocksDBBlueFSVolumeSelector::select_prefer_bdev calls along the read/decompress path. The mimic one looks valid, though - the assertion occurs while decoding the compression header.
To make sure https://tracker.ceph.com/issues/22464 is relevant, could you please check the performance counters for the crashing OSDs - 'bluestore_reads_with_retries' specifically.
It also makes sense to check for H/W errors with dmesg and/or smartctl.

Also, could you please run an offline deep fsck for these OSDs - does it show any errors?
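
For reference, a rough sketch of these checks (the OSD id and device name are placeholders; exact service and path names depend on the deployment):

# read-retry counter via the OSD admin socket
ceph daemon osd.<id> perf dump | grep bluestore_reads_with_retries
# look for hardware-level errors
dmesg | grep -i error
smartctl -a /dev/<device>
# offline deep fsck against a stopped OSD
systemctl stop ceph-osd@<id>
ceph-bluestore-tool fsck --deep --path /var/lib/ceph/osd/ceph-<id>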

Actions #2

Updated by Lawrence Smith almost 4 years ago

Hi Igor, thanks for looking into this.
You were right about the corrupted backtrace. Although select_prefer_bdev can be found in most backtraces, I found one that looks like the mimic one and hits the assert in denc.h.

The dmesg and SMART stats were clean for these drives. The 'bluestore_reads_with_retries' counter is always 0.

One fsck also just crashed this morning while others are still running. I have the full 90G log from the fsck if you need something specific from it.

I will append the ceph-osd.log with the correct backtrace, the osd performance dump and the output of the deep fsck, all for the same osd.

Actions #3

Updated by Igor Fedotov almost 4 years ago

Hi Lawrence
this looks like some data corruption (multiple objects referring to the same disk extent) which causes decompression to assert. And given that there are no read retries and given the 'got' checksum values, I think this issue is unrelated to https://tracker.ceph.com/issues/22464

Could you please grep the OSD.164 fsck log for the "fsck error" or "_verify_csum bad" patterns and share the resulting log if its size is reasonable...

Actions #4

Updated by Lawrence Smith almost 4 years ago

The whole logfile is 60G. The file 'osd-164-fsck.out.gz' that I uploaded with the last update is the console output of the fsck. This output had both the "fsck error" and "_verify_csum bad" lines, which are identical to the respective lines that grep found in the full log:

2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#

and
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 5#6:2984ab53:::10015d6313d.00000000:head# error during read:  0~4000 (5) Input/output error
2020-07-15 08:25:42.711 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 6#6:13ab44c4:::10015d81a92.00000002:head# extent 0x3fbcfd20000~20000 or a subset is already allocated (misreferenced)
   -16> 2020-07-15 08:25:42.711 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 6#6:13ab44c4:::10015d81a92.00000002:head# extent 0x3fbcfd20000~20000 or a subset is already allocated (misreferenced)
 -9999> 2020-07-15 08:25:42.711 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 6#6:13ab44c4:::10015d81a92.00000002:head# extent 0x3fbcfd20000~20000 or a subset is already allocated (misreferenced)

Actions #5

Updated by Igor Fedotov almost 4 years ago

Lawrence Smith wrote:

The whole logfile is 60G. The file 'osd-164-fsck.out.gz' that I uploaded with the last update is the console output of the fsck. This output had both the "fsck error" and "_verify_csum bad" lines, which are identical to the respective lines that grep found in the full log:
[...]
and
[...]

Are these lines the only ones in the osd-164 fsck log which contain the "_verify_csum bad" or "fsck error" strings?
In other words - what's the output for:
grep -e "_verify_csum bad" -e "fsck error"

?

Actions #6

Updated by Lawrence Smith almost 4 years ago

I'm sorry, I wasn't clear about that. Yes, this is the only output. (The line "-9999>[...]" appears only in the console output, not in the log file.)
Attached is the output of grep -e "_verify_csum bad" -e "fsck error" -A100 -B100 on the log.

Actions #7

Updated by Lawrence Smith almost 4 years ago

The output for grep -e "_verify_csum bad" -e "fsck error" on the log file is:

2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 5#6:2984ab53:::10015d6313d.00000000:head# error during read:  0~4000 (5) Input/output error
921747443:2020-07-15 08:25:42.711 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 6#6:13ab44c4:::10015d81a92.00000002:head# extent 0x3fbcfd20000~20000 or a subset is already allocated (misreferenced)
   -16> 2020-07-15 08:25:42.711 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 6#6:13ab44c4:::10015d81a92.00000002:head# extent 0x3fbcfd20000~20000 or a subset is already allocated (misreferenced)

Actions #8

Updated by Lawrence Smith almost 4 years ago

... Sorry, the preceding "921747443:" in line 6 is just a remnant line number from the grep -n I did initially and forgot to delete.

Actions #9

Updated by Igor Fedotov almost 4 years ago

Hi Lawrence,
sorry for a delayed response.

So my hypothesis about data corruption is still valid. You have at least 2 errors for this OSD. As far as I can tell they're independent of each other but might potentially have the same origin.
1) Onode 5#6:2984ab53:::10015d6313d.00000000:head# fails to pass checksum verification for some of its data. This results in an fsck error indication and might result in a data reading error later on.
2) Onode 6#6:13ab44c4:::10015d81a92.00000002:head# uses a physical extent which is also referenced from another onode. Most likely this extent was overwritten by that second onode, and hence the compression header decoding asserts. For unknown reasons checksums aren't present for this specific onode, which permits broken data to reach the decompressor.

The onode's extent from 1) might also be misreferenced, but deep fsck is unable to complete due to 2).
I can't tell what the root cause of the misreference is for now - this could have happened long ago with a different Ceph release... So to start investigating this issue we should get rid of the existing misreferences and start monitoring for new ones, just to learn whether this bug is still there.

The more practical need is to eliminate the current assertions. One can do that by removing the broken onode(s) which cause them. For now it's the onode from 2) that has to be removed, but additional ones might be found once fsck is able to proceed. Let's leave the onode from 1) as-is for now.
To remove an onode one can use ceph-objectstore-tool against the offline OSD. You'll need the 'list' command to learn the pgid for the object in question, and then proceed with the 'remove' one.
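
A rough sketch of that step, assuming osd.164 and the object from 2) (the pgid and the JSON descriptor in angle brackets come from the list output; exact syntax may vary between releases):

systemctl stop ceph-osd@164
# list prints the pgid and a JSON descriptor for the matching object(s)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-164 --op list 10015d81a92.00000002
# remove the object, feeding the pgid and descriptor from the list output back in
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-164 --pgid <pgid> '<json-from-list>' remove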

Then put the OSD back online and initiate a deep scrub for this specific PG via the ceph CLI again. This might again kill some OSDs at different broken object(s), if any; in that case you'll need to repeat the remove/scrub steps for the newly detected objects.
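
For example (the pgid being the one learned above):

systemctl start ceph-osd@164
ceph pg deep-scrub <pgid>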

Prior to running the above recovery plan you might want to run a regular (non-deep) fsck, which will hopefully detect all the misreferences but not the checksum errors. This should complete smoothly as there is no data reading/header decoding along this path. Hence you can estimate how many broken objects are present on the OSD; not all of them cause an OSD crash though...

Actions #10

Updated by Lawrence Smith almost 4 years ago

Thank you for the detailed answer. I used the procedure you described to find all misreferenced objects (215 in total) and deleted them. Afterwards I ran an fsck on each OSD and found no more such objects.

We restarted all Ceph daemons on all hosts on Wednesday evening and issued a deep-scrub for all OSDs which had found such objects.

The following morning we had a power outage. Shortly after bringing the system back online, some OSDs started crashing once again during _decompress (we had no crashes in the 14 hours before the outage; afterwards we had 4 crashes caused by _decompress within an hour).

Sadly I do not have the fsck output after having removed all misreferenced objects. This is the fsck of an osd before removing the object:

2020-07-28 18:13:39.841 7ff2de78d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: 7#6:23dceaa4:::10014020c30.00000000:head# extent 0x7252d60000~10000 or a subset is already allocated (misreferenced)
2020-07-28 18:14:27.589 7ff2de78d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: free extent 0x2baeb0d0000~40000 intersects allocated blocks
2020-07-28 18:14:27.899 7ff2de78d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: leaked extent 0x3062a40000~10000
2020-07-28 18:14:27.899 7ff2de78d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: leaked extent 0x24553840000~10000
fsck found 4 error(s)

After that crash, this OSD crashed again while scrubbing a different object. Here is the output of the fsck after the power outage:

2020-07-31 14:57:26.640 7f92f121d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: 5#6:d118994b:::10003b4aa32.0000004c:head# extent 0x7252d60000~10000 or a subset is already allocated (misreferenced)
2020-07-31 14:58:47.106 7f92f121d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: free extent 0x2baeb0d0000~40000 intersects allocated blocks
2020-07-31 14:58:47.426 7f92f121d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: leaked extent 0x3062a40000~10000
2020-07-31 14:58:47.426 7f92f121d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: leaked extent 0x24553840000~10000
fsck found 4 error(s)

I noticed that the extent mentioned for the new misreferenced object is always the same as the extent of the object I first removed. Meanwhile, the PG of the new misreferenced object was different from all previously affected PGs.

In the meantime, scrubbing has been reporting a lot of scrub errors and inconsistent PGs:

 2020-07-31 12:55:21+02 | kaa-57  | ceph-osd | 190266 |        6 | 2020-07-31 12:55:21.671 7fbe153e7700 -1 log_channel(cluster) log [ERR] : 6.1c9s0 deep-scrub 1 missing, 0 inconsistent objects
 2020-07-31 12:55:21+02 | kaa-57  | ceph-osd | 190266 |        6 | 2020-07-31 12:55:21.671 7fbe153e7700 -1 log_channel(cluster) log [ERR] : 6.1c9 deep-scrub 1 errors
 2020-07-31 12:55:21+02 | kaa-50  | ceph-osd | 189867 |        6 | 2020-07-31 12:55:21.729 7ff29b068700 -1 log_channel(cluster) log [ERR] : 6.25es0 shard 312(6) 6:7a40113a:::10016516ec3.00000838:head : missing
 2020-07-31 12:16:52+02 | kaa-114 | ceph-osd | 75075  |        6 | 2020-07-31 12:16:52.518 7fe31b4ec700 -1 log_channel(cluster) log [ERR] : 6.341 shard 240(7) soid 6:82deec73:::10001a511c5.00000000:head : candidate had a read error
 2020-07-31 12:11:41+02 | kaa-127 | ceph-osd | 7072   |        6 | 2020-07-31 12:11:41.477 7f46520e7700 -1 log_channel(cluster) log [ERR] : 6.1fcs0 shard 256(4) 6:3f9408d5:::1001642c93a.00000000:head : missing
 2020-07-31 11:55:58+02 | kaa-57  | ceph-osd | 190266 |        6 | 2020-07-31 11:55:58.335 7fbe153e7700 -1 log_channel(cluster) log [ERR] : 6.1c9s0 shard 225(7) 6:93a1dfa8:::1001630b27e.00000de2:head : missing

The missing objects are those I had deleted, but we also consistently see these read errors appearing. The pg with the read error is not among those where I deleted any objects, nor is it affected by any new misreferenced objects.

I'm not sure where to look next for this problem. We are running a rather heterogeneous cluster, and different kinds of hard drives across the EC pool are affected. We have the write cache enabled for all our disks. As we are running a self-compiled version of Ceph, I am currently recompiling Ceph with gcc 9.3.0 to rule out any errors our older gcc 7.3.0 may have caused.
A short SMART scan came out negative for the affected hard drives, so I just started a long scan for multiple affected devices.
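
For reference, the disk-level checks mentioned above can be run roughly like this (device names are placeholders):

# drive health, error counters and self-test log
smartctl -a /dev/sdX
# start an extended (long) self-test
smartctl -t long /dev/sdX
# show whether the volatile write cache is enabled (SATA drives)
hdparm -W /dev/sdX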

Actions #11

Updated by Anonymous over 3 years ago

I can confirm the same bug, causing osd flapping since 14.2.10.

2020-08-22 12:14:28.948 7fa5bd45f700 -1 *** Caught signal (Aborted) **
 in thread 7fa5bd45f700 thread_name:tp_osd_tp

 ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
 1: (()+0x128a0) [0x7fa5e01cb8a0]
 2: (gsignal()+0xc7) [0x7fa5dee7df47]
 3: (abort()+0x141) [0x7fa5dee7f8b1]
 4: (()+0x8c957) [0x7fa5df872957]
 5: (()+0x92ae6) [0x7fa5df878ae6]
 6: (()+0x92b21) [0x7fa5df878b21]
 7: (()+0x92d54) [0x7fa5df878d54]
 8: (ceph::buffer::v14_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)+0x250) [0x55e7aca4dbb0]
 9: (ceph::buffer::v14_2_0::create_aligned(unsigned int, unsigned int)+0x22) [0x55e7aca4dc92]
 10: (ceph::buffer::v14_2_0::create_small_page_aligned(unsigned int)+0x2a) [0x55e7aca4e5ea]
 11: (KernelDevice::read(unsigned long, unsigned long, ceph::buffer::v14_2_0::list*, IOContext*, bool)+0xda) [0x55e7ac81162a]
 12: (BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x2065) [0x55e7ac704de5]
 13: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x1bb) [0x55e7ac70b2ab]
 14: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap&, ScrubMapBuilder&, ScrubMap::object&)+0x2d2) [0x55e7ac565a82]
 15: (PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&)+0x393) [0x55e7ac47bd93]
 16: (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x7b) [0x55e7ac30bb4b]
 17: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1733) [0x55e7ac33bad3]
 18: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0xaf) [0x55e7ac33cbbf]
 19: (PGScrub::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1a) [0x55e7ac50112a]
 20: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf5) [0x55e7ac266245]
 21: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x55e7ac86edfc]
 22: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e7ac871fc0]
 23: (()+0x76db) [0x7fa5e01c06db]
 24: (clone()+0x3f) [0x7fa5def60a3f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #12

Updated by Igor Fedotov over 3 years ago

@songbo wang xor - could you please run fsck for the failing OSD?

Actions #13

Updated by Neha Ojha over 3 years ago

  • Status changed from New to Need More Info

Actions #14

Updated by Lawrence Smith over 3 years ago

Neither the new gcc version nor a new kernel version (upgraded from 4.14 to 5.4) was able to solve this issue for us.

We identified all misreferenced objects through a non-deep fsck on all OSDs in the EC pool and deleted these objects. Still, after some time some OSDs once again report misreferenced objects.

Scrubbing is still regularly finding inconsistencies and leading to OSD crashes. Some of these crashes also occur on OSDs where the fsck does not report any misreferenced objects. A deep fsck on such an OSD also crashes without any mention of "_verify_csum bad" or "fsck error" in the log file.

I will append
1) the non-deep fsck output,
2) the console output of the deep fsck,
3) a grep -A1000 -B1000 "Caught signal" on the deep fsck logfile and
4) the ceph_osd.log of an osd crashing during deep-scrub.

Actions #15

Updated by Igor Fedotov over 3 years ago

Lawrence,
I'm not sure that it's enough to remove the object to fix such misreferences. This still leaves inconsistent extent references at the BlueStore level. E.g. if you had two objects referencing the same LBA, there is no guarantee that the first object is detected by fsck, since it might still have valid csums. When you then remove the other, conflicting object, the misreferenced LBA is freed. But the first object still uses it, hence another allocation of this LBA and a subsequent write corrupt its data.

One should instead run "ceph-bluestore-tool repair" to fix such issues at the BlueStore level first.
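
That would be roughly, per affected OSD (the id is a placeholder; the OSD has to be offline):

systemctl stop ceph-osd@164
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-164
# verify with a (non-deep) fsck before bringing the OSD back
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-164
systemctl start ceph-osd@164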

Actions #16

Updated by Igor Fedotov over 3 years ago

  • Related to Bug #47475: Compressed blobs lack checksums added
Actions #17

Updated by Lawrence Smith over 3 years ago

It seems that ceph-bluestore-tool repair only temporarily resolves the issue for us.
We ran the repair tool on each OSD, followed by an fsck (not deep). The fscks were all successful. On selected OSDs we also ran fsck --deep, which was also successful.

For a little while (around 1 week) we had no more OSD crashes (this may be anecdotal), but after a while we once again saw the same crashes as before.
This time the simple fsck of a crashing OSD came back clean (fsck success), while the fsck --deep crashed.
The logfile of the fsck is too large to be uploaded. I will append the console output of the crashed deep fsck.

Actions #18

Updated by Maximilian Stinsky over 3 years ago

We seem to be hitting the same behaviour after upgrading our ceph cluster from 12.2.12 to 14.2.11.
Since then we have had quite a few inconsistent PGs every day after deep-scrubbing and also see some OSDs crashing every day. This only happens on our EC pool with HDDs and an SSD cache.
Ceph is running in CentOS containers on Ubuntu 16.04.5 with kernel 4.15.0-123-generic.

The order seems to be:
An OSD crashes with the already mentioned "_verify_csum bad ... checksum at blob offset 0x0 ...".
The OSD comes back online healthy.
A deep-scrub finds a read error on that specific OSD after a couple of days.

Actions #19

Updated by Maximilian Stinsky about 3 years ago

Hi,

we tried upgrading our cluster to version 14.2.18, but we still get the random scrub errors on the EC pool every day.
Can we somehow help by gathering information to find out what causes this?

Actions #20

Updated by Igor Fedotov about 3 years ago

Maximilian Stinsky wrote:

Hi,

we tried upgrading our cluster to version 14.2.18, but we still get the random scrub errors on the EC pool every day.
Can we somehow help by gathering information to find out what causes this?

Does it find errors on a different object every time? Or does a deep scrub on the same PG succeed or fail in an arbitrary manner?

In other words, I am wondering whether the errors are persistent or not.

Actions #21

Updated by Maximilian Stinsky about 3 years ago

Igor Fedotov wrote:

Maximilian Stinsky wrote:

Hi,

we tried upgrading our cluster to version 14.2.18, but we still get the random scrub errors on the EC pool every day.
Can we somehow help by gathering information to find out what causes this?

Does it find errors on a different object every time? Or does a deep scrub on the same PG succeed or fail in an arbitrary manner?

In other words, I am wondering whether the errors are persistent or not.

It seems to me that the scrub errors are completely random. I checked a couple of the failed objects; they had not been flagged before and have not reappeared since they were found once.
From what we can see, the OSD and host on which it fails are also random. The only thing that is consistent is that this behaviour is only triggered on our EC pool.

Actions #22

Updated by Igor Fedotov about 3 years ago

Maximilian Stinsky wrote:

Igor Fedotov wrote:

Maximilian Stinsky wrote:

Hi,

we tried upgrading our cluster to version 14.2.18, but we still get the random scrub errors on the EC pool every day.
Can we somehow help by gathering information to find out what causes this?

Does it find errors on a different object every time? Or does a deep scrub on the same PG succeed or fail in an arbitrary manner?

In other words, I am wondering whether the errors are persistent or not.

It seems to me that the scrub errors are completely random. I checked a couple of the failed objects; they had not been flagged before and have not reappeared since they were found once.
From what we can see, the OSD and host on which it fails are also random. The only thing that is consistent is that this behaviour is only triggered on our EC pool.

Can you see any non-zero values for the 'bluestore_reads_with_retries' performance counter across your OSDs?

Actions #23

Updated by Maximilian Stinsky about 3 years ago

Igor Fedotov wrote:

Maximilian Stinsky wrote:

Igor Fedotov wrote:

Maximilian Stinsky wrote:

Hi,

we tried upgrading our cluster to version 14.2.18, but we still get the random scrub errors on the EC pool every day.
Can we somehow help by gathering information to find out what causes this?

Does it find errors on a different object every time? Or does a deep scrub on the same PG succeed or fail in an arbitrary manner?

In other words, I am wondering whether the errors are persistent or not.

It seems to me that the scrub errors are completely random. I checked a couple of the failed objects; they had not been flagged before and have not reappeared since they were found once.
From what we can see, the OSD and host on which it fails are also random. The only thing that is consistent is that this behaviour is only triggered on our EC pool.

Can you see any non-zero values for the 'bluestore_reads_with_retries' performance counter across your OSDs?

We are quite sure that we have found the issue. We identified that we had a kernel mismatch between hosts and that all hosts affected by the checksum errors were running 4.13.0-36-generic. After downgrading all hosts with that specific kernel to version 4.10.0-28-generic, the checksum errors have been decreasing significantly day by day. We assume that we are currently finding the last scrub errors from objects that were written while running the "broken" kernel, and we will monitor whether the issue disappears completely over the next days.

After we identified that the kernel might be the problem, we also found a related bug report: https://tracker.ceph.com/issues/22464
So it seems we had somewhat unlucky timing, upgrading to nautilus and the kernel at around the same time.
