Bug #46490

osds crashing during deep-scrub

Added by Lawrence Smith 24 days ago. Updated 4 days ago.

Status: New
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source: Community (user)
Tags: bluestore
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

During scrubbing, OSDs from our 8+3 EC pool seem to be crashing randomly with the following backtrace:

Jul 11 14:52:59 kaa-32 ceph-osd[43587]:      0> 2020-07-11 14:52:58.923 7f0c47ce2700 -1 *** Caught signal (Aborted) **
 in thread 7f0c47ce2700 thread_name:tp_osd_tp

 ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
 1: (()+0x13cc0) [0x7f0c95fb6cc0]
 2: (gsignal()+0x10b) [0x7f0c95a1cd00]
 3: (abort()+0x148) [0x7f0c95a1e388]
 4: (()+0xa0739) [0x7f0c95dac739]
 5: (()+0xcf0d1) [0x7f0c95ddb0d1]
 6: (()+0xcf108) [0x7f0c95ddb108]
 7: (__cxa_rethrow()+0) [0x7f0c95ddb2dc]
 8: (RocksDBBlueFSVolumeSelector::select_prefer_bdev(void*)+0) [0x559410f16f42]
 9: (BlueStore::_decompress(ceph::buffer::v14_2_0::list&, ceph::buffer::v14_2_0::list*)+0x5d0) [0x559410f5715e]
 10: (BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x1cb9) [0x559410f83485]
 11: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x200) [0x559410f91678]
 12: (ECBackend::be_deep_scrub(hobject_t const&, ScrubMap&, ScrubMapBuilder&, ScrubMap::object&)+0x1ed) [0x559410e2d38f]
 13: (PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&)+0x385) [0x559410d0c225]
 14: (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x79) [0x559410bace25]
 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1686) [0x559410bdbec8]
 16: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0xaf) [0x559410bdced5]
 17: (PGScrub::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1a) [0x559410d8ec40]
 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x5ea) [0x559410b0cf0c]
 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4b6) [0x5594110e19ba]
 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5594110e4a32]
 21: (()+0x82f3) [0x7f0c95fab2f3]
 22: (clone()+0x3f) [0x7f0c95ae452f]

We have been experiencing frequent inconsistent objects for some time now. Since the update to 13.2.6 these have been occurring at a reduced rate (as also noted in https://tracker.ceph.com/issues/22464#note-71). At the same time, some OSDs keep crashing during deep scrubbing.
A recent update to 14.2.10 has not improved the situation: deep scrubbing still turns up inconsistent objects and crashes OSDs.

In 13.2.6 the OSDs hit an assert; this is no longer the case in Nautilus.

ceph-osd.log_nautilus.gz - Current crash log from 14.2.10 (859 KB) Lawrence Smith, 07/11/2020 02:54 PM

ceph-osd.log_mimic.gz - Old log from 13.2.6 (357 KB) Lawrence Smith, 07/11/2020 03:05 PM

ceph-osd.log_nautilus2.gz (189 KB) Lawrence Smith, 07/15/2020 07:55 AM

osd-164-fsck.out.gz (1.28 KB) Lawrence Smith, 07/15/2020 07:56 AM

osd-164-perf.dump.gz (4.5 KB) Lawrence Smith, 07/15/2020 07:59 AM

grep_fsck_verify.gz (15.4 KB) Lawrence Smith, 07/17/2020 12:43 PM

History

#1 Updated by Igor Fedotov 23 days ago

First of all, IMO the Nautilus backtrace isn't valid - there are no RocksDBBlueFSVolumeSelector::select_prefer_bdev calls along the read/decompress path. The Mimic one looks valid - the assertion occurs while decoding the compression header.
To make sure https://tracker.ceph.com/issues/22464 is relevant, could you please check the performance counters for the crashing OSDs - 'bluestore_reads_with_retries' specifically?
It also makes sense to check for H/W errors with dmesg and/or smartctl.

Also, could you please run an offline deep fsck for these OSDs - does it show any errors?
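
As an illustration, a minimal sketch of these checks, with OSD id 164 and /dev/sdX as placeholders (exact option syntax may vary slightly between releases):

# read-retry counter on a running OSD, queried via its admin socket
ceph daemon osd.164 perf dump | grep bluestore_reads_with_retries

# look for kernel-level I/O errors and check drive health
dmesg | grep -i error
smartctl -a /dev/sdX

# offline deep fsck against the stopped OSD; --deep also reads and verifies object data
ceph-bluestore-tool fsck --deep 1 --path /var/lib/ceph/osd/ceph-164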

#2 Updated by Lawrence Smith 21 days ago

Hi Igor, thanks for looking into this.
You were right about the corrupted backtrace. Although select_prefer_bdev can be found in most backtraces, I found one that looks like the Mimic one and hits the assert in denc.h.

The dmesg and SMART stats were clean for these drives. The 'bluestore_reads_with_retries' counter is always 0.

One fsck also just crashed this morning while others are still running. I have the full 90G log from the fsck if you need something specific from it.

I will attach the ceph-osd.log with the correct backtrace, the OSD performance dump, and the output of the deep fsck, all for the same OSD.

#3 Updated by Igor Fedotov 20 days ago

Hi Lawrence
This looks like data corruption (multiple objects referring to the same disk extent), which causes decompression to assert. And given that there are no read retries and given the reported checksum values, I think this issue is unrelated to https://tracker.ceph.com/issues/22464.

Could you please grep the OSD.164 fsck log for the "fsck error" or "_verify_csum bad" patterns and share the resulting output if its size is reasonable?

#4 Updated by Lawrence Smith 20 days ago

The whole logfile is 60G. The file 'osd-164-fsck.out.gz' that I uploaded with the last update is the console output of the fsck. This output contained both the "fsck error" and "_verify_csum bad" lines, which are identical to the respective lines that grep found in the full log:

2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#

and
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 5#6:2984ab53:::10015d6313d.00000000:head# error during read:  0~4000 (5) Input/output error
2020-07-15 08:25:42.711 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 6#6:13ab44c4:::10015d81a92.00000002:head# extent 0x3fbcfd20000~20000 or a subset is already allocated (misreferenced)
   -16> 2020-07-15 08:25:42.711 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 6#6:13ab44c4:::10015d81a92.00000002:head# extent 0x3fbcfd20000~20000 or a subset is already allocated (misreferenced)
 -9999> 2020-07-15 08:25:42.711 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 6#6:13ab44c4:::10015d81a92.00000002:head# extent 0x3fbcfd20000~20000 or a subset is already allocated (misreferenced)

#5 Updated by Igor Fedotov 19 days ago

Lawrence Smith wrote:

The whole logfile is 60G. The file 'osd-164-fsck.out.gz' that I uploaded with the last update is the console output of the fsck. This output contained both the "fsck error" and "_verify_csum bad" lines, which are identical to the respective lines that grep found in the full log:
[...]
and
[...]

Are these lines the only ones in the osd-164 fsck log that contain the "_verify_csum bad" or "fsck error" strings?
In other words, what is the full output of:
grep -e "_verify_csum bad" -e "fsck error"
on that log?

#6 Updated by Lawrence Smith 19 days ago

I'm sorry, I wasn't clear about that. Yes, this is the only output. (The line "-9999>[...]" appears only in the console output, not in the log file.)
Attached is the output of grep -e "_verify_csum bad" -e "fsck error" -A100 -B100 on the log.

#7 Updated by Lawrence Smith 19 days ago

The output for grep -e "_verify_csum bad" -e "fsck error" on the log file is:

2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x151755ca, expected 0x6706be76, device location [0x42d2d60000~1000], logical extent 0x0~1000, object 5#6:2984ab53:::10015d6313d.00000000:head#
2020-07-15 02:59:45.975 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 5#6:2984ab53:::10015d6313d.00000000:head# error during read:  0~4000 (5) Input/output error
921747443:2020-07-15 08:25:42.711 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 6#6:13ab44c4:::10015d81a92.00000002:head# extent 0x3fbcfd20000~20000 or a subset is already allocated (misreferenced)
   -16> 2020-07-15 08:25:42.711 7f402253d940 -1 bluestore(/var/lib/ceph/osd/ceph-164) fsck error: 6#6:13ab44c4:::10015d81a92.00000002:head# extent 0x3fbcfd20000~20000 or a subset is already allocated (misreferenced)

#8 Updated by Lawrence Smith 19 days ago

... Sorry, the preceding "921747443:" in line 6 is just a leftover line number from the grep -n I initially ran and forgot to delete.

#9 Updated by Igor Fedotov 13 days ago

Hi Lawrence,
sorry for the delayed response.

So my hypothesis about data corruption is still valid. You have at least 2 errors for this OSD. As far as I can tell they're independent of each other, but they might share the same origin.
1) Onode 5#6:2984ab53:::10015d6313d.00000000:head# fails checksum verification for some of its data. This results in an fsck error and might result in data read errors later on.
2) Onode 6#6:13ab44c4:::10015d81a92.00000002:head# uses a physical extent that is also referenced from another onode. Highly likely this extent was overwritten by that second onode, hence the compression header decoding asserts. For unknown reasons checksums aren't present for this specific onode, which permits broken data to reach the decompressor.

The onode's extent from 1) might also be misreferenced, but deep fsck is unable to complete due to 2).
I can't tell what the root cause of the misreference is for now - it could have happened long ago, with a different Ceph release... So to start investigating this issue we should get rid of the existing misreferences and start monitoring for new ones, just to learn whether this bug is still present.

A more practical need is to eliminate the current assertions. One can do that by removing the broken onode(s) which cause them. For now it's the onode from 2) that has to be removed, but additional ones might be found once fsck is able to proceed. Let's leave the onode from 1) as-is for now.
To remove an onode one can use ceph-objectstore-tool against the offline OSD. You'll need the 'list' command to learn the pgid for the object in question, and then proceed with the 'remove' command.

Then put the OSD back online and initiate a deep scrub for this specific PG via the ceph CLI. This might crash some OSDs again at different broken object(s), if any, so you'll need to repeat the remove/scrub steps for those newly detected objects.

Prior to running the above recovery plan you might want to run a regular (non-deep) fsck, which will hopefully detect all the misreferences but not the checksum errors. It should complete smoothly as there is no data reading/header decoding along this path. That way you can estimate how many broken objects are present on the OSD, although not all of them cause an OSD crash...
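
For illustration, a rough command sketch of this recovery plan, assuming OSD 164 and with the pgid and object spec as placeholders to be taken from the 'list' output (paths, IDs and service management will differ per deployment):

# regular (metadata-only) fsck against the stopped OSD to count misreferences
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-164

# find the pgid and full JSON spec of the broken object
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-164 --op list | grep 10015d81a92

# remove the broken object; <pgid> and '<object-json>' come from the 'list' output above
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-164 --pgid <pgid> '<object-json>' remove

# bring the OSD back online (however the service is managed) and deep-scrub the affected PG
systemctl start ceph-osd@164
ceph pg deep-scrub <pgid>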

#10 Updated by Lawrence Smith 4 days ago

Thank you for the detailed answer. I used the procedure you described to find all misreferenced objects (215 in total) and delete them. Afterwards I ran an fsck on each OSD and found no more such objects.

We restarted Ceph on all hosts on Wednesday evening and issued a deep scrub for all OSDs that had found such objects.

The following morning we had a power outage. Shortly after bringing the system back online, some OSDs started crashing once again in _decompress (we had no crashes in the 14 hours beforehand; afterwards we had 4 crashes caused by _decompress within an hour).

Sadly I do not have fsck output from after I removed all the misreferenced objects. This is the fsck output of one OSD before removing the object:

2020-07-28 18:13:39.841 7ff2de78d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: 7#6:23dceaa4:::10014020c30.00000000:head# extent 0x7252d60000~10000 or a subset is already allocated (misreferenced)
2020-07-28 18:14:27.589 7ff2de78d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: free extent 0x2baeb0d0000~40000 intersects allocated blocks
2020-07-28 18:14:27.899 7ff2de78d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: leaked extent 0x3062a40000~10000
2020-07-28 18:14:27.899 7ff2de78d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: leaked extent 0x24553840000~10000
fsck found 4 error(s)

After the crash, this OSD crashed again while scrubbing a different object. Here is the fsck output from after the power outage:

2020-07-31 14:57:26.640 7f92f121d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: 5#6:d118994b:::10003b4aa32.0000004c:head# extent 0x7252d60000~10000 or a subset is already allocated (misreferenced)
2020-07-31 14:58:47.106 7f92f121d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: free extent 0x2baeb0d0000~40000 intersects allocated blocks
2020-07-31 14:58:47.426 7f92f121d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: leaked extent 0x3062a40000~10000
2020-07-31 14:58:47.426 7f92f121d940 -1 bluestore(/var/lib/ceph/osd/ceph-59) fsck error: leaked extent 0x24553840000~10000
fsck found 4 error(s)

I noticed that the extent mentioned for the new misreferenced object is always the same as the extent of the object I first removed, while the PG of the new misreferenced object is different from all previously affected PGs.

In the meantime, scrubbing has been reporting a lot of scrub errors and inconsistent PGs:

 2020-07-31 12:55:21+02 | kaa-57  | ceph-osd | 190266 |        6 | 2020-07-31 12:55:21.671 7fbe153e7700 -1 log_channel(cluster) log [ERR] : 6.1c9s0 deep-scrub 1 missing, 0 inconsistent objects
 2020-07-31 12:55:21+02 | kaa-57  | ceph-osd | 190266 |        6 | 2020-07-31 12:55:21.671 7fbe153e7700 -1 log_channel(cluster) log [ERR] : 6.1c9 deep-scrub 1 errors
 2020-07-31 12:55:21+02 | kaa-50  | ceph-osd | 189867 |        6 | 2020-07-31 12:55:21.729 7ff29b068700 -1 log_channel(cluster) log [ERR] : 6.25es0 shard 312(6) 6:7a40113a:::10016516ec3.00000838:head : missing
 2020-07-31 12:16:52+02 | kaa-114 | ceph-osd | 75075  |        6 | 2020-07-31 12:16:52.518 7fe31b4ec700 -1 log_channel(cluster) log [ERR] : 6.341 shard 240(7) soid 6:82deec73:::10001a511c5.00000000:head : candidate had a read error
 2020-07-31 12:11:41+02 | kaa-127 | ceph-osd | 7072   |        6 | 2020-07-31 12:11:41.477 7f46520e7700 -1 log_channel(cluster) log [ERR] : 6.1fcs0 shard 256(4) 6:3f9408d5:::1001642c93a.00000000:head : missing
 2020-07-31 11:55:58+02 | kaa-57  | ceph-osd | 190266 |        6 | 2020-07-31 11:55:58.335 7fbe153e7700 -1 log_channel(cluster) log [ERR] : 6.1c9s0 shard 225(7) 6:93a1dfa8:::1001630b27e.00000de2:head : missing

The missing objects are those I had deleted, but we also consistently see these read errors appearing. The PG with the read error is not among those from which I deleted objects, nor is it affected by any new misreferenced objects.

I'm not sure where to look next for this problem. We are running a rather heterogeneous cluster, and different kinds of hard drives across the EC pool are affected. We have the write cache enabled on all our disks. Since we are running a self-compiled version of Ceph, I am recompiling it with gcc 9.3.0 to rule out any errors our older gcc 7.3.0 may have caused.
A short SMART self-test came back clean for the affected hard drives, so I have just started a long self-test on multiple affected devices.
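
For reference, the SMART checks amount to something like the following (the device name is a placeholder):

# start a long self-test; it runs in the background on the drive
smartctl -t long /dev/sdX

# check the self-test log and overall health once it has finished
smartctl -l selftest /dev/sdX
smartctl -H /dev/sdX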
