Bug #48849

open

BlueStore.cc: 11380: FAILED ceph_assert(r == 0)

Added by Christian Rohmann over 3 years ago. Updated almost 3 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We experienced a few OSD crashes all with the same signature in the logs:

--- cut ---
2021-01-08 06:13:54.946 7f37289d0700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 1858706858, got 291516022 in db/146430.sst offset 32863300 size 538979 code = 2 Rocksdb transaction:
Put( Prefix = P key = 0x0000000000000974'.0001234154.00000000000472581834' Value size = 184)
Put( Prefix = P key = 0x0000000000000974'._fastinfo' Value size = 186)
Put( Prefix = O key = 0x7f800000000000000205e3b390217262'd_data.e46d9e7c3dbd3d.0000000000007c24!='0xfffffffffffffffeffffffffffffffff6f00000000'x' Value size = 468)
Put( Prefix = O key = 0x7f800000000000000205e3b390217262'd_data.e46d9e7c3dbd3d.0000000000007c24!='0xfffffffffffffffeffffffffffffffff'o' Value size = 541)
Merge( Prefix = b key = 0x0000012a02680000 Value size = 16)
Merge( Prefix = b key = 0x0000012ca1980000 Value size = 16)
Merge( Prefix = b key = 0x0000012ca1980000 Value size = 16)
Merge( Prefix = b key = 0x0000017981680000 Value size = 16)
2021-01-08 06:13:54.974 7f37289d0700 -1 /build/ceph-14.2.16/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7f37289d0700 time 2021-01-08 06:13:54.954115
/build/ceph-14.2.16/src/os/bluestore/BlueStore.cc: 11380: FAILED ceph_assert(r == 0)

ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55fd49604fba]
2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55fd49605195]
3: (BlueStore::_kv_sync_thread()+0x1144) [0x55fd49b78bb4]
4: (BlueStore::KVSyncThread::entry()+0xd) [0x55fd49b9bd8d]
5: (()+0x76db) [0x7f37381136db]
6: (clone()+0x3f) [0x7f3736eb071f]

2021-01-08 06:13:54.978 7f37289d0700 -1 *** Caught signal (Aborted) **
in thread 7f37289d0700 thread_name:bstore_kv_sync

ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (()+0x12980) [0x7f373811e980]
2: (gsignal()+0xc7) [0x7f3736dcdfb7]
3: (abort()+0x141) [0x7f3736dcf921]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x55fd4960500b]
5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55fd49605195]
6: (BlueStore::_kv_sync_thread()+0x1144) [0x55fd49b78bb4]
7: (BlueStore::KVSyncThread::entry()+0xd) [0x55fd49b9bd8d]
8: (()+0x76db) [0x7f37381136db]
9: (clone()+0x3f) [0x7f3736eb071f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- cut ---

While there was somewhat of a burst of multiple OSDs on the same machine experiencing this very error, it has occurred on other nodes as well.
I added quite a bit more of the ceph-osd.log in osd_log_assert_failure.log and would be more than happy to provide more details to narrow this down if I can.


Files

osd_log__assert_failure.log.gz (204 KB) - Christian Rohmann, 01/12/2021 12:39 PM
osd-59-fsck.output (7.9 KB) - fsck --deep on crashed OSD - Christian Rohmann, 01/13/2021 02:45 PM
all_crash_reports.txt (8.33 KB) - ceph crash info for all crashes - Christian Rohmann, 01/17/2021 03:31 PM
crash_stacktraces.log (65.2 KB) - Christian Rohmann, 01/19/2021 10:11 AM
journald_output_all_osd.tar.gz (714 KB) - Christian Rohmann, 01/19/2021 06:26 PM
perf_dump_osds.tar.gz (22 KB) - ceph osd perf dump of all OSDs - Christian Rohmann, 02/02/2021 09:56 AM
#1

Updated by Christian Rohmann over 3 years ago

BTW, I looked through other reported issues and found https://tracker.ceph.com/issues/48002 or https://tracker.ceph.com/issues/41367 to be similar, but not really the very same issue.

#2

Updated by Igor Fedotov over 3 years ago

Wondering if you had experienced any recent OSD crashes prior to this failure?

You might also want to check for HW failures as the first step.

Running a deep fsck for at least one daemon is worth trying as well. This will hopefully answer the primary question of whether the failure is persistent, which I suppose it is. And at this point, if the OSDs are still able to start up, I'd recommend considering their redeployment ASAP, which in turn has to be done carefully, since more writing to these OSDs can make the situation even worse.

You might want to read through the following mail thread to learn the procedure for disabling DB compaction, which IMO makes sense at this point:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DEOBAUXQBUFL6HNBBNJ3LMQUCQC76HLY/
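For reference, a rough sketch of how auto-compaction can be switched off (the thread above has the exact procedure; osd.59 is just an example id, and appending to the running option string is an assumption here):

# Sketch only -- see the mail thread for the authoritative steps.
# disable_auto_compactions=true is a stock RocksDB option; here it is appended to
# the option string the OSD is currently running with, then the OSD is restarted
# so the new string takes effect.
opts=$(ceph daemon osd.59 config get bluestore_rocksdb_options | jq -r .bluestore_rocksdb_options)
ceph config set osd.59 bluestore_rocksdb_options "${opts},disable_auto_compactions=true"
systemctl restart ceph-osd@59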

As for the other reports: highly likely you've got some sort of data corruption in the KV store, and the subsequent assertion is just a response to detecting that corruption. The other tickets look similar because they are about KV data corruption as well, but at this point it's hard to estimate whether the root cause of the corruption is the same.

#3

Updated by Christian Rohmann over 3 years ago

Igor Fedotov wrote:

Wondering if you had experienced any recent OSD crashes prior to this failure?

Not recently, no. The last OSD crashes happened in September of 2020 (and they were not on this node).
Do you think the fact that they were also in _kv_sync_thread():

/build/ceph-14.2.10/src/os/bluestore/BlueStore.cc: 11068: FAILED ceph_assert(r == 0)
 -1 /build/ceph-14.2.11/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fb2b2f67700 time 2020-09-01 11:08:35.502791

makes them related in any way? Mind you, this was still with another version (14.2.10). Currently all our OSDs run 14.2.16.

You might also want to check for HW failures as the first step.

From what we can see (kernel logs, IO error counters, device resets, SMART, ...), there is no indication of a hardware issue.
But it's somewhat strange that multiple OSDs on the same host had the "exact" same crash spread over a few days.
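The checks were roughly along these lines (illustrative sketch only; /dev/sda is just a placeholder device name):

# Kernel log and SMART checks (sketch; adjust the device name per disk)
dmesg -T | grep -iE 'i/o error|reset|blk_update_request'
smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect|crc'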

Running a deep fsck for at least one daemon is worth trying as well. This will hopefully answer the primary question of whether the failure is persistent, which I suppose it is. And at this point, if the OSDs are still able to start up, I'd recommend considering their redeployment ASAP, which in turn has to be done carefully, since more writing to these OSDs can make the situation even worse.

You might want to read through the following mail thread to learn the procedure for disabling DB compaction, which IMO makes sense at this point:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DEOBAUXQBUFL6HNBBNJ3LMQUCQC76HLY/

As for the other reports: highly likely you've got some sort of data corruption in the KV store, and the subsequent assertion is just a response to detecting that corruption. The other tickets look similar because they are about KV data corruption as well, but at this point it's hard to estimate whether the root cause of the corruption is the same.

Thank you very much for the pointers!

We offlined an OSD and ran an fsck --deep, which resulted in

/build/ceph-14.2.16/src/os/bluestore/Allocator.cc: 53: FAILED ceph_assert(r == 0)

errors being reported. See all the output in the attached file osd-59-fsck.output.

#4

Updated by Christian Rohmann over 3 years ago

Could there be an issue with a recent RocksDB update as observed by the Gentoo folks: https://bugs.gentoo.org/764221 ?

Is this in any way related to the bug / discussion that you referenced (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DEOBAUXQBUFL6HNBBNJ3LMQUCQC76HLY/) regarding RocksDB corruption?

Or https://tracker.ceph.com/issues/48276 and the bitmap allocator?

#5

Updated by Christian Rohmann over 3 years ago

I ran the fsck with bluestore debug level 20 again, in case you need more details. The whole file is about 18 GB (uncompressed).
But maybe the last bit leading up to the assert error would be of help? Just let me know what you might need.

#6

Updated by Igor Fedotov over 3 years ago

Christian Rohmann wrote:

We offlined an OSD and ran an fsck --deep, which resulted in

[...]

errors being reported. See all the output in the attached file osd-59-fsck.output.

This is rather a duplicate of a minor issue reported at https://tracker.ceph.com/issues/47443

And it's different from your original issue which looks like a DB corruption of unknown origin.

You might want to specify a different allocator when running fsck to work around it, e.g.
CEPH_ARGS="--bluestore-allocator=avl" ceph-bluestore-tool ....

But I presume the general fsck result should be reported prior to this assertion, hence I'm wondering what the stdout(?) output was. Something saying how many errors/warnings were found during the check...

#7

Updated by Igor Fedotov over 3 years ago

Christian Rohmann wrote:

Could there be an issue with a recent RocksDB update as observed by the Gentoo folks: https://bugs.gentoo.org/764221 ?

They're facing https://tracker.ceph.com/issues/47443
The same applies to your fsck run.
But your original issue is different.

Is this in any way related to the bug / discussion the you referenced (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DEOBAUXQBUFL6HNBBNJ3LMQUCQC76HLY/) regarding RocksDB corruption?

I don't know whether it's related or not at this point. Well, the final state is pretty similar: data corruption in RocksDB. But whether the root cause is the same I can't say for now.

Or https://tracker.ceph.com/issues/48276 and the bitmap allocator?

I doubt this one is related: it causes an assertion while attempting to access an out-of-bounds offset on disk, which isn't the case for you.

#8

Updated by Igor Fedotov over 3 years ago

So at this point I need some clarification on what you'd like to do first:
1) Recover the OSDs. It's still unclear whether the OSDs are able to start up (in regular operational mode, not fsck) at this point.
Once fsck completes (if it's successful, there is a good chance that the actual data are unaffected and retrievable), one can try to pull the data off these OSDs and redeploy them.
or
2) Investigate the root cause of the failure (which in fact has little chance of success).
For this I'd suggest preserving as many of the logs preceding the crash as possible. Hopefully they'll provide some insight.

#9

Updated by Christian Rohmann over 3 years ago

Igor Fedotov wrote:

Christian Rohmann wrote:

We offlined an OSD and ran an fsck --deep, which resulted in

[...]

errors being reported. See all the output in the attached file osd-59-fsck.output.

This is rather a duplicate of a minor issue reported at https://tracker.ceph.com/issues/47443

And it's different from your original issue which looks like a DB corruption of unknown origin.

You might want to specify a different allocator when running fsck to work around it, e.g.
CEPH_ARGS="--bluestore-allocator=avl" ceph-bluestore-tool ....

But I presume the general fsck result should be reported prior to this assertion, hence I'm wondering what the stdout(?) output was. Something saying how many errors/warnings were found during the check...

Yes, the error the fsck ended with is different from the original crash.
But there is no other output from fsck besides what's in https://tracker.ceph.com/attachments/download/5324/osd-59-fsck.output.
So there is no report on the number of errors or anything of that sort.

The OSD is still offline and unchanged. I'll rerun the fsck with the proposed allocator workaround now.
Any particular debug log level you want?

#10

Updated by Christian Rohmann over 3 years ago

Igor Fedotov wrote:

So at this point I need some clarification on what you'd like to do first:
1) Recover the OSDs. It's still unclear whether the OSDs are able to start up (in regular operational mode, not fsck) at this point.
Once fsck completes (if it's successful, there is a good chance that the actual data are unaffected and retrievable), one can try to pull the data off these OSDs and redeploy them.
or
2) Investigate the root cause of the failure (which in fact has little chance of success).
For this I'd suggest preserving as many of the logs preceding the crash as possible. Hopefully they'll provide some insight.

I'd rather investigate (or help you analyze) the issue further (option 2). Multiple OSDs on that node crashed (at various times over the course of a few days), but they all came back up and are up, apart from the one we are fscking now to narrow down the root cause.

Thanks a ton for digging into this with me! Just let me know what you might need to further analyze the root cause.

#11

Updated by Christian Rohmann over 3 years ago

Clarification: all OSDs that crashed came back up; I manually (!) took a single one offline to do the fsck. Until then that one was back online as well.

#12

Updated by Christian Rohmann over 3 years ago

The fsck using

CEPH_ARGS="--bluestore-allocator=avl" ceph-bluestore-tool fsck --path $osdpath --deep true --debug-bluestore=20

just returned

fsck success

So this is of no help with regard to the originally observed crash then?

#13

Updated by Igor Fedotov over 3 years ago

Christian Rohmann wrote:

The fsck using

[...]

just returned

[...]

So this is of no help with regard to the originally observed crash then?

Well, this means that the data corruption is rather non-persistent. This is helpful if one needs to recover the data, but it provides little help for root cause analysis.

Could you please share the log snippets for different crashes?

#14

Updated by Christian Rohmann over 3 years ago

I did a

 ceph crash info

on all recent crashes. See all_crash_reports.txt for the details.
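Roughly how the file was put together (a sketch only; the exact crash ids are in the attachment):

# Sketch: loop over all recorded crashes and dump each crash report
ceph crash ls | awk '/^20/ {print $1}' | while read -r id; do
    echo "=== ${id} ==="
    ceph crash info "${id}"
done > all_crash_reports.txt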

Some observations:

  • the stack traces look REALLY similar
  • all crashes happened on the same node (which indicates a HW issue, right?)
  • a different OSD was affected every time
  • the crashes happened hours apart
#15

Updated by Igor Fedotov about 3 years ago

Actual OSD logs prior to the assertion are what I'd really like to see. The attached crash info isn't enough: it's just the kv_sync_thread backtrace (which is a consequence of the data corruption that happened before). The RocksDB log output immediately preceding the exception is more interesting, e.g.:

-12> 2021-01-08 06:13:54.418 7f37289d0700  4 rocksdb: [db/db_impl_write.cc:1470] [default] New memtable created with log file: #146429. Immutable memtables: 0.
-11> 2021-01-08 06:13:54.418 7f3729bd8700  4 rocksdb: (Original Log Time 2021/01/08-06:13:54.425624) [db/db_impl_compaction_flush.cc:2073] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots available 1, flush slots scheduled 1, compaction slots scheduled 0
-10> 2021-01-08 06:13:54.418 7f3729bd8700 4 rocksdb: [db/flush_job.cc:322] [default] [JOB 2675] Flushing memtable with next log file: 146429
-9> 2021-01-08 06:13:54.418 7f3729bd8700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1610086434425712, "job": 2675, "event": "flush_started", "num_memtables": 1, "num_entries": 859428, "num_deletes": 107432, "total_data_size": 244562063, "memory_usage": 260051184, "flush_reason": "Write Buffer Full"}
-8> 2021-01-08 06:13:54.418 7f3729bd8700 4 rocksdb: [db/flush_job.cc:351] [default] [JOB 2675] Level-0 flush table #146430: started
-7> 2021-01-08 06:13:54.558 7f37279ce700 5 prioritycache tune_memory target: 4294967296 mapped: 4357226496 unmapped: 465674240 heap: 4822900736 old mem: 2282611566 new mem: 2251913762
-6> 2021-01-08 06:13:54.562 7f370eba2700 5 osd.59 1234154 heartbeat osd_stat(store_statfs(0x9898387000/0xb9b30000/0x1bf19400000, data 0x12a2815939e/0x125c7240000, compress 0x0/0x0/0x0, omap 0xbe7a43, meta 0xb8f485bd), peers [0,1,2,3,4,5,6,8,9,10,11,12,13,15,16,17,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,40,41,42,43,44,45,46,47,48,49,50,51,52,53,58,60] op hist [1])
-5> 2021-01-08 06:13:54.946 7f3729bd8700 4 rocksdb: [db/flush_job.cc:392] [default] [JOB 2675] Level-0 flush table #146430: 33403146 bytes Corruption: block checksum mismatch: expected 1858706858, got 291516022 in db/146430.sst offset 32863300 size 538979
-4> 2021-01-08 06:13:54.946 7f3729bd8700 2 rocksdb: [db/db_impl_compaction_flush.cc:2123] Waiting after background flush error: Corruption: block checksum mismatch: expected 1858706858, got 291516022 in db/146430.sst offset 32863300 size 538979Accumulated background error counts: 1
-3> 2021-01-08 06:13:54.946 7f3729bd8700 4 rocksdb: (Original Log Time 2021/01/08-06:13:54.952439) EVENT_LOG_v1 {"time_micros": 1610086434952421, "job": 2675, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [2, 4, 37, 0, 0, 0, 0], "immutable_memtables": 1}
-2> 2021-01-08 06:13:54.946 7f37289d0700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 1858706858, got 291516022 in db/146430.sst offset 32863300 size 538979 code = 2 Rocksdb transaction:

From the log snippet above one can see that the issue happened during a memtable flush. Wondering if the other cases are bound to it as well...

#16

Updated by Christian Rohmann about 3 years ago

I attached stack traces in file crash_stacktraces.log.

Unfortunately, since this was shipped from syslog -> Elasticsearch with only per-second time precision, the order of the stack traces is a little messed up.

I sincerely apologize and hope you can still use this to further narrow down the issue.

#17

Updated by Igor Fedotov about 3 years ago

This doesn't have the preceding rocksdb log output, hence it is of little use.

Sorry, it looks like a dead end from a root cause analysis point of view for now.

#18

Updated by Christian Rohmann about 3 years ago

I was able to dump all of the output of the OSDs from journald now, properly timestamped; see the attached file journald_output_all_osd.tar.gz.
Apparently there is no rocksdb-related output, as it does not contain anything that looks like "rocksdb: EVENT_LOG_v1".
Really too bad that we rotated the rocksdb logs away; those logs of all the OSDs are of no help.

#19

Updated by Igor Fedotov about 3 years ago

Christian Rohmann wrote:

I was able to dump all of the output of the OSDs from journald now, properly timestamped; see the attached file journald_output_all_osd.tar.gz.
Apparently there is no rocksdb-related output, as it does not contain anything that looks like "rocksdb: EVENT_LOG_v1".
Really too bad that we rotated the rocksdb logs away; those logs of all the OSDs are of no help.

Hi @Christian,
sorry for the long analysis. But again nothing very interesting...
One more question.
Could you please dump the performance counters for all the OSDs (or at least the ones which have ever experienced the issue) and inspect the "bluestore_reads_with_retries" one? Are there any non-zero values present? Please note that this counter is reset on OSD restart, hence the longer an OSD has been running the more helpful the result.
A non-zero value indicates transient disk read errors that BlueStore has detected and retried. We've observed such issues before under some still unclear conditions. Unfortunately such error handling (both detection and read retries) is available for user data only, and the DB might still suffer from such errors. So I just want to learn whether such errors are observed from time to time...
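E.g. something along these lines on each OSD node (just a sketch; the osd ids below are placeholders for whatever runs locally):

# Sketch: query the counter through each OSD's admin socket
for id in 54 55 56 57 58 59; do
    printf 'osd.%s bluestore_reads_with_retries: ' "$id"
    ceph daemon "osd.${id}" perf dump | jq .bluestore.bluestore_reads_with_retries
done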

#20

Updated by Christian Rohmann about 3 years ago

Igor Fedotov wrote:

Hi @Christian,
sorry for the long analysis. But again nothing very interesting...

Too bad. But thank you for staying at it!

One more question.
Could you please dump the performance counters for all the OSDs (or at least the ones which have ever experienced the issue) and inspect the "bluestore_reads_with_retries" one? Are there any non-zero values present? Please note that this counter is reset on OSD restart, hence the longer an OSD has been running the more helpful the result.
A non-zero value indicates transient disk read errors that BlueStore has detected and retried. We've observed such issues before under some still unclear conditions. Unfortunately such error handling (both detection and read retries) is available for user data only, and the DB might still suffer from such errors. So I just want to learn whether such errors are observed from time to time...

Attached in the file perf_dump_osds.tar.gz.

But bluestore_reads_with_retries is at "0" for all of the OSDs...

# jq .bluestore.bluestore_reads_with_retries osd.5*

0
0
0
0
0
0

#21

Updated by Igor Fedotov about 3 years ago

@Christian - thanks for the update. Could you please keep monitoring these counters on a per-day basis for a while?

And some more questions:
1) Do the OSDs have standalone, faster drives for DB/WAL, or do they share the main device for both user data and DB?
2) What hardware (disk drives and controllers) is behind the DB? Are any RAID controllers in use? From my experience the latter tend to hide disk read errors and/or smartctl info, and hence you might be lacking a proper view of your disks' state. Maybe there is some correlation between specific nodes or specific H/W and the failures, or something?

#22

Updated by Christian Rohmann about 3 years ago

Igor Fedotov wrote:

@Christian - thanks for the update. Could you please keep monitoring these counters on a per-day basis for a while?

I sure can, but the system / OSDs have been up for 15 days now and the crash I reported has apparently not appeared again.

And some more questions:
1) Do the OSDs have standalone, faster drives for DB/WAL, or do they share the main device for both user data and DB?

They share the main drive.
They are SSDs, Seagate Nytro 1000 (XA1920ME10063) - 1.92 TB each.

2) What hardware (disk drives and controllers) is behind the DB? Are any RAID controllers in use? From my experience the latter tend to hide disk read errors and/or smartctl info, and hence you might be lacking a proper view of your disks' state. Maybe there is some correlation between specific nodes or specific H/W and the failures, or something?

No RAID (of course); they are connected to an HBA ("LSI Logic HBA 9300-8i SGL SAS").

#23

Updated by Igor Fedotov about 3 years ago

Adam has submitted a PR which might be helpful in detecting transient read errors:

https://github.com/ceph/ceph/pull/39185

#24

Updated by Chris K about 3 years ago

I think I have encountered this same issue on 15.2.5 running Ubuntu 18.04.5. I reproduced the problem with debug_rocksdb set to 20/5. I have logs for all affected OSDs covering a substantial time leading up to the crash event.
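For reference, raising the log level looked roughly like this (a sketch; applying it to all OSDs at once via "tell" is just one way to do it):

ceph tell osd.* injectargs '--debug_rocksdb=20/5'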

Would it be better to create a new bug and reference this bug or simply add what I have to this one?

#25

Updated by Igor Fedotov about 3 years ago

Chris K wrote:

I think I have encountered this same issue on 15.2.5 running Ubuntu 18.04.5. I reproduced the problem with debug_rocksdb set to 20/5. I have logs for all affected OSDs covering a substantial time leading up to the crash event.

Would it be better to create a new bug and reference this bug or simply add what I have to this one?

Not a big deal, but IMO a different ticket is better...

#26

Updated by Neha Ojha almost 3 years ago

  • Status changed from New to Need More Info