Bug #39525 (closed)

lz4 compressor corrupts data when buffers are unaligned

Added by Erik Lindahl about 5 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: nautilus, mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While bringing a new storage server online we observed that 5 of the 6 SSD OSDs we use to store metadata (with 3-fold replication) went offline. They now crash with an abort signal in OSDMap::decode every time they try to restart, which means our CephFS storage is no longer available.

Here is a brief excerpt from the log:

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:db/MANIFEST-001493 succeeded,manifest_file_number is 1493, next_file_number is 1496, last_sequence is 45904669, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 1491

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:3370] Column family [default] (ID 0), log number is 1492

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1556294168125624, "job": 1, "event": "recovery_started", "log_files": [1494]}
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:551] Recovering log #1494 mode 0
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:2863] Creating manifest 1496

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1556294168126875, "job": 1, "event": "recovery_finished"}
2019-04-26 17:56:08.127 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:1218] DB pointer 0x5634c2f60000
2019-04-26 17:56:08.127 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2019-04-26 17:56:08.135 7f4f2956ae00 1 freelist init
2019-04-26 17:56:08.143 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_alloc opening allocation metadata
2019-04-26 17:56:08.147 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_alloc loaded 223 GiB in 233 extents
2019-04-26 17:56:08.151 7f4f2956ae00 -1 *** Caught signal (Aborted) **
in thread 7f4f2956ae00 thread_name:ceph-osd

ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
1: (()+0x92b730) [0x5634c0151730]
2: (()+0x12890) [0x7f4f1f02b890]
3: (gsignal()+0xc7) [0x7f4f1df06e97]
4: (abort()+0x141) [0x7f4f1df08801]
5: (()+0x8c8b7) [0x7f4f1e8fb8b7]
6: (()+0x92a06) [0x7f4f1e901a06]
7: (()+0x92a41) [0x7f4f1e901a41]
8: (()+0x92c74) [0x7f4f1e901c74]
9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1864) [0x7f4f20aff694]
10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f4f20b00af1]
11: (OSDService::try_get_map(unsigned int)+0x508) [0x5634bfbf73a8]
12: (OSDService::get_map(unsigned int)+0x1e) [0x5634bfc56ffe]
13: (OSD::init()+0x1d5f) [0x5634bfc048ef]
14: (main()+0x383d) [0x5634bfaef8cd]
15: (__libc_start_main()+0xe7) [0x7f4f1dee9b97]
16: (_start()+0x2a) [0x5634bfbb97aa]

This might have happened due to high server load when a number of other OSDs came online (but none of the new ones were in the "ssd" class, so it should not have had any direct impact on backfilling of these OSDs).

The SSDs in question are a mix of Samsung SM863/883, 256GB. All Ceph components are 13.2.5, running on Ubuntu 18.04, dual socket Supermicro servers (2*8 cores, 64GB ECC memory) with LSI3008IT controllers in JBOD mode.

We have seen this at least once before, and I suspect it might be related to high load on the servers when lots of PGs are peering and/or large amounts of backfilling are in progress. In that case it was only a single disk, so we "fixed" it by simply recreating that OSD - but this time we need to get these OSDs working again to avoid losing metadata PGs.

We had another OSD fail simultaneously (with the same error) that did not contain any important data, which let us experiment a bit with ceph-objectstore-tool. Based on the errors above we tested replacing the osdmap with one from a working OSD, but even after this we still get an error when trying to read the osdmap back (a rough sketch of the replace-and-verify sequence follows the trace below):

root@storage05:/var/log/ceph# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-55 --op get-osdmap --file=/tmp/osdmap
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
what(): buffer::malformed_input: bad crc, actual 3828477398 != expected 3773790681
*** Caught signal (Aborted) **
in thread 7f7611698f80 thread_name:ceph-objectstor
ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
1: (()+0x9895c0) [0x562e4aa3d5c0]
2: (()+0x12890) [0x7f760715c890]
3: (gsignal()+0xc7) [0x7f7606033e97]
4: (abort()+0x141) [0x7f7606035801]
5: (()+0x8c8b7) [0x7f7606a288b7]
6: (()+0x92a06) [0x7f7606a2ea06]
7: (()+0x92a41) [0x7f7606a2ea41]
8: (()+0x92c74) [0x7f7606a2ec74]
9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1864) [0x7f7607af2694]
10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f7607af3af1]
11: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, ceph::buffer::list&)+0x1e5) [0x562e4a473655]
12: (main()+0x39be) [0x562e4a3a43fe]
13: (__libc_start_main()+0xe7) [0x7f7606016b97]
14: (_start()+0x2a) [0x562e4a4727aa]
Aborted (core dumped)
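
For completeness, the replacement we tested followed roughly this pattern. It is only a sketch: the OSD ids and file names are illustrative, and it assumes both OSDs are stopped while ceph-objectstore-tool runs; --op set-osdmap is the write-side counterpart of the get-osdmap call shown above.

# On a healthy OSD (stopped): dump its current copy of the osdmap.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-XX --op get-osdmap --file=/tmp/osdmap.good

# On the broken OSD (also stopped): write that copy over the corrupt one.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-55 --op set-osdmap --file=/tmp/osdmap.good

# Read it back to see whether decode still throws the bad-crc error (it did in our case).
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-55 --op get-osdmap --file=/tmp/osdmap.check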

Finally, we used ceph-objectstore-tool to try to export the specific PGs that are offline. This works for some of the PGs, but for others we get the same abort as above. Frustratingly, there is exactly one PG that we could not export from any of the servers, so we still have a strong incentive to work around this.
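
The export attempts used the export op as sketched below; the pgid, OSD ids and paths are illustrative, and the import step (for moving a rescued PG onto a healthy OSD) assumes that OSD is stopped while the tool runs.

# On an affected OSD (stopped): export one of the offline PGs to a file.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-126 --op export --pgid 2.1a --file /tmp/pg.2.1a.export

# On a healthy OSD (stopped): import the PG, then restart the OSD and let peering/recovery take over.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-40 --op import --file /tmp/pg.2.1a.export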

And, of course, our other reason for wanting to fix this properly is that we are a bit worried this might be a bug that can cause a chain reaction among OSDs, leading to sudden unavailability and potential data loss even with replicated storage.

Our next step is to dig into the source and check whether it is just a matter of a CRC value not being updated, or whether some other osdmap-related data structure has been corrupted.
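
One quick check along those lines (a sketch; the epoch number is a placeholder) is to decode a map dumped from a working OSD outside the OSD process and compare it byte-for-byte with the monitors' copy of the same epoch:

# Decode/print the dumped map with the standalone tool.
osdmaptool /tmp/osdmap --print

# Fetch the same epoch directly from the monitors and compare the two blobs.
ceph osd getmap 1234 -o /tmp/osdmap.mon
cmp /tmp/osdmap /tmp/osdmap.mon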


Related issues (3 total: 0 open, 3 closed)

Related to RADOS - Bug #41240: All of the cluster SSDs aborted at around the same time and will not start. (Can't reproduce) - Brad Hubbard
Copied to RADOS - Backport #44489: mimic: lz4 compressor corrupts data when buffers are unaligned (Rejected) - Nathan Cutler
Copied to RADOS - Backport #44490: nautilus: lz4 compressor corrupts data when buffers are unaligned (Resolved) - Nathan Cutler
