Bug #39525
lz4 compressor corrupts data when buffers are unaligned
Status: Closed
Description
While bringing a new storage server online, we observed that 5 of the 6 SSD OSDs we use to store metadata (with 3-fold replication) went offline. They now crash on every restart attempt with an abort signal in OSDMap::decode, which means our CephFS storage is no longer available.
Here is a brief excerpt from the log:
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:db/MANIFEST-001493 succeeded,manifest_file_number is 1493, next_file_number is 1496, last_sequence is 45904669, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 1491
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:3370] Column family [default] (ID 0), log number is 1492
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1556294168125624, "job": 1, "event": "recovery_started", "log_files": [1494]}
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:551] Recovering log #1494 mode 0
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:2863] Creating manifest 1496
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1556294168126875, "job": 1, "event": "recovery_finished"}
2019-04-26 17:56:08.127 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:1218] DB pointer 0x5634c2f60000
2019-04-26 17:56:08.127 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2019-04-26 17:56:08.135 7f4f2956ae00 1 freelist init
2019-04-26 17:56:08.143 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_alloc opening allocation metadata
2019-04-26 17:56:08.147 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_alloc loaded 223 GiB in 233 extents
2019-04-26 17:56:08.151 7f4f2956ae00 -1 *** Caught signal (Aborted) **
in thread 7f4f2956ae00 thread_name:ceph-osd
ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
1: (()+0x92b730) [0x5634c0151730]
2: (()+0x12890) [0x7f4f1f02b890]
3: (gsignal()+0xc7) [0x7f4f1df06e97]
4: (abort()+0x141) [0x7f4f1df08801]
5: (()+0x8c8b7) [0x7f4f1e8fb8b7]
6: (()+0x92a06) [0x7f4f1e901a06]
7: (()+0x92a41) [0x7f4f1e901a41]
8: (()+0x92c74) [0x7f4f1e901c74]
9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1864) [0x7f4f20aff694]
10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f4f20b00af1]
11: (OSDService::try_get_map(unsigned int)+0x508) [0x5634bfbf73a8]
12: (OSDService::get_map(unsigned int)+0x1e) [0x5634bfc56ffe]
13: (OSD::init()+0x1d5f) [0x5634bfc048ef]
14: (main()+0x383d) [0x5634bfaef8cd]
15: (__libc_start_main()+0xe7) [0x7f4f1dee9b97]
16: (_start()+0x2a) [0x5634bfbb97aa]
This might have happened due to high server load when a number of other OSDs came online (but none of the new ones were in the "ssd" class, so it should not have had any direct impact on backfilling of these OSDs).
The SSDs in question are a mix of Samsung SM863/883, 256GB. All Ceph components are 13.2.5, running on Ubuntu 18.04, dual socket Supermicro servers (2*8 cores, 64GB ECC memory) with LSI3008IT controllers in JBOD mode.
We have seen this at least once in the past, and I suspect it might be related to high load on the servers when many PGs are peering and/or large amounts of backfilling are happening. In that case it was only a single disk, so we "fixed" it by simply recreating that OSD - but this time we need to get these OSDs working to avoid losing metadata PGs.
We had another OSD fail simultaneously (with the same error) that did not contain any important data at all, which enabled us to play around a bit with ceph-objectstore-tool. Based on the errors above we tested replacing the osdmap with one from a working OSD, but even after this we get an error when trying to read the osdmap again:
root@storage05:/var/log/ceph# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-55 --op get-osdmap --file=/tmp/osdmap
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
  what(): buffer::malformed_input: bad crc, actual 3828477398 != expected 3773790681
*** Caught signal (Aborted) **
in thread 7f7611698f80 thread_name:ceph-objectstor
ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
1: (()+0x9895c0) [0x562e4aa3d5c0]
2: (()+0x12890) [0x7f760715c890]
3: (gsignal()+0xc7) [0x7f7606033e97]
4: (abort()+0x141) [0x7f7606035801]
5: (()+0x8c8b7) [0x7f7606a288b7]
6: (()+0x92a06) [0x7f7606a2ea06]
7: (()+0x92a41) [0x7f7606a2ea41]
8: (()+0x92c74) [0x7f7606a2ec74]
9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1864) [0x7f7607af2694]
10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f7607af3af1]
11: (get_osdmap(ObjectStore, unsigned int, OSDMap&, ceph::buffer::list&)+0x1e5) [0x562e4a473655]
12: (main()+0x39be) [0x562e4a3a43fe]
13: (__libc_start_main()+0xe7) [0x7f7606016b97]
14: (_start()+0x2a) [0x562e4a4727aa]
Aborted (core dumped)
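The "bad crc" message above comes from Ceph's general pattern of embedding a checksum in encoded structures and verifying it on decode. A minimal, self-contained sketch of that pattern in Python (using zlib's CRC32 for illustration; Ceph actually uses crc32c, and the real OSDMap encoding is far more involved):

```python
import struct
import zlib


def encode_with_crc(payload: bytes) -> bytes:
    # Append a little-endian CRC32 of the payload, mimicking (in toy form)
    # how Ceph embeds a checksum in encoded blobs.
    return payload + struct.pack("<I", zlib.crc32(payload) & 0xFFFFFFFF)


def decode_with_crc(blob: bytes) -> bytes:
    payload, trailer = blob[:-4], blob[-4:]
    (expected,) = struct.unpack("<I", trailer)
    actual = zlib.crc32(payload) & 0xFFFFFFFF
    if actual != expected:
        # This is the class of failure behind "bad crc, actual X != expected Y"
        raise ValueError(f"bad crc, actual {actual} != expected {expected}")
    return payload


blob = encode_with_crc(b"osdmap payload")
assert decode_with_crc(blob) == b"osdmap payload"

corrupted = bytearray(blob)
corrupted[0] ^= 0xFF  # flip a single payload byte
try:
    decode_with_crc(bytes(corrupted))
except ValueError as e:
    print(e)
```

A single flipped byte anywhere in the stored payload is enough to trip this check, which is consistent with a compression or buffer-handling bug corrupting the on-disk map.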
Finally, we used ceph-objectstore-tool to try to export the specific PGs that are offline. This works for some of the PGs, but for others I get the same abort message as above. Irritatingly enough, there is only a single PG that we were unable to export from any of the servers, so we still have a strong incentive to work around this.
And, of course, our other reason for wanting to fix this properly is that we are worried this may be a bug that can cause a chain effect among OSDs, leading to sudden unavailability and potential data loss even with replicated storage.
Our next step is to dig into the source and check whether it is just a matter of a CRC value not being updated, or whether some other osdmap-related data structure has been corrupted.
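One way to test the "stale CRC" hypothesis offline, sketched here in Python under the simplifying assumption of a payload followed by a trailing 4-byte little-endian CRC32 (the real OSDMap encoding differs, and Ceph uses crc32c rather than zlib's CRC32): recompute the checksum over the stored bytes and rewrite the trailer, then see whether the result decodes sanely.

```python
import struct
import zlib


def patch_trailing_crc(blob: bytes) -> bytes:
    # Toy layout assumption: payload + 4-byte little-endian CRC32 trailer.
    # This only illustrates the idea of rewriting a stale checksum so the
    # decoder accepts the stored bytes as-is.
    payload = blob[:-4]
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack("<I", crc)


# A blob whose payload is intact but whose stored CRC is stale:
payload = b"osdmap payload"
stale = payload + struct.pack("<I", 0xDEADBEEF)
fixed = patch_trailing_crc(stale)
assert fixed[:-4] == payload
assert struct.unpack("<I", fixed[-4:])[0] == zlib.crc32(payload) & 0xFFFFFFFF
```

If decoding still fails or yields nonsense after patching the checksum, the payload itself is corrupted (not merely the CRC), and the map would have to be replaced from a healthy OSD instead.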