Bug #39525


lz4 compressor corrupts data when buffers are unaligned

Added by Erik Lindahl almost 5 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
nautilus,mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In conjunction with taking a new storage server online we observed that 5 out of the 6 SSDs we use to store metadata (with 3-fold replication) went offline, and they now crash every time they try to restart with an abort signal in OSDMap::decode, which means our CephFS storage is no longer available.

Here is a brief excerpt from the log:

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:db/MANIFEST-001493 succeeded,manifest_file_number is 1493, next_file_number is 1496, last_sequence is 45904669, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 1491

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:3370] Column family [default] (ID 0), log number is 1492

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1556294168125624, "job": 1, "event": "recovery_started", "log_files": [1494]}
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:551] Recovering log #1494 mode 0
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/version_set.cc:2863] Creating manifest 1496

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1556294168126875, "job": 1, "event": "recovery_finished"}
2019-04-26 17:56:08.127 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/db_impl_open.cc:1218] DB pointer 0x5634c2f60000
2019-04-26 17:56:08.127 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2019-04-26 17:56:08.135 7f4f2956ae00 1 freelist init
2019-04-26 17:56:08.143 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_alloc opening allocation metadata
2019-04-26 17:56:08.147 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_alloc loaded 223 GiB in 233 extents
2019-04-26 17:56:08.151 7f4f2956ae00 -1 *** Caught signal (Aborted) **
in thread 7f4f2956ae00 thread_name:ceph-osd

ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
1: (()+0x92b730) [0x5634c0151730]
2: (()+0x12890) [0x7f4f1f02b890]
3: (gsignal()+0xc7) [0x7f4f1df06e97]
4: (abort()+0x141) [0x7f4f1df08801]
5: (()+0x8c8b7) [0x7f4f1e8fb8b7]
6: (()+0x92a06) [0x7f4f1e901a06]
7: (()+0x92a41) [0x7f4f1e901a41]
8: (()+0x92c74) [0x7f4f1e901c74]
9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1864) [0x7f4f20aff694]
10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f4f20b00af1]
11: (OSDService::try_get_map(unsigned int)+0x508) [0x5634bfbf73a8]
12: (OSDService::get_map(unsigned int)+0x1e) [0x5634bfc56ffe]
13: (OSD::init()+0x1d5f) [0x5634bfc048ef]
14: (main()+0x383d) [0x5634bfaef8cd]
15: (__libc_start_main()+0xe7) [0x7f4f1dee9b97]
16: (_start()+0x2a) [0x5634bfbb97aa]

This might have happened due to high server load when a number of other OSDs came online (but none of the new ones were in the "ssd" class, so it should not have had any direct impact on backfilling of these OSDs).

The SSDs in question are a mix of Samsung SM863/883, 256GB. All Ceph components are 13.2.5, running on Ubuntu 18.04, dual socket Supermicro servers (2*8 cores, 64GB ECC memory) with LSI3008IT controllers in JBOD mode.

We have seen this at least once in the past, and I suspect it might be related to high load (?) on the servers when lots of PGs are peering and/or large amounts of backfilling happen. In that case it was only a single disk, so we "fixed" it by just recreating that OSD - but this time we need to get them working to avoid losing metadata PGs.

We had another OSD fail simultaneously (with the same error) that did not contain any important data at all, which enabled us to play around a bit with ceph-objectstore-tool. Based on the errors above we tested replacing the osdmap with one from a working OSD, but even after this we get an error when trying to read the osdmap again:

root@storage05:/var/log/ceph# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-55 --op get-osdmap --file=/tmp/osdmap
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
what(): buffer::malformed_input: bad crc, actual 3828477398 != expected 3773790681
    *** Caught signal (Aborted) **
    in thread 7f7611698f80 thread_name:ceph-objectstor
    ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
    1: (()+0x9895c0) [0x562e4aa3d5c0]
    2: (()+0x12890) [0x7f760715c890]
    3: (gsignal()+0xc7) [0x7f7606033e97]
    4: (abort()+0x141) [0x7f7606035801]
    5: (()+0x8c8b7) [0x7f7606a288b7]
    6: (()+0x92a06) [0x7f7606a2ea06]
    7: (()+0x92a41) [0x7f7606a2ea41]
    8: (()+0x92c74) [0x7f7606a2ec74]
    9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1864) [0x7f7607af2694]
    10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f7607af3af1]
    11: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, ceph::buffer::list&)+0x1e5) [0x562e4a473655]
    12: (main()+0x39be) [0x562e4a3a43fe]
    13: (__libc_start_main()+0xe7) [0x7f7606016b97]
    14: (_start()+0x2a) [0x562e4a4727aa]
    Aborted (core dumped)

Finally, we used ceph-objectstore-tool to try and export the specific PGs that are offline. This works for some of the PGs, but for others I get the same abort message as above. Irritatingly enough, there is a single PG that we were not able to export from any of the servers, which means we still have a strong incentive to work around this.

And, of course, our other reason for wanting to fix this properly is that we're a bit worried this is a bug that can cause a chain effect among OSDs, leading to instant unavailability and potential data loss even with replicated storage.

Our next step is to play around with the source and check if it's just a matter of a CRC value not being updated, or if it's some other osdmap-related data structure that has been corrupted.


Related issues (3 total: 0 open, 3 closed)

Related to RADOS - Bug #41240: All of the cluster SSDs aborted at around the same time and will not start. (Can't reproduce, Brad Hubbard)

Copied to RADOS - Backport #44489: mimic: lz4 compressor corrupts data when buffers are unaligned (Rejected, Nathan Cutler)
Copied to RADOS - Backport #44490: nautilus: lz4 compressor corrupts data when buffers are unaligned (Resolved, Nathan Cutler)
Actions #1

Updated by Erik Lindahl almost 5 years ago

I might have gotten slightly further.

1) On one of the broken OSDs, the current_epoch is 34626 (clean_thru is also 34626, while last_epoch_mounted is 34561).

2) When I try to extract the osdmap for the same epoch (34626) from a working OSD, I get the same type of abort, which indicates that this osdmap epoch itself is broken on (all?) OSDs, rather than just on one OSD. Later epochs extract just fine.

3) However, if I try to extract the next epoch (34627) and set that on the broken OSD, I get a message about the version (obviously) not existing:

osdmap (#-1:94e8eb72:::osdmap.34627:0#) does not exist.

I guess this might narrow the problem down to simply fixing the osdmap (at the right epoch?) on the broken OSDs, but right now I have no idea how to achieve that ;-)
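For reference, the extract/inject cycle I have been attempting looks roughly like this (the OSD ids, paths and epoch are just examples, the OSDs have to be stopped while ceph-objectstore-tool runs, and the exact flags may differ between releases - check ceph-objectstore-tool --help):

# on a healthy OSD: extract a specific epoch of the full osdmap
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN --op get-osdmap --epoch 34626 --file=/tmp/osdmap.34626

# on the broken OSD: inject that map
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-55 --op set-osdmap --file=/tmp/osdmap.34626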

Actions #2

Updated by Greg Farnum almost 5 years ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to Correctness/Safety
  • Component(RADOS) OSD added

Hmm, are the working and broken OSDs actually running the same binary version? It should work anyway, but a bug in that area is the only plausible cause I can think of for a map working on one host but failing when it's transplanted to another.

Actions #3

Updated by Erik Lindahl almost 5 years ago

Hi Greg,

That might actually explain how it happened originally; we auto-deploy hosts with salt, and noticed that the new host came up with 13.2.5 instead of 13.2.4, which made us upgrade the others to 13.2.5 too. On the other hand, it would be difficult to update all hosts atomically without taking the cluster down :-)

The good news is that I think I managed to solve it after a bit more hacking. The downed OSDs had slightly different versions of the osdmap as their latest active one. Extracting that version with "ceph osd getmap <epoch>" worked fine, and then I could set it correctly - but for most of the OSDs that was not sufficient. Then I decided to simply extract the last ~20 versions of the osdmap and manually apply all of them to each OSD that refused to start - and that fixed it (roughly as sketched below).
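The procedure looked approximately like this - the epoch range and OSD id here are illustrative rather than the exact values I used, and the affected OSD must be stopped while ceph-objectstore-tool runs:

# fetch the last ~20 full maps from the monitors
for e in $(seq 34607 34626); do ceph osd getmap $e -o /tmp/osdmap.$e; done

# inject them one by one into the broken OSD
for e in $(seq 34607 34626); do ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-126 --op set-osdmap --file=/tmp/osdmap.$e; done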

I'm not skilled enough in the Ceph internals to suggest any of that should be done automatically (e.g. by asking the monitor for a newer map?), but it might help to add a log entry with the epoch of the osdmap the OSD is trying to read, so that when it crashes it is clearer which version of the map has to be replaced manually.

In any case, I think we can close the immediate bug, although I figured it might be good to have a more extended description here in case anybody else has the same problem. The key is that all the data is actually perfectly fine - one just has to patch several versions of the osdmap to get the OSD to start - and at least for me the trick was to use "ceph osd getmap <epoch>" to get a good version instead of trying to extract it from other OSDs with ceph-objectstore-tool.

Actions #4

Updated by Greg Farnum almost 5 years ago

  • Status changed from New to Need More Info

Was the problem 1) that different OSDs needed different maps, and they had mismatched CRCs when exported from a 13.2.4 OSD to a 13.2.5 one, but exporting from the monitors worked, or 2) that OSDs were actually missing maps they expected to have committed on disk?

It sounds like maybe you're saying the first, which is concerning because it implies there's some kind of OSDMap encoding issue between the point releases, but skimming through the source I don't see any plausible way for that to have gone wrong, and we haven't seen similar reports.

Actions #5

Updated by Erik Lindahl almost 5 years ago

The OSDs definitely had objects corresponding to the maps, but they failed the CRC check when trying to read them. All this happened before I moved anything manually. However, when I extracted the maps from the monitor instead they were in good shape.

So, no: I don't think there is any encoding issue at all. This was more a case of the logs not explicitly saying which epoch of the map needed to be replaced, combined with the user being reluctant to play around too much to avoid nuking all the metadata for ~150TB of storage (you can read that as a euphemism for PEBKAC).

In hindsight, given the fixes that worked, I suspect something like this happened. Caveat: Major speculation mode, since I'm not yet familiar with Ceph internal code, but we'll get there in the future!

1) When taking a new server with ~10 large disks online, backfilling started immediately, and one or more other servers got very high loads. Since the SSDs in particular were capable of moving several tens of thousands of objects per second, the CPU/memory might not have been able to keep up, and one or more of them crashed - no big deal.

2) However, this in turn might have caused even more data motion that caused other SSD OSDs to crash too, creating a bit of a chain effect, and suddenly five out of six had crashed. The last one then survived because it did not have any other disk to peer with.

3) During the crashes, the latest OSDMap stored on the OSD might have become corrupt (no idea if/when/how that can happen).

4) Since the OSDs did not fail at exactly the same point, there were a handful of different epoch corrupt maps. For a few of the OSDs this was not only limited to the last OSDMap, but also ones a few epochs old were broken - and I guess those were needed either to replay a journal or to peer when the server started.

5) Since the error message was still the same (due to an older map on the OSD also being bad), that first gave the impression that we didn't fix anything by just replacing the last-epoch map that was bad - but when I went back 20 epochs (which might have been overkill) it finally worked.

Actions #6

Updated by Sage Weil about 4 years ago

Notes from CERN incident:

- identical corruption, different OSDmaps on different OSDs:

7467c7467
< 00020080  ff 01 00 05 00 30 0a 83  00 18 00 00 00 00 00 00  |.....0..........|
---
> 00020080  ff 01 00 05 00 30 0a 83  00 18 00 00 00 00 01 00  |.....0..........|
7469c7469
< 000200a0  00 ec 00 00 00 00 01 00  00 13 01 00 00 55 01 00  |.............U..|
---
> 000200a0  00 ec 00 00 00 01 01 00  00 13 01 00 00 55 01 00  |.............U..|
7833,7834c7833,7834
< 00021e10  00 18 00 00 00 00 00 00  00 09 00 00 00 11 00 00  |................|
< 00021e20  00 d4 00 00 00 e1 00 00  00 ec 00 00 00 00 01 00  |................|
---
> 00021e10  00 18 00 00 00 00 01 00  00 09 00 00 00 11 00 00  |................|
> 00021e20  00 d4 00 00 00 e1 00 00  00 ec 00 00 00 01 01 00  |................|

- different epochs

gnit:c (master) 10:06 AM $ ~/src/ceph/build/bin/osdmaptool -p osdmap.2982809.good  |grep epoch
/home/sage/src/ceph/build/bin/osdmaptool: osdmap file 'osdmap.2982809.good'
epoch 2982809
gnit:c (master) 10:06 AM $ ~/src/ceph/build/bin/osdmaptool -p 2/osdmap.2982813.good |grep epoch
/home/sage/src/ceph/build/bin/osdmaptool: osdmap file '2/osdmap.2982813.good'
epoch 2982813

- first crashes

 ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
 1: (()+0xf5f0) [0x7ff61d2a35f0]
 2: (gsignal()+0x37) [0x7ff61c2c3337]
 3: (abort()+0x148) [0x7ff61c2c4a28]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7ff61cbd37d5]
 5: (()+0x5e746) [0x7ff61cbd1746]
 6: (()+0x5e773) [0x7ff61cbd1773]
 7: (()+0x5e993) [0x7ff61cbd1993]
 8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7ff6204f668e]
 9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7ff6204f7e31]
 10: (OSD::handle_osd_map(MOSDMap*)+0x1c2d) [0x55ab8ca9a24d]
 11: (OSD::_dispatch(Message*)+0xa1) [0x55ab8caa4361]
 12: (OSD::ms_dispatch(Message*)+0x56) [0x55ab8caa46b6]
 13: (DispatchQueue::entry()+0xb7a) [0x7ff6203f024a]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff62048e4bd]
 15: (()+0x7e65) [0x7ff61d29be65]
 16: (clone()+0x6d) [0x7ff61c38b88d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

- second crashes

 ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
 1: (()+0xf5f0) [0x7ffbbbe155f0]
 2: (gsignal()+0x37) [0x7ffbbae35337]
 3: (abort()+0x148) [0x7ffbbae36a28]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7ffbbb7457d5]
 5: (()+0x5e746) [0x7ffbbb743746]
 6: (()+0x5e773) [0x7ffbbb743773]
 7: (()+0x5e993) [0x7ffbbb743993]
 8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7ffbbf06868e]
 9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7ffbbf069e31]
 10: (OSDService::try_get_map(unsigned int)+0x4f8) [0x562efc272f38]
 11: (OSDService::get_map(unsigned int)+0x1e) [0x562efc2dc00e]
 12: (OSD::init()+0x1def) [0x562efc27e92f]
 13: (main()+0x23d2) [0x562efc160cc2]
 14: (__libc_start_main()+0xf5) [0x7ffbbae21505]
 15: (()+0x378b80) [0x562efc238b80]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

so... somehow the bad osdmap made it to disk, even though the code in handle_osd_map does a decode on both inc and full maps before saving. and that same crc check failed on load.

- possibly bufferlist crc caching is to blame? ...but somehow the same corruption appeared on multiple hosts...

- there was a bunch of bad network packets that correlate with the event. ...but data off the wire has a crc check...

Actions #7

Updated by Sage Weil about 4 years ago

ok, the first crash isn't because we just got bad data.. it's because we just read bad data off of disk. see:

2020-02-20 17:25:33.545 7f2ad9eb7700  3 osd.1141 2986674 handle_osd_map epochs [2986675,2986675], i have 2986674, src has [2982564,2986675]
2020-02-20 17:25:33.545 7f2ad9eb7700 10 osd.1141 2986674 handle_osd_map  got inc map for epoch 2986675
2020-02-20 17:25:33.545 7f2ad9eb7700 10 osd.1141 2986674 add_map_bl 2986674 615877 bytes
2020-02-20 17:25:33.547 7f2ae7070700 10 osd.1141 2986674 note_peer_epoch osd.1032 has 2986674
2020-02-20 17:25:33.555 7f2ad9eb7700 10 osd.1141 2986674 write_superblock sb(b4f463a0-c671-43a8-bd36-e40ab8d233d2 osd.1141 285d665d-2838-4398-b5cd-48bfb21736c7 e2986675 [2982564,2986675] lci=[2984241,2986675])
2020-02-20 17:25:33.586 7f2ad5d55700 10 osd.1141 2986674 _committed_osd_maps 2986675..2986675
...

we just wrote 2986675 to the store. and the crc checked out and all in handle_osd_map.
then a bit later we get the next inc map,
2020-02-20 17:25:35.883 7f2ad8eb5700  3 osd.1141 2986675 handle_osd_map epochs [2986676,2986676], i have 2986675, src has [2982564,2986676]
2020-02-20 17:25:35.883 7f2ad8eb5700 10 osd.1141 2986675 handle_osd_map  got inc map for epoch 2986676
2020-02-20 17:25:35.913 7f2ae7871700 10 osd.1141 2986675 note_peer_epoch osd.1021 has 2986676
2020-02-20 17:25:35.913 7f2ae7871700 10 osd.1141 2986675 note_peer_epoch osd.632 has 2986676
2020-02-20 17:25:35.919 7f2ae7871700 10 osd.1141 2986675  new session 0x5627ee8cce00 con=0x5627fc052c00 addr=137.138.24.143:6825/2923257
2020-02-20 17:25:35.919 7f2ae7871700 10 osd.1141 2986675  session 0x5627ee8cce00 osd.632 has caps osdcap[grant(*)] 'allow *'
2020-02-20 17:25:35.919 7f2ae7871700 10 osd.1141 2986675  session 0x5627ee8cce00 osd.632 has caps osdcap[grant(*)] 'allow *'
2020-02-20 17:25:35.919 7f2ae7871700 10 osd.1141 2986675  session 0x5627ef372600 osd.632 has caps osdcap[grant(*)] 'allow *'
2020-02-20 17:25:35.950 7f2acbd41700 10 osd.1141 2986675 dequeue_op 0x5627e2246f00 prio 63 cost 8192 latency 0.000066 osd_op(client.248896512.0:90813362 75.d18 75.6bdded18 (undecoded) ondisk+write+known_if_redirected e2986676) v8 pg pg[75.d18( v 2986667'8894365 (2981644'8891282,2986667'8894365] local-lis/les=2984241/2984242 n=7730 ec=136149/136149 lis/c 2984241/2984241 les/c/f 2984242/2984242/0 2984241/2984241/2984241) [1141,1198,622] r=0 lpr=2984241 crt=2986667'8894365 lcod 2986663'8894364 mlcod 2986663'8894364 active+clean]
2020-02-20 17:25:35.950 7f2acbd41700 10 osd.1141 2986675 dequeue_op 0x5627e2246f00 finish
2020-02-20 17:25:35.955 7f2ad8eb5700 10 osd.1141 2986675 add_map_bl 2986675 615877 bytes
2020-02-20 17:25:35.961 7f2ad8eb5700 -1 *** Caught signal (Aborted) **
 in thread 7f2ad8eb5700 thread_name:ms_dispatch

 ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
 1: (()+0xf5f0) [0x7f2aec2dc5f0]
 2: (gsignal()+0x37) [0x7f2aeb2fc337]
 3: (abort()+0x148) [0x7f2aeb2fda28]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2aebc0c7d5]
 5: (()+0x5e746) [0x7f2aebc0a746]
 6: (()+0x5e773) [0x7f2aebc0a773]
 7: (()+0x5e993) [0x7f2aebc0a993]
 8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7f2aef52f68e]
 9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f2aef530e31]
 10: (OSD::handle_osd_map(MOSDMap*)+0x1c2d) [0x5627d51272dd]
 11: (OSD::_dispatch(Message*)+0xa1) [0x5627d51313f1]
 12: (OSD::ms_dispatch(Message*)+0x56) [0x5627d5131746]
 13: (DispatchQueue::entry()+0xb7a) [0x7f2aef42924a]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f2aef4c74bd]
 15: (()+0x7e65) [0x7f2aec2d4e65]
 16: (clone()+0x6d) [0x7f2aeb3c488d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

and that's where we crash. I'm pretty sure it's here:
      OSDMap *o = new OSDMap;
      if (e > 1) {
        bufferlist obl;
        bool got = get_map_bl(e - 1, obl);
        if (!got) {
          auto p = added_maps_bl.find(e - 1);
          assert(p != added_maps_bl.end());
          obl = p->second;
        }
        o->decode(obl);    <------ HERE
      }

so AFAICS reading the just-written data is where we first see a problem. this is more consistent with the second crash, where we also see bad data after restarting the OSD.

so... somehow bad data was written, even though the handle_osd_map apply_incremental and full encode generated something that passed the crc check.

Actions #8

Updated by Troy Ablan about 4 years ago

This is continuing to happen for us. Log file here.

ceph-post-file: 589aa7aa-7a80-49a2-ba55-376e467c4550
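(For reference, that ID comes from uploading the log with the ceph-post-file utility, roughly as below; the path is just an example.)

ceph-post-file /var/log/ceph/ceph-osd.NN.log
# prints the UUID to share with the developers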

Actions #9

Updated by Erik Lindahl about 4 years ago

Just to provide the same update I gave to Dan van der Ster over email:

IIRC, we saw this 1-2 times more after the first instance, but at some point we updated to Nautilus, and the last 6-8 months or so we haven't seen it anymore. That could of course be a spurious correlation, but I figured it might help the debugging.

Actions #10

Updated by Dan van der Ster about 4 years ago

Found something. The crc32c for all my good maps is FFFFFFFF (and I assure you they are different maps.. gsutil output below with md5sum for comparison).
All of these maps crashed an osd once persisted:

~/corr $ gsutil hash -h 2982808 2982809 2982810 2982811 2982812 2982813 2982814 2982815 2982816 2982817 2986674 2986675

Hashes [hex] for 2982808:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        e407b94ece5568942904ef32c2a83e9b
Hashes [hex] for 2982809:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        363a2e06027ab1d16c1f096c8fccbbbe
Hashes [hex] for 2982810:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        6ab5eecdec83253e69a29c752fca2a09
Hashes [hex] for 2982811:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        f4b33f522bbf6fe5628811684d4e7c0e
Hashes [hex] for 2982812:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        d5d30286fed9c198e03945d6dd639294
Hashes [hex] for 2982813:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        4d971bdc932b8ad6eeadbaedc0b571c4
Hashes [hex] for 2982814:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        5e771e1e70b148e0ce3a7f5fb61248e7
Hashes [hex] for 2982815:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        0f0a5760d25c13d33dbb381626536451
Hashes [hex] for 2982816:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        6a7215d4c6d442bab8510215c963e85a
Hashes [hex] for 2982817:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        72d79a96834467e459e8e80282fef481
Hashes [hex] for 2986674:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        da01fbd94305ce986d8eb4e57d132f7f
Hashes [hex] for 2986675:
    Hash (crc32c):        FFFFFFFF
    Hash (md5):        7669c8287c2de115aa6f6863468d8abd

Operation completed over 12 objects/7.0 MiB.                                     
Actions #11

Updated by Dan van der Ster about 4 years ago

^^ That is a weird red herring. The FFFFFFFF appears because the osdmap contains its own crc32c in the last 4 bytes, so the checksum cancels itself out (with the init val 0).

Actions #12

Updated by Dan van der Ster about 4 years ago

o->decode(obl); <------ HERE

I have gdb working now on a coredump, so I can confirm that:

#0  0x00007f07b52214bb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:36
#1  0x0000562673d8c2ad in reraise_fatal (signum=6) at /usr/src/debug/ceph-13.2.7/src/global/signal_handler.cc:80
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-13.2.7/src/global/signal_handler.cc:292
#3  <signal handler called>
#4  0x00007f07b4241337 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#5  0x00007f07b4242a28 in __GI_abort () at abort.c:90
#6  0x00007f07b4b517d5 in __gnu_cxx::__verbose_terminate_handler ()
    at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#7  0x00007f07b4b4f746 in __cxxabiv1::__terminate (handler=<optimized out>)
    at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
#8  0x00007f07b4b4f773 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00007f07b4b4f993 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x562678cc0be0, 
    tinfo=0x7f07b8a91e88 <typeinfo for ceph::buffer::malformed_input>, 
    dest=dest@entry=0x7f07b82639c0 <ceph::buffer::malformed_input::~malformed_input()>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:87
#10 0x00007f07b847468e in OSDMap::decode (this=this@entry=0x5626d203c400, bl=...)
    at /usr/src/debug/ceph-13.2.7/src/osd/OSDMap.cc:3070
#11 0x00007f07b8475e31 in OSDMap::decode (this=this@entry=0x5626d203c400, bl=...)
    at /usr/src/debug/ceph-13.2.7/src/osd/OSDMap.cc:2773
#12 0x000056267383a24d in OSD::handle_osd_map (this=this@entry=0x56267684c000, m=m@entry=0x5626e07ef180)
    at /usr/src/debug/ceph-13.2.7/src/osd/OSD.cc:7719
#13 0x0000562673844361 in OSD::_dispatch (this=this@entry=0x56267684c000, m=m@entry=0x5626e07ef180)
    at /usr/src/debug/ceph-13.2.7/src/osd/OSD.cc:7091
#14 0x00005626738446b6 in OSD::ms_dispatch (this=0x56267684c000, m=0x5626e07ef180)
    at /usr/src/debug/ceph-13.2.7/src/osd/OSD.cc:6763
#15 0x00007f07b836e24a in ms_deliver_dispatch (m=0x5626e07ef180, this=0x5626763b0700)
    at /usr/src/debug/ceph-13.2.7/src/msg/Messenger.h:667
#16 DispatchQueue::entry (this=0x5626763b0858) at /usr/src/debug/ceph-13.2.7/src/msg/DispatchQueue.cc:201
#17 0x00007f07b840c4bd in DispatchQueue::DispatchThread::entry (this=<optimized out>)
    at /usr/src/debug/ceph-13.2.7/src/msg/DispatchQueue.h:101
#18 0x00007f07b5219e65 in start_thread (arg=0x7f07a1dfc700) at pthread_create.c:307
#19 0x00007f07b430988d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Some info:

(gdb) f 10
#10 0x00007f07b847468e in OSDMap::decode (this=this@entry=0x5626d203c400, bl=...)
    at /usr/src/debug/ceph-13.2.7/src/osd/OSDMap.cc:3070
3070          throw buffer::malformed_input(s.c_str());
(gdb) list
3065        }
3066        if (crc != actual) {
3067          ostringstream ss;
3068          ss << "bad crc, actual " << actual << " != expected " << crc;
3069          string s = ss.str();
3070          throw buffer::malformed_input(s.c_str());
3071        }
3072      }
3073    
3074      post_decode();
(gdb) p epoch
$1 = 2982809
(gdb) p s
$2 = "bad crc, actual 822724616 != expected 2334082500" 
(gdb) up
#11 0x00007f07b8475e31 in OSDMap::decode (this=this@entry=0x5626d203c400, bl=...)
    at /usr/src/debug/ceph-13.2.7/src/osd/OSDMap.cc:2773
2773      decode(p);
(gdb) up
#12 0x000056267383a24d in OSD::handle_osd_map (this=this@entry=0x56267684c000, m=m@entry=0x5626e07ef180)
    at /usr/src/debug/ceph-13.2.7/src/osd/OSD.cc:7719
7719        o->decode(obl);
(gdb) list
7714        if (!got) {
7715          auto p = added_maps_bl.find(e - 1);
7716          assert(p != added_maps_bl.end());
7717          obl = p->second;
7718        }
7719        o->decode(obl);
7720          }
7721    
7722          OSDMap::Incremental inc;
7723          bufferlist::iterator p = bl.begin();
(gdb) 

BTW I had forgotten that we compute crc differently from off-the-shelf tools (https://tracker.ceph.com/issues/13713#note-6)

# ./ceph-crc32c.py osdmap.2982809.good osdmap.2982809.bad
osdmap.2982809.good     calculated_crc: 2334082500     tail_crc: 2334082500
osdmap.2982809.bad     calculated_crc: 822724616     tail_crc: 2334082500

So the osdmap I extracted with ceph-objectstore-tool is identical to the one in this coredump.
Now, how to step back in time to see how 2982809 was corrupted before being written.

As suggested, we have set `osd_map_dedup = false` and haven't seen another crash since Thursday.
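For the record, roughly how we applied it (exact syntax may vary by release):

# persist it in the config database (mimic and later)
ceph config set osd osd_map_dedup false

# and push it to the running daemons
ceph tell osd.* injectargs '--osd_map_dedup=false'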

Actions #13

Updated by Dan van der Ster about 4 years ago

A bit more about our incident, for the record.

The cluster has 1301 osds in total: 752 filestore and 549 bluestore. The filestore osds have data on hdd with an ssd journal; the bluestore osds have data on hdd and block.db on ssd. There are two models of SSD in the cluster: SAMSUNG MZ7KM240 (SM863a) and INTEL SSDSC2BA20 (DC S3700).

Timeline:

  • 2020-02-20 10:10:35.012: 1291 osds were in, then the OSDMap::decode started crashing osds.
  • 2020-02-20 10:11:01.270: 927 osds were in. No more osds crashed.
  • 2020-02-20 13:06:43.862: we started booting osds again, completing by around 19:00.

So 364 OSDs crashed in OSDMap::decode. I didn't find a single filestore osd that crashed; only bluestore was affected.

And it seems that all the corruptions were indeed identical across epochs. Using bsdiff/bspatch I applied the same bit-flips to good osdmaps and found the resulting broken CRCs in the crash logs. E.g.

osdmap.2982810.bad     calculated_crc: 1587500289     tail_crc: 3834239181
osdmap.2982811.bad     calculated_crc: 3112234676     tail_crc: 60187512
osdmap.2982812.bad     calculated_crc: 296463260     tail_crc: 2881300048
---
Feb 20 10:22:08 X ceph-osd[2363469]: what(): buffer::malformed_input: bad crc, actual 3112234676 != expected 60187512
Feb 20 10:22:10 X ceph-osd[2363736]: what(): buffer::malformed_input: bad crc, actual 296463260 != expected 2881300048
Feb 20 10:22:12 X ceph-osd[2363876]: what(): buffer::malformed_input: bad crc, actual 1587500289 != expected 3834239181

Is there any mechanism by which an ssd, memory, or packet corruption could have propagated errors in this way across hosts affecting only bluestore? Or is the evidence strongly pointing at a bluestore issue?

Sage pointed out that https://tracker.ceph.com/issues/43903 might be relevant. I'm working on a backport to nautilus (mimic is more difficult): https://github.com/ceph/ceph/pull/33483

I also found `ed6bc9f624 OSD: OSDMapRef access by multiple threads is unsafe`, which isn't in mimic. But that seems related to a scrubbing crash rather than to a source of osdmap corruption.

Actions #14

Updated by Dan van der Ster about 4 years ago

More ftr: the corruption occurs in the crush part of the osdmap:

 diff <(hexdump -vC 2982813.crush.good) <(hexdump -vC 2982813.crush.bad)
9c9
< 00000080  01 00 05 00 30 0a 83 00  18 00 00 00 00 00 00 00  |....0...........|
---
> 00000080  01 00 05 00 30 0a 83 00  18 00 00 00 00 01 00 00  |....0...........|
11c11
< 000000a0  ec 00 00 00 00 01 00 00  13 01 00 00 55 01 00 00  |............U...|
---
> 000000a0  ec 00 00 00 01 01 00 00  13 01 00 00 55 01 00 00  |............U...|
482,483c482,483
< 00001e10  18 00 00 00 00 00 00 00  09 00 00 00 11 00 00 00  |................|
< 00001e20  d4 00 00 00 e1 00 00 00  ec 00 00 00 00 01 00 00  |................|
---
> 00001e10  18 00 00 00 00 01 00 00  09 00 00 00 11 00 00 00  |................|
> 00001e20  d4 00 00 00 e1 00 00 00  ec 00 00 00 01 01 00 00  |................|

$ diff <(crushtool -d 2982813.crush.good) <(crushtool -d 2982813.crush.bad)
1559c1559
<     item osd.0 weight 5.460
---
>     item osd.256 weight 5.460
1565c1565
<     item osd.256 weight 5.460
---
>     item osd.257 weight 5.460
Actions #15

Updated by Brad Hubbard about 4 years ago

  • Related to Bug #41240: All of the cluster SSDs aborted at around the same time and will not start. added
Actions #16

Updated by Dan van der Ster about 4 years ago

Might be a stretch, but I just noticed that our bits are flipped near the 128k boundary, which is (coincidentally?) the value of "bluestore_compression_min_blob_size_hdd": "131072".
We've had lz4 aggressively enabled on this cluster since December. Troy and Erik, were/are you using bluestore compression?
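For anyone wanting to check their own cluster, this is roughly how to see whether bluestore compression / lz4 is in play (option and pool property names as in mimic/nautilus):

# per-OSD bluestore settings, via the admin socket on the OSD host
ceph daemon osd.NN config get bluestore_compression_algorithm
ceph daemon osd.NN config get bluestore_compression_mode

# per-pool overrides
ceph osd pool get <pool> compression_algorithm
ceph osd pool get <pool> compression_mode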

Actions #17

Updated by Dan van der Ster about 4 years ago

I got confirmation from Troy and Erik -- both are using lz4 compression like us.

I'm trying to reproduce using unittest_compressor.
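For reference, the compressor unit tests can be run from a build tree with something like the following (binary name and gtest filter are my best guess, so treat this as a sketch):

cd build && ./bin/unittest_compressor --gtest_filter='*lz4*:*LZ4*'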

Actions #18

Updated by Erik Lindahl about 4 years ago

Hi,

Oops; Sorry Dan, but I just realised I misled you. While we do have aggressive LZ4 enabled by default (in particular for all large HDD storage pools), the particular SSDs on which we store the CephFS metadata do not have any compression enabled - and that's where we saw the crash.

I'm fairly certain we didn't change that since the crash, so in that case it shouldn't be due to the compression.

Actions #19

Updated by Dan van der Ster about 4 years ago

Erik Lindahl wrote:

Hi,

Oops; Sorry Dan, but I just realised I misled you. While we do have aggressive LZ4 enabled by default (in particular for all large HDD storage pools), the particular SSDs on which we store the CephFS metadata do not have any compression enabled - and that's where we saw the crash.

I'm fairly certain we didn't change that since the crash, so in that case it shouldn't be due to the compression.

Are you sure your SSDs weren't compressing at the time? We now have a reproducer with old-ish lz4 libraries.

Actions #20

Updated by Erik Lindahl about 4 years ago

No, not positive. Very early on we did play around with compression also for the metadata, but in the end decided latency was more important there. I simply don't recall if we changed that before or after this bug.

Actions #21

Updated by Sage Weil about 4 years ago

  • Subject changed from Several Bluestore OSDs crashing during startup in OSDMap::decode after taking new OSDs online to lz4 compressor corrupts data when buffers are unaligned
  • Status changed from Need More Info to Fix Under Review
  • Backport set to nautilus,mimic
  • Pull request ID set to 33584

Thanks to Dan we have a reproducer! I cleaned it up a bit, rebased on master, added a workaround to the LZ4 plugin, and the problem is "fixed".

https://github.com/ceph/ceph/pull/33584
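Until the backports land, one possible interim mitigation (a suggestion, not part of the fix itself) is to steer bluestore away from lz4, e.g.:

ceph config set osd bluestore_compression_algorithm snappy

# or disable bluestore compression entirely
ceph config set osd bluestore_compression_mode none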

Actions #22

Updated by Kefu Chai about 4 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #23

Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #44489: mimic: lz4 compressor corrupts data when buffers are unaligned added
Actions #24

Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #44490: nautilus: lz4 compressor corrupts data when buffers are unaligned added
Actions #25

Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
