Bug #39525

Several Bluestore OSDs crashing during startup in OSDMap::decode after taking new OSDs online

Added by Erik Lindahl 4 months ago. Updated 4 months ago.

Need More Info
Target version:
Start date:
Due date:
% Done:


Community (user)
3 - minor
Affected Versions:
Pull request ID:


In conjunction with taking a new storage server online we observed that 5 out of the 6 SSDs we use to store metadata (with 3-fold replication) went offline, and crash every time they try to restart with an abort signal in OSDMap::decode, which means our CephFS storage is no longer available.

Here is a brief excerpt from the log:

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/] Recovered from manifest file:db/MANIFEST-001493 succeeded,manifest_file_number is 1493, next_file_number is 1496, last_sequence is 45904669, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 1491

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/] Column family [default] (ID 0), log number is 1492

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1556294168125624, "job": 1, "event": "recovery_started", "log_files": [1494]}
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/] Recovering log #1494 mode 0
2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/] Creating manifest 1496

2019-04-26 17:56:08.123 7f4f2956ae00 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1556294168126875, "job": 1, "event": "recovery_finished"}
2019-04-26 17:56:08.127 7f4f2956ae00 4 rocksdb: [/build/ceph-13.2.5/src/rocksdb/db/] DB pointer 0x5634c2f60000
2019-04-26 17:56:08.127 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2019-04-26 17:56:08.135 7f4f2956ae00 1 freelist init
2019-04-26 17:56:08.143 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_alloc opening allocation metadata
2019-04-26 17:56:08.147 7f4f2956ae00 1 bluestore(/var/lib/ceph/osd/ceph-126) _open_alloc loaded 223 GiB in 233 extents
2019-04-26 17:56:08.151 7f4f2956ae00 -1 ** Caught signal (Aborted) *
in thread 7f4f2956ae00 thread_name:ceph-osd

ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
1: (()+0x92b730) [0x5634c0151730]
2: (()+0x12890) [0x7f4f1f02b890]
3: (gsignal()+0xc7) [0x7f4f1df06e97]
4: (abort()+0x141) [0x7f4f1df08801]
5: (()+0x8c8b7) [0x7f4f1e8fb8b7]
6: (()+0x92a06) [0x7f4f1e901a06]
7: (()+0x92a41) [0x7f4f1e901a41]
8: (()+0x92c74) [0x7f4f1e901c74]
9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1864) [0x7f4f20aff694]
10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f4f20b00af1]
11: (OSDService::try_get_map(unsigned int)+0x508) [0x5634bfbf73a8]
12: (OSDService::get_map(unsigned int)+0x1e) [0x5634bfc56ffe]
13: (OSD::init()+0x1d5f) [0x5634bfc048ef]
14: (main()+0x383d) [0x5634bfaef8cd]
15: (__libc_start_main()+0xe7) [0x7f4f1dee9b97]
16: (_start()+0x2a) [0x5634bfbb97aa]

This might have happened due to high server load when a number of other OSDs came online (but none of the new ones were in the "ssd" class, so it should not have had any direct impact on backfilling of these OSDs).

The SSDs in question are a mix of Samsung SM863/883, 256GB. All Ceph components are 13.2.5, running on Ubuntu 18.04, dual socket Supermicro servers (2*8 cores, 64GB ECC memory) with LSI3008IT controllers in JBOD mode.

We have seen this at least once in the past, and I suspect it might be related to high load (?) in the servers when lots of PGs are peering and/or large amounts of backfilling happens. In that case it was only a single disk, so we "fixed" it by just recreating that OSD - but this time we need to get them working to avoid losing metadata PGs.

We had another OSD fail simultaneously (with the same error) that did not contain any important data at all, which enabled us to play around a bit with ceph-objectstore-tool. Based on the errors above we tested replacing the osdmap with one from a working OSD, but even after this we get an error when trying to read the osdmap again:

root@storage05:/var/log/ceph# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-55 --op get-osdmap --file=/tmp/osdmap
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
what(): buffer::malformed_input: bad crc, actual 3828477398 != expected 3773790681
  • Caught signal (Aborted) *
    in thread 7f7611698f80 thread_name:ceph-objectstor
    ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
    1: (()+0x9895c0) [0x562e4aa3d5c0]
    2: (()+0x12890) [0x7f760715c890]
    3: (gsignal()+0xc7) [0x7f7606033e97]
    4: (abort()+0x141) [0x7f7606035801]
    5: (()+0x8c8b7) [0x7f7606a288b7]
    6: (()+0x92a06) [0x7f7606a2ea06]
    7: (()+0x92a41) [0x7f7606a2ea41]
    8: (()+0x92c74) [0x7f7606a2ec74]
    9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1864) [0x7f7607af2694]
    10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f7607af3af1]
    11: (get_osdmap(ObjectStore
    , unsigned int, OSDMap&, ceph::buffer::list&)+0x1e5) [0x562e4a473655]
    12: (main()+0x39be) [0x562e4a3a43fe]
    13: (__libc_start_main()+0xe7) [0x7f7606016b97]
    14: (_start()+0x2a) [0x562e4a4727aa]
    Aborted (core dumped)

Finally, we used ceph-objectstore-tool to try and export the specific PGs that are offline. This works for some of the PGs, but for others I get the same abort message as above. Irritatingly enough, there is only 1 single PG that we were not able to export from any of the servers, which means we still have a strong incentive to work around this.

And, of course, our other reason for wanting to fix this properly is that we're a bit worried if it's a bug that can cause a chain-effect among OSDs leading to instant unavailability and potential data loss even with replicated storage.

Our next step is to play around with the source and check if it's just a matter of a CRC value not being updated, or if it's some other osdmap-related data structure that has been corrupted.


#1 Updated by Erik Lindahl 4 months ago

I might have gotten slightly further.

1) On one of the the broken OSDs, the current_epoch is 34626 (clean_thru is also 34626, while last_epoch_mounted is 34561).

2) When I try to extract the osdmap for the same epoch (34626) from a working OSD, I get the same type of abort, which indicates that osdmap epoch itself is broken on (all?) OSDs, rather than just one OSD. Later epochs extract just fine.

3) However, if I try to extract the next epoch (34626) and set that on the broken OSD, I get a message about the version (obviously) not existing:

osdmap (#-1:94e8eb72:::osdmap.34627:0#) does not exist.

I guess this might restrict the problem to simply fixing the osdmap (in the right version?) for the broken OSDs, but right now I have no idea how to achieve that ;-)

#2 Updated by Greg Farnum 4 months ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to Correctness/Safety
  • Component(RADOS) OSD added

Hmm are the working and broken OSDs actually running the same binary version? It should work anyway but a bug around that is the only plausible cause I can think of for a map to work on one host but fail when it's transplanted to the other.

#3 Updated by Erik Lindahl 4 months ago

Hi Greg,

That might actually explain how it happened originally; we auto-deploy hosts with salt, and noticed that the new host came up with 13.2.5 instead of 13.2.4, which made us upgrade the others to 13.2.5 too. On the other hand, it would be difficult to atomically update all hosts at the very same time without taking the cluster down :-)

The good news is that I think managed to solve it after a bit more hacking. The downed OSDs had slightly different versions of the osdmap as their latest active one. Extracting that version with "ceph osd getmap <epoch>" worked fine, and then I could set it correctly - but for most of the OSDs that was not sufficient. Then I decided to simply extract the last ~20 versions of the osdmap and manually apply all of them to each OSD that refused to start - and that fixed it.

I'm not skilled enough in the Ceph internals to suggest any of that should be done automatically (e.g. by asking the monitor for a newer map?), but it might help to add a log entry about the epoch of the osdmap the OSD is trying to extract - so that when it crashes it is clearer what version of the map one has to set manually?

In any case, I think we can close the immediate bug, although I figured it might be good to have a more extended description here in case anybody else has the same problem. The key is that all the data is actually perfectly fine - one just has to patch several versions of the osdmap to get the OSD to start - and at least for me the trick was to use "ceph osd getmap <epoch>" to get a good version instead of trying to extract it from other OSDs with ceph-objectstore-tool.

#4 Updated by Greg Farnum 4 months ago

  • Status changed from New to Need More Info

Was the problem 1) that different OSDs needed different maps, and they had mismatched CRCs when exported from a 13.2.4 OSD to a 13.2.5 one, but exporting from the monitors worked, or 2) that OSDs were actually missing maps they expected to have committed on disk?

It sounds like maybe you're saying the first, which is concerning because it implies there's some kind of OSDMap encoding issue between the point releases, but I don't see any plausible way for that to have gone wrong in skimming through the source and we haven't seen similar reports.

#5 Updated by Erik Lindahl 4 months ago

The OSDs definitely had objects corresponding to the maps, but they failed the CRC check when trying to read them. All this happened before I moved anything manually. However, when I extracted the maps from the monitor instead they were in good shape.

So, No: I don't think there is any encoding issue at all. This was more a case of the logs not explicitly saying what epoch of the map needed to be replaced, combined with the user being reluctant to play around too much to avoid nuking all the metadata for ~150TB of storage (you can read that as a euphemism for PEBKAC).

In hindsight, given the fixes that worked, I suspect something like this happened. Caveat: Major speculation mode, since I'm not yet familiar with Ceph internal code, but we'll get there in the future!

1) When taking a new server with ~10 large disks online, backfilling started immediately, and one or more other servers got very high loads. Since the SSDs in particular were capable of moving several 10k's of objects per second, the CPU/memory might not have been able to keep up, and one or more of them crashed - no big deal.

2) However, this in turn might have caused even more data motion that caused other SSD OSDs to crash too, creating a bit of a chain effect, and suddenly five out of six had crashed. The last one then survived because it did not have any other disk to peer with.

3) During the crashes, the latest OSDMap stored on the OSD might have become corrupt (no idea if/when/how that can happen).

4) Since the OSDs did not fail at exactly the same point, there were a handful of different epoch corrupt maps. For a few of the OSDs this was not only limited to the last OSDMap, but also ones a few epochs old were broken - and I guess those were needed either to replay a journal or to peer when the server started.

5) Since the error message was still the same (due to an older map on the OSD also being bad), that first gave the impression that we didn't fix anything by just replacing the last-epoch map that was bad - but when I went back 20 epochs (which might have been overkill) it finally worked.

Also available in: Atom PDF