OSD Fails to init after upgrading to octopus: _deferred_replay failed to decode deferred txn
An OSD fails to start after upgrading from mimic 13.2.2 to octopus 15.2.9.
BlueStore appears to fail first:
-1 bluestore(/var/lib/ceph/osd/ceph-5) _deferred_replay failed to decode deferred txn 0x0000000000000002
This causes RocksDB to shut down and BlueFS to unmount, after which the OSD init fails:
-1 osd.5 0 OSD:init: unable to mount object store
-1 ** ERROR: osd init failed: (5) Input/output error
The other five nodes and their OSDs upgraded without issues. The node in question got stuck while upgrading the OS with do-release-upgrade from Ubuntu 16 to 18, which caused some dependency problems that have since been resolved. Initially I thought this had somehow broken the Ceph packages or configuration, since ceph-osd did not start automatically after a reboot. Reinstalling the Ceph packages and scanning and activating the OSD with ceph-volume seemed to remedy that, which leaves the current problem: the OSD fails to init.
I was convinced on IRC to try posting here instead of zapping the osd/disk and letting ceph recover, just in case it's an actual bug and not me missing something very obvious.
#3 Updated by Eetu Lampsijärvi almost 3 years ago
Contrary to what I stated previously, this does not seem to be a software issue. The root cause was probably faulty RAM. In case someone else encounters similar errors, a few are posted below as a reminder that sneaky non-disk hardware failure is a possibility.
Other errors such as
terminate called after throwing an instance of 'ceph::buffer::v15_2_0::malformed_input'
what(): buffer::malformed_input: bad checksum on pg_log_entry_t
-1 ** Caught signal (Aborted) *
or failures to read the OSD superblock were encountered while letting the OSD attempt to recover, before I realized a RAM module had gone bad. It seems that reinstalling the OS, reconfiguring Ceph, and zapping/preparing/activating an OSD did not happen to touch the bad memory; only once the OSD started doing sufficiently memory-intensive operations did the faulty RAM have a chance to corrupt something, causing these errors.
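For anyone triaging a similar host, here is a minimal sketch of a log scan for the messages quoted in this report. It is not a Ceph tool: the labels, the function names, and the "two or more distinct signatures" threshold are assumptions made up for illustration, and the patterns are simply copied from the excerpts above.

```python
import re
from collections import Counter

# Patterns copied from the log excerpts in this report.
# The labels on the left are hypothetical names, not Ceph terminology.
SIGNATURES = {
    "deferred_txn_decode": re.compile(r"_deferred_replay failed to decode deferred txn"),
    "objectstore_mount": re.compile(r"unable to mount object store"),
    "bad_checksum": re.compile(r"malformed_input: bad checksum"),
}


def count_signatures(lines):
    """Count how many lines match each known error signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def distinct_signatures(counts):
    """Return the sorted labels that matched at least once."""
    return sorted(label for label, n in counts.items() if n > 0)


def worth_checking_hardware(counts, threshold=2):
    """Assumed heuristic: several unrelated error kinds on one host
    are a hint (not proof) to look beyond the disk, e.g. at RAM."""
    return len(distinct_signatures(counts)) >= threshold
```

To use it, pass the lines of an OSD log (for example from `open(path)`) to `count_signatures` and inspect the result. A positive result from `worth_checking_hardware` only suggests that a memory test may be worthwhile before zapping the OSD; it cannot distinguish bad RAM from a genuine software bug or on-disk corruption.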