Support #49847
Status: Closed
OSD fails to init after upgrading to octopus: _deferred_replay failed to decode deferred txn
Description
An OSD fails to start after upgrading from mimic 13.2.2 to octopus 15.2.9.
It seems that BlueStore fails first:
-1 bluestore(/var/lib/ceph/osd/ceph-5) _deferred_replay failed to decode deferred txn 0x0000000000000002
This causes RocksDB to shut down, BlueFS to unmount, and the OSD init to fail:
-1 osd.5 0 OSD:init: unable to mount object store
-1 ** ERROR: osd init failed: (5) Input/output error
The other five nodes and OSDs upgraded without issues. The node hosting this OSD got stuck while upgrading the OS with do-release-upgrade from Ubuntu 16 to 18, which caused some dependency problems that were later resolved. Initially I thought this had somehow messed up the Ceph packages or configuration, since ceph-osd did not start automatically after a reboot. Reinstalling the Ceph packages and scanning and activating the OSD with ceph-volume seemed to remedy that, but it led to the present problem of the OSD failing to init.
On IRC I was convinced to post here before zapping the OSD/disk and letting Ceph recover, just in case this is an actual bug rather than me missing something very obvious.
Updated by Patrick Donnelly about 3 years ago
- Project changed from CephFS to RADOS
Updated by Eetu Lampsijärvi about 3 years ago
I ended up nuking the OSD and letting the cluster recover. This workaround "solves" the problem for me; feel free to close the issue.
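For reference, the "nuke the OSD and let it recover" workflow can be sketched roughly as follows. The OSD id (5) and device path (/dev/sdb) are placeholders; adapt them to your cluster, and make sure the cluster is healthy enough to tolerate losing the OSD first.

```shell
# Mark the OSD out and let the cluster rebalance its data away:
ceph osd out 5

# Once backfill has finished, stop the daemon and remove the OSD
# from the cluster (CRUSH map, auth key, and OSD entry):
systemctl stop ceph-osd@5
ceph osd purge 5 --yes-i-really-mean-it

# Wipe the backing device and recreate the OSD from scratch:
ceph-volume lvm zap /dev/sdb --destroy
ceph-volume lvm create --data /dev/sdb
```

Recovery then repopulates the fresh OSD from the surviving replicas.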
Updated by Eetu Lampsijärvi about 3 years ago
Contrary to what I stated previously, this does not seem to be a software issue; the root cause was probably faulty RAM. A few of the errors are posted below in case someone else encounters something similar, as a reminder of the possibility of sneaky non-disk hardware failure.
Other errors such as
terminate called after throwing an instance of 'ceph::buffer::v15_2_0::malformed_input'
what(): buffer::malformed_input: bad checksum on pg_log_entry_t
[...]
-1 ** Caught signal (Aborted) *
as well as failures to read the OSD superblock, were encountered while letting the OSD attempt to recover, before I realized a RAM module had gone bad. It seems that reinstalling the OS, reconfiguring Ceph, and zapping/preparing/activating the OSD simply did not happen to use the bad memory; only once the OSD started doing sufficiently memory-intensive operations did the faulty RAM get a chance to corrupt something, causing these errors.
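If similarly unexplained checksum or decode errors appear, testing the RAM before blaming the disk can save a lot of time. A minimal sketch using the userspace memtester tool on a Debian/Ubuntu host (the 1024M size and single pass are example values; an offline memtest86+ boot is more thorough, since it can test memory the running kernel is using):

```shell
# Install memtester and lock/test 1 GiB of RAM for one pass.
# Requires root to mlock the memory; nonzero exit means failures.
apt-get install memtester
memtester 1024M 1
```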