Support #49847: OSD Fails to init after upgrading to octopus: _deferred_replay failed to decode deferred txn - RADOS - Ceph


Added by Eetu Lampsijärvi about 3 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Category:
Target version:
% Done:
Tags:
Reviewed:
Affected Versions: Ceph - v15.2.9
Component(RADOS):
Pull request ID:

Description

An OSD fails to start after upgrading from mimic 13.2.2 to octopus 15.2.9.

BlueStore appears to fail first, while replaying deferred transactions:
-1 bluestore(/var/lib/ceph/osd/ceph-5) _deferred_replay failed to decode deferred txn 0x0000000000000002

This causes RocksDB to shut down and BlueFS to unmount, after which OSD init fails:
-1 osd.5 0 OSD:init: unable to mount object store
-1 ** ERROR: osd init failed: (5) Input/output error

The other five nodes and their OSDs upgraded without issues. The node in question got stuck while upgrading the OS with do-release-upgrade from Ubuntu 16 to 18, which caused some dependency problems; these were resolved. Initially I thought this had somehow broken the Ceph packages or configuration, because ceph-osd did not start automatically after a reboot. Reinstalling the Ceph packages and then scanning and activating the OSD with ceph-volume seemed to remedy that, which leaves the present problem of the OSD failing to init.

I was convinced on IRC to try posting here instead of zapping the osd/disk and letting ceph recover, just in case it's an actual bug and not me missing something very obvious.

Files

ceph-osd.5.log (168 KB)

OSD Log file with multiple restart attempts

Eetu Lampsijärvi, 03/16/2021 07:46 PM


Updated by Patrick Donnelly about 3 years ago

Project changed from CephFS to RADOS


Updated by Eetu Lampsijärvi about 3 years ago

Ended up nuking the OSD & letting it recover - this workaround "solves" the problem for me; feel free to close the issue.


Updated by Eetu Lampsijärvi about 3 years ago

Contrary to what I stated previously, this does not seem to be a software issue. The root cause was probably faulty RAM. In case someone else encounters similar errors, a few are posted below as a reminder that a subtle non-disk hardware failure is possible.

Other errors such as
terminate called after throwing an instance of 'ceph::buffer::v15_2_0::malformed_input'
what(): buffer::malformed_input: bad checksum on pg_log_entry_t
[...]
-1 ** Caught signal (Aborted) *

or failures to read the OSD superblock were encountered while the OSD attempted to recover, before I realized a RAM module had gone bad. It seems that reinstalling the OS, reconfiguring Ceph, and zapping, preparing, and activating the OSD did not happen to touch the bad memory. Only once the OSD started doing sufficiently memory-intensive operations did the faulty RAM get a chance to corrupt something, causing these errors.
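
For anyone comparing their own logs against the ones quoted in this ticket, here is a minimal sketch of a script that counts these error signatures in an OSD log. The patterns are copied from the excerpts above; the labels and function names are made up for illustration, and the list is not an exhaustive catalogue of corruption-related messages. A match only says the log looks similar. It cannot tell bad RAM apart from a bad disk or an actual bug.

```python
import re
import sys
from collections import Counter

# Patterns copied from the log excerpts in this ticket.
# The label names on the left are hypothetical, chosen for readability.
SIGNATURES = {
    "deferred_replay_decode": re.compile(r"_deferred_replay failed to decode deferred txn"),
    "mount_failed": re.compile(r"unable to mount object store"),
    "osd_init_failed": re.compile(r"ERROR: osd init failed"),
    "malformed_input": re.compile(r"malformed_input"),
    "bad_checksum": re.compile(r"bad checksum on"),
}


def scan_log_lines(lines):
    """Count how many lines match each signature (a line may match several)."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def scan_log_file(path):
    """Scan one log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return scan_log_lines(handle)


if __name__ == "__main__":
    # Usage: python scan_osd_log.py /var/log/ceph/ceph-osd.5.log
    for log_path in sys.argv[1:]:
        for label, count in sorted(scan_log_file(log_path).items()):
            print(f"{log_path}: {label}: {count}")
```

If several unrelated signatures show up together (decode failures, bad checksums, superblock read errors), that is a hint to check hardware other than the disk before assuming a Ceph bug.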
