Support #49847

OSD Fails to init after upgrading to octopus: _deferred_replay failed to decode deferred txn

Added by Eetu Lampsijärvi almost 3 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

An OSD fails to start after upgrading from mimic 13.2.2 to octopus 15.2.9.

BlueStore seems to fail first, when it cannot decode a deferred transaction during replay:
-1 bluestore(/var/lib/ceph/osd/ceph-5) _deferred_replay failed to decode deferred txn 0x0000000000000002

This causes RocksDB to shut down and BlueFS to unmount, after which the OSD init fails:
-1 osd.5 0 OSD:init: unable to mount object store
-1 ** ERROR: osd init failed: (5) Input/output error

The other five nodes and their OSDs upgraded without issues. The node in question got stuck while upgrading the OS from Ubuntu 16 to 18 with do-release-upgrade, which caused some dependency problems; these were resolved. Initially I thought this had somehow messed up the Ceph packages or configuration, as ceph-osd didn't start automatically after a reboot. Reinstalling the Ceph packages and scanning and activating the OSD with ceph-volume seemed to remedy that, but the OSD now fails to init as described above.

I was convinced on IRC to try posting here instead of zapping the osd/disk and letting ceph recover, just in case it's an actual bug and not me missing something very obvious.

Attachment: ceph-osd.5.log - OSD log file with multiple restart attempts (168 KB), Eetu Lampsijärvi, 03/16/2021 07:46 PM

History

#1 Updated by Patrick Donnelly almost 3 years ago

  • Project changed from CephFS to RADOS

#2 Updated by Eetu Lampsijärvi almost 3 years ago

I ended up nuking the OSD and letting it recover. This workaround "solves" the problem for me; feel free to close the issue.

#3 Updated by Eetu Lampsijärvi almost 3 years ago

Contrary to what I stated previously, this does not seem to be a software issue. The root cause was probably faulty RAM. In case someone else encounters similar errors, a few are posted below as a reminder of the possibility of sneaky non-disk hardware failure.

Other errors such as
terminate called after throwing an instance of 'ceph::buffer::v15_2_0::malformed_input'
what(): buffer::malformed_input: bad checksum on pg_log_entry_t
[...]
-1 ** Caught signal (Aborted) *

or failures to read the OSD superblock were encountered while the OSD attempted to recover, before I realized that a RAM module had gone bad. It seems that reinstalling the OS, reconfiguring Ceph, and zapping/preparing/activating an OSD did not happen to touch the bad memory; only once the OSD started doing sufficiently memory-intensive operations did the faulty RAM have a chance to corrupt something somewhere, causing these errors.
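
For anyone triaging a similar log, here is a minimal sketch of a scanner that counts how often the error messages quoted in this issue appear. The function name `count_signatures` and the `SIGNATURES` table are hypothetical helpers, not part of Ceph, and the patterns cover only the messages seen here. A hit on several unrelated signatures may be a hint to test the hardware, but it proves nothing by itself.

```python
import re

# Error messages quoted in this issue. Illustrative only, not an
# exhaustive list of Ceph corruption symptoms.
SIGNATURES = {
    "deferred_replay": re.compile(r"_deferred_replay failed to decode deferred txn"),
    "mount_failed": re.compile(r"unable to mount object store"),
    "osd_init_failed": re.compile(r"osd init failed"),
    "malformed_input": re.compile(r"buffer::malformed_input"),
}


def count_signatures(lines):
    """Return a dict mapping each signature name to its number of matching lines."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

It can be run over a log with something like `count_signatures(open("ceph-osd.5.log", errors="replace"))`, where the file name is the attachment from this issue.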

#4 Updated by Igor Fedotov almost 3 years ago

  • Status changed from New to Closed
