Bug #46847 (open): Loss of placement information on OSD reboot

Added by Frank Schilder over 3 years ago. Updated over 1 year ago.

Status: Need More Info
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: quincy,pacific,octopus
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During rebalancing after new disks have been added to a cluster, the cluster loses placement information when an "old" OSD reboots. This results in an unnecessary and long-lasting degradation of redundancy. See also: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/C7C4W4APGGQL7VAJP4S3KEGL4RMUN4LZ/

How to reproduce:

- Start with a healthy cluster with crush map A. All OSDs present at this time are referred to as "old" OSDs.
- Add new disks to the cluster, which will now have a new crush map B. All OSDs added in this step are referred to as "new" OSDs.
- Let the rebalancing run for a while, execute "ceph osd set norebalance" and wait for recovery I/O to stop (a command sketch follows after this list).
- The cluster should now be in HEALTH_WARN, reporting misplaced objects and the norebalance flag; there should be no other warnings.
- Stop one of the "old" OSDs and wait for peering to finish. A lot of PGs will go undersized/degraded.
- Start the OSD again and wait for peering to finish. A lot of PGs will remain undersized/degraded even though the OSD is fully restored and all the "missing" objects are actually present.
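
For reference, a minimal command sequence for the reproduction above could look like this (a sketch only; OSD id 12 and the systemd unit name are placeholders for an "old" OSD on the cluster at hand):

    # pause further data movement and wait for recovery I/O to stop
    ceph osd set norebalance
    ceph -s

    # stop one "old" OSD and wait for peering; many PGs go undersized/degraded
    systemctl stop ceph-osd@12
    ceph -s

    # start it again; the degradation persists even though the data is back
    systemctl start ceph-osd@12
    ceph pg dump_stuck undersized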

Interestingly, one can manually recover from this by:

- Move all "new" OSDs out of the crush tree to a different root, restoring the crush sub-tree of map A (sketched below).
- Wait for peering to complete. At this point, the undersized/degraded PGs should already be gone.
- Restore crush map B by moving the "new" OSDs back.
- The cluster status then recovers to the state from before shutting down the "old" OSD.
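
A minimal sketch of this manual recovery, assuming the "new" OSDs sit under their own host bucket (called node-new here, a placeholder) and the cluster uses the default root:

    # create a temporary root and move the "new" OSDs' host bucket there,
    # which restores the sub-tree of crush map A
    ceph osd crush add-bucket quarantine root
    ceph osd crush move node-new root=quarantine

    # wait for peering; the undersized/degraded PGs should clear at this point
    ceph pg dump_stuck undersized

    # restore crush map B by moving the host bucket back
    ceph osd crush move node-new root=default

If the "new" OSDs share host buckets with "old" ones, the individual OSDs have to be relocated instead of whole host buckets.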

An archive with logs demonstrating the above observations is attached to this case. The logs show the result of restarting an "old" and a "new" OSD: the cluster recovers as expected when a "new" OSD is restarted, but not when an "old" OSD is.

Hypothesis:

During a rebalancing operation as described above, computing all possible locations of objects requires more than one crush map, because some objects are still placed according to map A while others are already placed according to map B. Ceph therefore either needs to store all crush maps that still have active placements, or keep every re-mapping derived from an old crush map until either (A) the remapped PG has completed rebalancing or (B) an "old" OSD that is part of the source PG is permanently purged from the cluster; in all other circumstances such re-mappings must be preserved. It looks like these re-mappings are currently dropped the moment an OSD goes down and cannot be recovered afterwards, because crush map A is no longer available for re-computing the placements.
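
If this hypothesis is correct, the state in question should show up as temporary re-mappings (pg_temp entries) in the osdmap. A small diagnostic sketch, not part of the original report, for watching them before and after restarting an "old" OSD:

    # temporary acting sets recorded for remapped PGs
    ceph osd dump | grep pg_temp

    # up vs. acting sets of the affected PGs
    ceph pg dump pgs_brief | grep -E 'degraded|remapped'

If the pg_temp entries for the affected PGs disappear when the OSD goes down and do not return after it restarts, that would match the behaviour described above.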

An alternative would be for a booting OSD to check whether it holds objects that are on the cluster's "wanted list" and use that information to restore the re-mappings.

Workaround:

Before executing a larger rebalancing operation, create a crush map A' based on crush map A with all "new" OSDs located under a separate root, and save it somewhere. If redundancy is lost due to a reboot, a network outage, etc., use A' to perform the recovery procedure described in the second part above (see the sketch below).
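
A sketch of how such a map A' could be prepared and used with the standard crushtool workflow (file names are placeholders):

    # save crush map A before the rebalancing operation starts
    ceph osd getcrushmap -o crushmap-A.bin
    crushtool -d crushmap-A.bin -o crushmap-A.txt

    # edit crushmap-A.txt so that all "new" OSDs sit under a separate root,
    # then compile the result as A'
    crushtool -c crushmap-A.txt -o crushmap-Aprime.bin

    # in an emergency: save the current map B, inject A', let peering finish,
    # then restore B
    ceph osd getcrushmap -o crushmap-B.bin
    ceph osd setcrushmap -i crushmap-Aprime.bin
    ceph osd setcrushmap -i crushmap-B.bin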


Files

logs02.tgz (377 KB) - Logs of OSD restarts and manual rebuild of placement information. Frank Schilder, 08/06/2020 01:50 PM
sess.log (41.9 KB) - Session log with annotations. Frank Schilder, 08/11/2020 12:04 PM

Related issues (1 total: 0 open, 1 closed)

Related to RADOS - Bug #37439: Degraded PG does not discover remapped data on originating OSD (Resolved, 11/28/2018)
