Bug #46847 (open): Loss of placement information on OSD reboot

Added by Frank Schilder over 3 years ago. Updated over 1 year ago.

Status: Need More Info
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: quincy,pacific,octopus
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During rebalancing after new disks have been added to a cluster, the cluster loses placement information when an "old" OSD reboots. This results in an unnecessary and long-lasting degradation of redundancy. See also: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/C7C4W4APGGQL7VAJP4S3KEGL4RMUN4LZ/

How to reproduce:

- Start with a healthy cluster with crush map A. All OSDs present at this time are referred to as "old" OSDs.
- Add new disks to the cluster, which will now have a new crush map B. All OSDs added in this step are referred to as "new" OSDs.
- Let the rebalancing run for a while, execute "ceph osd set norebalance" and wait for recovery I/O to stop (a command sketch follows after this list).
- The cluster should now be in HEALTH_WARN, reporting misplaced objects and the norebalance flag; there should be no other warnings.
- Stop one of the "old" OSDs and wait for peering to finish. A lot of PGs will go undersized/degraded.
- Start the OSD again and wait for peering to finish. A lot of PGs will remain undersized/degraded even though the OSD is fully restored and all the "missing" objects are actually present.
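
For reference, a minimal command sequence for the reproduction above could look like this (a sketch only; OSD id 12 and the systemd unit name are placeholders for an "old" OSD on the cluster at hand):

    # pause further data movement and wait for recovery I/O to stop
    ceph osd set norebalance
    ceph -s

    # stop one "old" OSD and wait for peering; many PGs go undersized/degraded
    systemctl stop ceph-osd@12
    ceph -s

    # start it again; the degradation persists even though the data is back
    systemctl start ceph-osd@12
    ceph pg dump_stuck undersized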

Interestingly, one can manually recover from this by:

- Move all "new" OSDs out of the crush tree to a different root, restoring the crush sub-tree of map A (sketched below).
- Wait for peering to complete. At this point, the undersized/degraded PGs should already be gone.
- Restore crush map B by moving the "new" OSDs back.
- The cluster status then recovers to the state from before shutting down the "old" OSD.
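
A minimal sketch of this manual recovery, assuming the "new" OSDs sit under their own host bucket (called node-new here, a placeholder) and the cluster uses the default root:

    # create a temporary root and move the "new" OSDs' host bucket there,
    # which restores the sub-tree of crush map A
    ceph osd crush add-bucket quarantine root
    ceph osd crush move node-new root=quarantine

    # wait for peering; the undersized/degraded PGs should clear at this point
    ceph pg dump_stuck undersized

    # restore crush map B by moving the host bucket back
    ceph osd crush move node-new root=default

If the "new" OSDs share host buckets with "old" ones, the individual OSDs have to be relocated instead of whole host buckets.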

An archive with logs demonstrating the above observations is attached to this case. The logs show the result of restarting an "old" and a "new" OSD: the cluster recovers as expected when a "new" OSD is restarted, but not when an "old" OSD is.

Hypothesis:

During a rebalancing operation as described above, computing all possible locations of objects requires more than one crush map, because some objects are still placed according to map A while others are already placed according to map B. Ceph therefore either needs to store all crush maps that still have active placements, or keep every re-mapping derived from an old crush map until either (A) the remapped PG has completed rebalancing or (B) an "old" OSD that is part of the source PG is permanently purged from the cluster; in all other circumstances such re-mappings must be preserved. It looks like these re-mappings are currently dropped the moment an OSD goes down and cannot be recovered afterwards, because crush map A is no longer available for re-computing the placements.
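
If this hypothesis is correct, the state in question should show up as temporary re-mappings (pg_temp entries) in the osdmap. A small diagnostic sketch, not part of the original report, for watching them before and after restarting an "old" OSD:

    # temporary acting sets recorded for remapped PGs
    ceph osd dump | grep pg_temp

    # up vs. acting sets of the affected PGs
    ceph pg dump pgs_brief | grep -E 'degraded|remapped'

If the pg_temp entries for the affected PGs disappear when the OSD goes down and do not return after it restarts, that would match the behaviour described above.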

An alternative would be for a booting OSD to check whether it holds objects that are on the cluster's "wanted list" and use that information to restore the re-mappings.

Workaround:

Before executing a larger rebalancing operation, create a crush map A' based on crush map A with all "new" OSDs located under a separate root, and save it somewhere. If redundancy is lost due to a reboot, a network outage, etc., use A' to perform the recovery procedure described in the second part above (see the sketch below).
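
A sketch of how such a map A' could be prepared and used with the standard crushtool workflow (file names are placeholders):

    # save crush map A before the rebalancing operation starts
    ceph osd getcrushmap -o crushmap-A.bin
    crushtool -d crushmap-A.bin -o crushmap-A.txt

    # edit crushmap-A.txt so that all "new" OSDs sit under a separate root,
    # then compile the result as A'
    crushtool -c crushmap-A.txt -o crushmap-Aprime.bin

    # in an emergency: save the current map B, inject A', let peering finish,
    # then restore B
    ceph osd getcrushmap -o crushmap-B.bin
    ceph osd setcrushmap -i crushmap-Aprime.bin
    ceph osd setcrushmap -i crushmap-B.bin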


Files

logs02.tgz (377 KB) - Logs of OSD restarts and manual rebuild of placement information. Frank Schilder, 08/06/2020 01:50 PM
sess.log (41.9 KB) - Session log with annotations. Frank Schilder, 08/11/2020 12:04 PM

Related issues (1 total: 0 open, 1 closed)

Related to RADOS - Bug #37439: Degraded PG does not discover remapped data on originating OSD (Resolved, 11/28/2018)
