Bug #46847

Loss of placement information on OSD reboot

Added by Frank Schilder 4 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

When an "old" OSD is rebooted while the cluster is rebalancing after new disks have been added, the cluster loses placement information. This results in an unnecessary and long-lasting degradation of redundancy. See also: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/C7C4W4APGGQL7VAJP4S3KEGL4RMUN4LZ/

How to reproduce:

- Start with a healthy cluster with crush map A. All OSDs present at this time are referred to as "old" OSDs.
- Add new disks to the cluster, which will now have a new crush map B. All OSDs added in this step are referred to as "new" OSDs.
- Let the rebalancing run for a while, execute "ceph osd set norebalance" and wait for recovery I/O to stop.
- The cluster should now be in HEALTH_WARN, showing misplaced objects and the norebalance flag. There should be no other warnings.
- Stop one of the "old" OSDs and wait for peering to finish. A lot of PGs will go undersized/degraded.
- Start the OSD again and wait for peering to finish. A lot of PGs will remain undersized/degraded, even though the OSD is back up and all the "missing" objects are actually present on it.
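
A minimal command sketch of the steps above, assuming osd.0 is one of the "old" OSDs (the OSD id and the way the new disks are added are only examples):

    ceph osd set norebalance       # after adding the new OSDs and letting rebalancing run for a while
    ceph status                    # wait until recovery I/O has stopped: misplaced objects + norebalance flag, nothing else
    systemctl stop ceph-osd@0      # stop one "old" OSD on its host and wait for peering
    ceph health detail             # a lot of PGs are now undersized/degraded
    systemctl start ceph-osd@0     # start the OSD again and wait for peering
    ceph health detail             # the undersized/degraded PGs remain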

Interestingly, one can manually recover from this by:

- Restoring the crush sub-tree of map A by moving all "new" OSDs out of the crush tree to a different root.
- Waiting for peering to complete. At this point, the undersized/degraded PGs should already be gone.
- Restoring crush map B by moving the "new" OSDs back.
- The cluster status then recovers to the state from before the "old" OSD was shut down.
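
A command sketch of this manual recovery, assuming osd.289 and osd.292 are "new" OSDs (as in the session log further down); bucket names, weights and the original host are placeholders:

    ceph osd crush add-bucket parking root                             # temporary root outside the production tree
    ceph osd crush add-bucket parking-host host
    ceph osd crush move parking-host root=parking
    ceph osd crush set osd.289 8.0 root=parking host=parking-host      # repeat for every "new" OSD
    ceph osd crush set osd.292 8.0 root=parking host=parking-host
    ceph osd crush set osd.289 8.0 root=default host=<original-host>   # after peering: degraded PGs gone,
    ceph osd crush set osd.292 8.0 root=default host=<original-host>   # move the "new" OSDs back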

An archive with logs demonstrating the above observations is attached to this case. The log shows the result of restarting an "old" and a "new" OSD. The cluster recovers as expected when a "new" OSD is restarted, but not when an "old" OSD is restarted.

Hypothesis:

During a rebalancing operation as described above, computing all possible locations of objects requires more than one crush map, since some objects are still placed according to map A while others are already placed according to map B. Ceph either needs to store all crush maps that have active placements, or needs to store all re-mappings based on an old crush map until either (A) the remapped PG has completed rebalancing or (B) an "old" OSD that is part of the source PG is permanently purged from the cluster. In all other circumstances, such re-mappings must be preserved. However, it looks like these re-mappings are currently removed immediately when an OSD goes down and cannot be recovered afterwards, because crush map A is no longer available for re-computing the placements.

An alternative would be for a booting OSD to check whether it holds objects that are on the cluster's "wanted list" and to restore the re-mappings accordingly.

Workaround:

Before executing a larger rebalancing operation, create a crush map A' based on crush map A, with all "new" OSDs located under a separate root, and save it somewhere. If redundancy is lost due to a reboot, network outage, etc., use this map to perform the recovery procedure described in the second part above.
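
A possible way to prepare and later apply such a map (file names are examples; the edit of the decompiled map has to be done by hand):

    ceph osd getcrushmap -o crushmap-A.bin          # save the crush map right after adding the new OSDs
    crushtool -d crushmap-A.bin -o crushmap-A.txt
    # edit the text into A': keep map A's tree and place all "new" OSDs under a separate root
    crushtool -c crushmap-A-prime.txt -o crushmap-A-prime.bin
    ceph osd getcrushmap -o crushmap-B.bin          # if redundancy is lost later: save the current map B first
    ceph osd setcrushmap -i crushmap-A-prime.bin    # inject A' and wait for peering (degraded PGs should clear)
    ceph osd setcrushmap -i crushmap-B.bin          # then restore map B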

logs02.tgz - Logs of OSD restarts and manual rebuild of placement information. (377 KB) Frank Schilder, 08/06/2020 01:50 PM

sess.log - Session log with annotations. (41.9 KB) Frank Schilder, 08/11/2020 12:04 PM


Related issues

Related to RADOS - Bug #37439: Degraded PG does not discover remapped data on originating OSD (Resolved, 11/28/2018)

History

#1 Updated by Jonas Jelten 4 months ago

  • Related to Bug #37439: Degraded PG does not discover remapped data on originating OSD added

#2 Updated by Jonas Jelten 4 months ago

We have had this problem for a long time; one cause was resolved in #37439. But it still persists in some cases, and I have not yet dug deeply into the remaining causes.

Can you provide ceph pg $id dump for the pg that is degraded? It will probably say (at the bottom) that it did query the OSD where we expect the data to be, but did not succeed (the might_have_unfound part in pg dump).

What also works to recover is to revert the change to the placement, wait for peering, and then restore it. The data will be found and will then only be "misplaced" again.

#3 Updated by Frank Schilder 4 months ago

Thanks a lot for this info. There have been a few more scenarios discussed on the ceph-users list, all involving changes to the crush map (adding OSDs, re-weighting, etc.) while OSDs were down or restarting at the same time. In all such cases, objects of EC pools go missing.

My description indeed matches scenario 1 in #37439 and I'm wondering why it still shows up when this particular situation was fixed.

Can you provide ceph pg $id dump for the pg that is degraded? It will probably say (at the bottom)
that it did query the OSD where we expect the data to be, but did not succeed (the might_have_unfound part in pg dump).

I will. Could you please let me know what commands you have in mind here:

What also works to recover is to revert the change to the placement, wait for peering, and then restore it. The data will be found and will then only be "misplaced" again.

I would like to try this as a recovery method.

#4 Updated by Frank Schilder 3 months ago

I repeated the experiment and the result is very different from the other case descriptions. Apparently, part of the issue is indeed already fixed. It shows that there is still a problem with the discovery of objects in PGs that were in state "...+remapped+backfilling" prior to the OSD shutdown. The condensed info after shutdown and restart of an "old" OSD is:

# ceph status
            7954306/1498800854 objects misplaced (0.531%)
            Degraded data redundancy: 208493/1498800854 objects degraded (0.014%), 3 pgs degraded, 3 pgs undersized

# ceph health detail
    pg 11.a is stuck undersized for 311.488352, current state active+undersized+degraded+remapped+backfilling, last acting [170,156,148,2147483647,234,86,236,232]

# ceph pg 11.a query | jq ".acting,.up,.recovery_state" 
    "might_have_unfound": [
      {
        "osd": "74(3)",
        "status": "already probed" 
      }
    ],
    "recovery_progress": {
      "backfill_targets": [
        "289(3)",
        "292(2)" 
      ],

Moving the "new" OSDs out and back into the crush sub-tree results in:

# After moving "new" OSDs out of tree:
# ceph status
            59942033/1498816658 objects misplaced (3.999%)
            1 slow ops, oldest one blocked for 62 sec, mon.ceph-03 has slow ops

# No degraded objects any more, but an operation gets stuck in a monitor (requires restart later).

# ceph pg 11.a query | jq ".acting,.up,.recovery_state" 
    "might_have_unfound": [
      {
        "osd": "74(3)",
        "status": "already probed" 
      },
      {
        "osd": "86(5)",
        "status": "already probed" 
      },
      {
        "osd": "148(2)",
        "status": "already probed" 
      },
      {
        "osd": "156(1)",
        "status": "already probed" 
      },
      {
        "osd": "232(7)",
        "status": "already probed" 
      },
      {
        "osd": "234(4)",
        "status": "already probed" 
      },
      {
        "osd": "236(6)",
        "status": "already probed" 
      },
      {
        "osd": "289(3)",
        "status": "not queried" 
      },
      {
        "osd": "292(2)",
        "status": "not queried" 
      }
    ],
    "recovery_progress": {
      "backfill_targets": [],

# After moving "new" OSDs back into tree:
# ceph status
            8630330/1498837232 objects misplaced (0.576%)
            8 slow ops, oldest one blocked for 212 sec, daemons [osd.169,osd.234,osd.288,osd.63,mon.ceph-03] have slow ops.

# Slow OPS show up for some reason. This is somewhat strange, I did not see this during the other peering operations.
# A bit later:

# ceph status
            8630330/1498844491 objects misplaced (0.576%)
            1 slow ops, oldest one blocked for 247 sec, mon.ceph-03 has slow ops

# ceph pg 11.a query | jq ".acting,.up,.recovery_state" 
    "might_have_unfound": [],
    "recovery_progress": {
      "backfill_targets": [
        "289(3)",
        "292(2)" 
      ],

Please have a look at the full annotated session log I attached.

Could it be that Ceph is losing track of objects in PGs that are backfilling, that is, PGs that have some objects in the old and some already in the new location? In the top snippet, PG 11.a remains stuck undersized even though it seems to know that OSD 74 should take the empty slot. I don't understand why it doesn't come back up complete at this point.
