
Bug #46847

Loss of placement information on OSD reboot

Added by Frank Schilder 7 months ago. Updated 18 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: -
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature: -

Description

During rebalancing after adding new disks to a cluster, the cluster loses placement information when an "old" OSD is rebooted. This results in an unnecessary and long-lasting degradation of redundancy. See also: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/C7C4W4APGGQL7VAJP4S3KEGL4RMUN4LZ/

How to reproduce:

- Start with a healthy cluster with crush map A. All OSDs present at this time are referred to as "old" OSDs.
- Add new disks to the cluster, which will now have a new crush map B. All OSDs added in this step are referred to as "new" OSDs.
- Let the rebalancing run for a while, then execute "ceph osd set norebalance" and wait for recovery I/O to stop.
- The cluster should now be in HEALTH_WARN, showing misplaced objects and the norebalance flag. There should be no other warnings.
- Stop one of the "old" OSDs and wait for peering to finish. A lot of PGs will go undersized/degraded.
- Start the OSD again and wait for peering to finish. A lot of PGs will remain undersized/degraded, even though the OSD is back and all the missing objects are actually present.
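For reference, a minimal command sequence for the steps above (a sketch only: osd.74 is an example id, and the systemd unit name assumes a standard ceph-osd deployment):

    # ceph osd set norebalance
    # ceph status
    # Wait until recovery I/O has stopped, then stop one "old" OSD on its host:
    # systemctl stop ceph-osd@74
    # ceph health detail
    # After peering, many PGs are undersized/degraded. Restart the OSD:
    # systemctl start ceph-osd@74
    # ceph health detail
    # Many PGs remain undersized/degraded even though osd.74 and its data are back.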

Interestingly, one can manually recover from this by:

- Restore the crush sub-tree of map A by moving all "new" OSDs out of the crush tree to a different root.
- Wait for peering to complete. At this point, the undersized/degraded PGs should already be gone.
- Restore crush map B by moving the "new" OSDs back.
- The cluster status then returns to what it was before the "old" OSD was shut down.
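A sketch of this recovery with the CLI, assuming the "new" OSDs sit in their own host buckets and the original root is called "default" ("recovery-root" and "new-host-1" are placeholder names; new OSDs that share a host with old ones would have to be moved individually):

    # ceph osd crush add-bucket recovery-root root
    # ceph osd crush move new-host-1 root=recovery-root
    # Repeat for every host that holds only "new" OSDs, then wait for peering.
    # The undersized/degraded PGs should clear. Afterwards restore map B:
    # ceph osd crush move new-host-1 root=default
    # ceph osd crush remove recovery-root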

An archive with logs demonstrating the above observations is attached to this case. The logs show the result of restarting an "old" and a "new" OSD. The cluster recovers as expected when a "new" OSD is restarted, but not when an "old" OSD is restarted.

Hypothesis:

During a rebalancing operation as described above, computing all possible locations of objects requires more than one crush map, since some objects are still placed according to map A while others are already placed according to map B. Ceph therefore needs to keep either all crush maps that still have active placements, or all re-mappings derived from an old crush map, until either (A) the remapped PG has completed rebalancing or (B) an "old" OSD that is part of the source PG is permanently purged from the cluster. Under all other circumstances, these re-mappings must be preserved. However, it looks like these re-mappings are currently discarded as soon as an OSD goes down, and they cannot be recovered afterwards because crush map A is no longer available for re-computing the placements.

An alternative would be for a booting OSD to check whether it holds objects that are on the cluster's "wanted list" and to restore the re-mappings accordingly.

Workaround:

If someone is going to execute a larger rebalancing operation, create a crush map A' based on crush map A with all "new" OSDs located under a separate root, and save it somewhere. If redundancy is lost due to a reboot, network outage, etc., this map can be used to perform the recovery procedure described in the second part above.
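One way to prepare and later use such a map A', sketched with getcrushmap/setcrushmap and crushtool (the file names are arbitrary):

    # ceph osd getcrushmap -o map-A
    # crushtool -d map-A -o map-A.txt
    # Edit map-A.txt so that all "new" OSDs hang under a separate root, then recompile:
    # crushtool -c map-A.txt -o map-A-prime
    # Also keep the current map B:
    # ceph osd getcrushmap -o map-B
    # If placement information is lost, inject A', wait for peering, then restore B:
    # ceph osd setcrushmap -i map-A-prime
    # ceph osd setcrushmap -i map-B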

logs02.tgz - Logs of OSD restarts and manual rebuild of placement information. (377 KB) Frank Schilder, 08/06/2020 01:50 PM

sess.log - Session log with annotations. (41.9 KB) Frank Schilder, 08/11/2020 12:04 PM


Related issues

Related to RADOS - Bug #37439: Degraded PG does not discover remapped data on originating OSD (Resolved, 11/28/2018)

History

#1 Updated by Jonas Jelten 7 months ago

  • Related to Bug #37439: Degraded PG does not discover remapped data on originating OSD added

#2 Updated by Jonas Jelten 7 months ago

We have had this problem for a long time; one cause was resolved in #37439. But it still persists in some cases, and I have not yet dug deeply into the remaining causes.

Can you provide ceph pg $id dump for the pg that is degraded? It will probably say (at the bottom) that it did query the OSD where we expect the data to be, but did not succeed (the might_have_unfound part in pg dump).

What also works to recover is to revert the change to the placement, wait for peering, and then re-apply it. The data will be found and will then only be "misplaced" again.
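Something like this should show the relevant part (11.a is just a placeholder PG id, jq is optional):

    # ceph pg 11.a query | jq '.acting, .up, .recovery_state'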

#3 Updated by Frank Schilder 7 months ago

Thanks a lot for this info. There have been a few more scenarios discussed on the users-list, all involving changes to the crush map (adding OSDs, re-weighting, etc) while OSDs were down or restarting at the same time. In all such cases, objects of EC pools go missing.

My description indeed matches scenario 1 in #37439, and I'm wondering why it still shows up even though this particular situation was fixed.

> Can you provide ceph pg $id dump for the pg that is degraded? It will probably say (at the bottom)
> that it did query the OSD where we expect the data to be, but did not succeed (the might_have_unfound part in pg dump).

I will. Could you please let me know what commands you have in mind here:

> What also works to recover is to revert the change to the placement, wait for peering, and then re-apply it. The data will be found and will then only be "misplaced" again.

I would like to try this as a recovery method.

#4 Updated by Frank Schilder 7 months ago

I repeated the experiment and the result is very different from the other case descriptions. Apparently, part of the issue is indeed already fixed. It shows that there is still a problem with discovering objects of PGs that were in state "...+remapped+backfilling" before the OSD was shut down. The condensed output after shutting down and restarting an "old" OSD is:

# ceph status
            7954306/1498800854 objects misplaced (0.531%)
            Degraded data redundancy: 208493/1498800854 objects degraded (0.014%), 3 pgs degraded, 3 pgs undersized

# ceph health detail
    pg 11.a is stuck undersized for 311.488352, current state active+undersized+degraded+remapped+backfilling, last acting [170,156,148,2147483647,234,86,236,232]

# ceph pg 11.a query | jq ".acting,.up,.recovery_state" 
    "might_have_unfound": [
      {
        "osd": "74(3)",
        "status": "already probed" 
      }
    ],
    "recovery_progress": {
      "backfill_targets": [
        "289(3)",
        "292(2)" 
      ],

Moving the "new" OSDs out and back into the crush sub-tree results in:

# After moving "new" OSDs out of tree:
# ceph status
            59942033/1498816658 objects misplaced (3.999%)
            1 slow ops, oldest one blocked for 62 sec, mon.ceph-03 has slow ops

# No degraded objects any more, but an operation gets stuck in a monitor (requires restart later).

# ceph pg 11.a query | jq ".acting,.up,.recovery_state" 
    "might_have_unfound": [
      {
        "osd": "74(3)",
        "status": "already probed" 
      },
      {
        "osd": "86(5)",
        "status": "already probed" 
      },
      {
        "osd": "148(2)",
        "status": "already probed" 
      },
      {
        "osd": "156(1)",
        "status": "already probed" 
      },
      {
        "osd": "232(7)",
        "status": "already probed" 
      },
      {
        "osd": "234(4)",
        "status": "already probed" 
      },
      {
        "osd": "236(6)",
        "status": "already probed" 
      },
      {
        "osd": "289(3)",
        "status": "not queried" 
      },
      {
        "osd": "292(2)",
        "status": "not queried" 
      }
    ],
    "recovery_progress": {
      "backfill_targets": [],

# After moving "new" OSDs back into tree:
# ceph status
            8630330/1498837232 objects misplaced (0.576%)
            8 slow ops, oldest one blocked for 212 sec, daemons [osd.169,osd.234,osd.288,osd.63,mon.ceph-03] have slow ops.

# Slow OPS show up for some reason. This is somewhat strange, I did not see this during the other peering operations.
# A bit later:

# ceph status
            8630330/1498844491 objects misplaced (0.576%)
            1 slow ops, oldest one blocked for 247 sec, mon.ceph-03 has slow ops

# ceph pg 11.a query | jq ".acting,.up,.recovery_state" 
    "might_have_unfound": [],
    "recovery_progress": {
      "backfill_targets": [
        "289(3)",
        "292(2)" 
      ],

Please have a look at the full annotated session log I attached.

Could it be that ceph is losing track of objects of PGs that are backfilling, that is, PGs that have some objects in the old and some already in the new location? In the top snippet, PG 11.a remains stuck undersized even though it seems to know that OSD 74 should fill the empty slot. I don't understand why it doesn't come back up complete at this point.

#5 Updated by Jonas Jelten 3 months ago

Ok, that's the same state I see our PGs in when they become degraded due to remappings (might_have_unfound: already_probed).

The command to restore the peering depends on what caused the remap. E.g. if you took an OSD out, you'd put it in again, wait for rediscovery, and then take it out again. It seems that the peering/pglog discovery only finds the data if the crush state would place the data exactly where it was before.
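As a sketch of that "revert, wait, re-apply" idea for the out/in case (osd.74 is just an example id):

    # ceph osd in 74
    # Wait for peering; the degraded objects should be found and become merely misplaced.
    # ceph osd out 74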

The already_probed state seems to hint that somehow the PG states did not match, so the originating PG wasn't considered as source (even though it is the (only) suitable shard source). Ceph knew to query the correct OSD (74), so the rediscovery method is not totally broken.

In your log this is only a problem because the "new" OSDs are not the source of the data movement; the "old" OSD is (hence the degraded data).

So I guess we have to stare at the code and see why the probing doesn't succeed :(

#6 Updated by Frank Schilder 18 days ago

Thanks for getting back on this. Your observations are exactly what I see as well. A note about the severity of this bug: when looking around for causes of data loss on Ceph, the number one is size=2, min_size=1 replication. The problem reported here is actually number two! The most recent case I know of is: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA/#QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA . However, there are more cases; users just don't realise that they were bitten by incomplete object discovery.

The toxic ingredients are:

- EC pools (in particular with large replication factors like 8+2, 8+3)
- adding new OSDs/rebalancing
- power outage (or many simultaneous reboots)

After such an incident, some PGs tend to remain incomplete. If OSDs then start erasing unreferenced objects, data loss occurs.

To be prepared, when adding OSDs I now use an extended procedure:

- add new OSDs outside the destination tree
- save the crush map (getcrushmap) to "map-before"
- set norebalance, ...
- move all new OSDs to their destination and wait for peering to finish
- save the crush map (getcrushmap) to "map-after"
- unset norebalance, ...

This way, I can very quickly switch between the before and after crush map in case something goes wrong.
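The switch itself is then just a matter of re-injecting the saved maps (a sketch, using the file names from the list above):

    # ceph osd setcrushmap -i map-before
    # Wait for peering; placement information should be rediscovered.
    # ceph osd setcrushmap -i map-after
    # Once everything is clean again, unset norebalance to continue.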

I really hope you find the reason.

#7 Updated by Jonas Jelten 18 days ago

Given the "severity" I'd be really glad if some of the Ceph core devs could have a look at this :) I'm really not that familiar with the codebase.
