Bug #46847

Loss of placement information on OSD reboot

Added by Frank Schilder over 3 years ago. Updated over 1 year ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy,pacific,octopus
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During rebalancing after adding new disks to a cluster, the cluster loses placement information when an "old" OSD reboots. This results in an unnecessary and long-lasting degradation of redundancy. See also: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/C7C4W4APGGQL7VAJP4S3KEGL4RMUN4LZ/

How to reproduce:

- Start with a healthy cluster with crush map A. All OSDs present at this time are referred to as "old" OSDs.
- Add new disks to the cluster, which will now have a new crush map B. All OSDs added in this step are referred to as "new" OSDs.
- Let the rebalancing run for a while, execute "ceph osd set norebalance" and wait for recovery I/O to stop.
- The cluster should now be in HEALTH_WARN, showing misplaced objects and the norebalance flag. There should be no other warnings.
- Stop one of the "old" OSDs and wait for peering to finish. A lot of PGs will go undersized/degraded.
- Start the OSD again and wait for peering to finish. A lot of PGs will remain undersized/degraded despite the fact that everything is restored and all the missing objects are actually present.
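For concreteness, a minimal shell sketch of the reproduction steps above (the OSD id osd.12 and systemd-managed OSDs are assumptions for illustration, not taken from the cluster in this report):

# after adding the new OSDs, let rebalancing run for a while, then pause it
ceph osd set norebalance
ceph status                    # wait until recovery I/O has stopped

# only misplaced objects and the norebalance flag should be reported
ceph health detail

# stop an "old" OSD and wait for peering; many PGs go undersized/degraded
systemctl stop ceph-osd@12     # assuming systemd-managed OSDs
ceph status

# start it again; many PGs stay undersized/degraded although all objects are present
systemctl start ceph-osd@12
ceph status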

Interestingly, one can manually recover from this by:

- Restoring the crush sub-tree of map A by moving all "new" OSDs out of the crush tree to a different root.
- Wait for peering to complete. At this point, undersized/degraded PGs should already be gone.
- Restore crush map B by moving the "new" OSDs back.
- The cluster status will recover to the status from before shutting down the "old" OSD.
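A sketch of that manual recovery, assuming the "new" OSDs sit on their own host buckets; the host names and the temporary root "parking" are purely illustrative:

# park the "new" hosts under a temporary root so the effective map is A again
ceph osd crush add-bucket parking root
ceph osd crush move new-host-1 root=parking    # repeat for every host holding "new" OSDs
ceph osd crush move new-host-2 root=parking
ceph status                                    # wait for peering; undersized/degraded should clear

# restore map B by moving the hosts back and removing the temporary root
ceph osd crush move new-host-1 root=default    # adjust to the real original location
ceph osd crush move new-host-2 root=default
ceph osd crush rm parking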

An archive with logs demonstrating the above observations is attached to this case. The log shows the result of restarting an "old" and a "new" OSD. The cluster recovers as expected when a "new" OSD is restarted, but not when an "old" OSD is restarted.

Hypothesis:

During a rebalancing operation as described above, computing all possible locations of objects requires more than one crush map, as some objects are still placed according to map A while others are already placed according to map B. Ceph either needs to keep every crush map that still has active placements, or keep all re-mappings based on an old crush map until either (A) the remapped PG has completed rebalancing or (B) an "old" OSD that is part of the source PG is permanently purged from the cluster. In all other circumstances, such re-mappings must be preserved. However, it looks like these re-mappings are currently removed as soon as an OSD goes down and cannot be recovered afterwards, because crush map A is no longer available for re-computing the placements.

An alternative would be that a booting OSD checks if it holds objects that are on the cluster's "wanted list" to restore re-mappings.

Workaround:

Before executing a larger rebalancing operation, create a crush map A' based on crush map A with all "new" OSDs located under a separate root and save it somewhere. If redundancy is lost due to a reboot, network outage, etc., use this map to perform the recovery operation described in the second part above.
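A sketch of this workaround with the standard crush tooling; the file names are illustrative and editing A' by hand is assumed:

# before the expansion: save crush map A and derive A' from it
ceph osd getcrushmap -o crushmap-A.bin
crushtool -d crushmap-A.bin -o crushmap-A.txt
# edit crushmap-A.txt into crushmap-Aprime.txt, placing all "new" OSDs under a separate root
crushtool -c crushmap-Aprime.txt -o crushmap-Aprime.bin

# after the expansion: save the new map B as well
ceph osd getcrushmap -o crushmap-B.bin

# in an emergency: inject A', wait for peering to restore redundancy, then go back to B
ceph osd setcrushmap -i crushmap-Aprime.bin
ceph status                                    # wait until undersized/degraded PGs are gone
ceph osd setcrushmap -i crushmap-B.bin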


Files

logs02.tgz (377 KB) - Logs of OSD restarts and manual rebuild of placement information. Frank Schilder, 08/06/2020 01:50 PM
sess.log (41.9 KB) - Session log with annotations. Frank Schilder, 08/11/2020 12:04 PM

Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #37439: Degraded PG does not discover remapped data on originating OSD (Resolved, 11/28/2018)

Actions #1

Updated by Jonas Jelten over 3 years ago

  • Related to Bug #37439: Degraded PG does not discover remapped data on originating OSD added
Actions #2

Updated by Jonas Jelten over 3 years ago

We have had this problem for a long time; one cause was resolved in #37439. But it still persists in some cases, and I have not yet dug deeply into the remaining causes.

Can you provide ceph pg $id dump for the pg that is degraded? It will probably say (at the bottom) that it did query the OSD where we expect the data to be, but did not succeed (the might_have_unfound part in pg dump).

What also works to recover is to revert the change to the placement, wait for peering, and then restore it. The data will be found and will then only be "misplaced" again.

Actions #3

Updated by Frank Schilder over 3 years ago

Thanks a lot for this info. There have been a few more scenarios discussed on the users list, all involving changes to the crush map (adding OSDs, re-weighting, etc.) while OSDs were down or restarting at the same time. In all such cases, objects of EC pools go missing.

My description indeed matches scenario 1 in #37439 and I'm wondering why it still shows up when this particular situation was fixed.

Can you provide ceph pg $id dump for the pg that is degraded? It will probably say (at the bottom)
that it did query the OSD where we expect the data to be, but did not succeed (the might_have_unfound part in pg dump).

I will. Could you please let me know what commands you have in mind here:

What also works to recover is revert the change to the placement, wait for peering, and then restore it. Data will be found, and then only be "misplaced" again.

I would like to try this as a recovery method.

Actions #4

Updated by Frank Schilder over 3 years ago

I repeated the experiment and the result is very different from the other case descriptions. Apparently, part of the issue is indeed already fixed. It shows that there is still a problem with the discovery of objects in PGs that were in the state "...+remapped+backfilling" prior to the OSD shutdown. The condensed info after shutdown and restart of an "old" OSD is:

# ceph status
            7954306/1498800854 objects misplaced (0.531%)
            Degraded data redundancy: 208493/1498800854 objects degraded (0.014%), 3 pgs degraded, 3 pgs undersized

# ceph health detail
    pg 11.a is stuck undersized for 311.488352, current state active+undersized+degraded+remapped+backfilling, last acting [170,156,148,2147483647,234,86,236,232]

# ceph pg 11.a query | jq ".acting,.up,.recovery_state" 
    "might_have_unfound": [
      {
        "osd": "74(3)",
        "status": "already probed" 
      }
    ],
    "recovery_progress": {
      "backfill_targets": [
        "289(3)",
        "292(2)" 
      ],

Moving the "new" OSDs out and back into the crush sub-tree results in:

# After moving "new" OSDs out of tree:
# ceph status
            59942033/1498816658 objects misplaced (3.999%)
            1 slow ops, oldest one blocked for 62 sec, mon.ceph-03 has slow ops

# No degraded objects any more, but an operation gets stuck in a monitor (requires restart later).

# ceph pg 11.a query | jq ".acting,.up,.recovery_state" 
    "might_have_unfound": [
      {
        "osd": "74(3)",
        "status": "already probed" 
      },
      {
        "osd": "86(5)",
        "status": "already probed" 
      },
      {
        "osd": "148(2)",
        "status": "already probed" 
      },
      {
        "osd": "156(1)",
        "status": "already probed" 
      },
      {
        "osd": "232(7)",
        "status": "already probed" 
      },
      {
        "osd": "234(4)",
        "status": "already probed" 
      },
      {
        "osd": "236(6)",
        "status": "already probed" 
      },
      {
        "osd": "289(3)",
        "status": "not queried" 
      },
      {
        "osd": "292(2)",
        "status": "not queried" 
      }
    ],
    "recovery_progress": {
      "backfill_targets": [],

# After moving "new" OSDs back into tree:
# ceph status
            8630330/1498837232 objects misplaced (0.576%)
            8 slow ops, oldest one blocked for 212 sec, daemons [osd.169,osd.234,osd.288,osd.63,mon.ceph-03] have slow ops.

# Slow OPS show up for some reason. This is somewhat strange, I did not see this during the other peering operations.
# A bit later:

# ceph status
            8630330/1498844491 objects misplaced (0.576%)
            1 slow ops, oldest one blocked for 247 sec, mon.ceph-03 has slow ops

# ceph pg 11.a query | jq ".acting,.up,.recovery_state" 
    "might_have_unfound": [],
    "recovery_progress": {
      "backfill_targets": [
        "289(3)",
        "292(2)" 
      ],

Please have a look at the full annotated session log I attached.

Could it be that ceph is losing track of objects in PGs that are backfilling, that is, PGs that have some objects in the old and some already in the new location? In the top snippet, PG 11.a remains stuck undersized even though it seems to know that OSD 74 should take the empty slot. I don't understand why it doesn't come back up complete at this point.

Actions #5

Updated by Jonas Jelten over 3 years ago

Ok, that's the same state I see our PGs in when they become degraded due to remappings (might_have_unfound: already_probed).

The command to restore the peering depends on what caused the remap. E.g. if you took an OSD out, you'd put it in again, wait for rediscovery, and take it out again. It seems that the peering/pglog discovery only finds the data if the crush state would place the data exactly where it was before.
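For the out/in case that could look like the following (the OSD id is illustrative):

# assuming the remap was caused by marking osd.42 out
ceph osd in 42      # temporarily restore the previous placement
ceph status         # wait for peering; the degraded objects should be rediscovered
ceph osd out 42     # re-apply the intended change; the objects become only "misplaced"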

The already_probed state seems to hint that somehow the PG states did not match, so the originating PG wasn't considered as a source (even though it is the only suitable shard source). Ceph knew to query the correct OSD (74), so the rediscovery method is not totally broken.

In your log this is only a problem because the "new" OSDs are not the source of the movement, but the "old" OSD is (hence the degraded data).

So I guess we have to stare at the code and see why the probing doesn't succeed :(

Actions #6

Updated by Frank Schilder about 3 years ago

Thanks for getting back on this. Your observations are exactly what I see as well. A note on the severity of this bug: when looking around for causes of data loss on Ceph, the number one is size=2, min_size=1 replication. The problem reported here is actually number two! The most recent case I know of is: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA/#QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA . However, there are more; users just don't realise that they were bitten by incomplete object discovery.

The toxic ingredients are:

- EC pools (in particular with large replication factors like 8+2, 8+3)
- adding new OSDs/rebalancing
- power outage (or many simultaneous reboots)

After such an incident, some PGs tend to remain incomplete. If OSDs then start erasing unreferenced objects, data loss occurs.

To be prepared, when adding OSDs I now use an extended procedure:

- add new OSDs outside the destination tree
- save the crush map (getcrushmap) to "map-before"
- set norebalance, ...
- move all OSDs to destination and wait for peering to finish
- save the crush map (getcrushmap) to "map-after"
- unset norebalance, ...

This way, I can very quickly switch between the before and after crush map in case something goes wrong.

I really hope you find the reason.

Actions #7

Updated by Jonas Jelten about 3 years ago

Given the "severity" I'd be really glad if some of the Ceph core devs could have a look at this :) I'm really not that familiar with the codebase.

Actions #8

Updated by Dan van der Ster about 3 years ago

  • Target version deleted (v13.2.11)
  • Affected Versions v14.2.19 added

I think that this issue is the same as what I observed in https://tracker.ceph.com/issues/49104

We reproduce this any time we drain a machine (set crush weight to 0 for a server): if any of those draining OSDs then flap, the PGs they host remain undersized+degraded until we re-peer the PG (e.g. by `ceph osd down <primary>`).
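In shell terms that scenario looks roughly like this (host/OSD/PG ids are illustrative; `ceph osd down` only marks the daemon down so it re-peers, it does not stop it):

# drain a server by setting the crush weight of its OSDs to 0
ceph osd crush reweight osd.10 0
ceph osd crush reweight osd.11 0

# if one of the draining OSDs now flaps, its PGs stay undersized+degraded;
# work around it by re-peering, e.g. by marking the PG's primary down
ceph pg map 11.a          # the first entry of the acting set is normally the primary
ceph osd down 170         # forces the PGs hosted by osd.170 to re-peer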

Actions #9

Updated by Josh Durgin about 3 years ago

  • Project changed from Ceph to RADOS
  • Priority changed from Normal to High
Actions #10

Updated by Dan van der Ster about 3 years ago

I'm working on a test reproducer here: https://github.com/ceph/ceph/pull/40849

Actions #11

Updated by Jonas Jelten almost 3 years ago

It seems it does find the data when I issue ceph pg repeer $pgid. Observed on MON 14.2.21 with all OSDs 14.2.15.

Actions #12

Updated by Frank Schilder almost 3 years ago

I had a look at the reproducer and am not entirely sure whether it is equivalent to the problem discussed here. It might be yet another situation where peering doesn't complete. I looked at the code of the function TEST_flap_ec and would like to ask whether an additional test with the following modifications could be added to the suite:

function TEST_flap_ec() {
    [...]
    # set norebalance so that we don't backfill
    ceph osd set norebalance
    wait_for_clean

    # add new OSDs
    run_osd $dir 5
    run_osd $dir 6
    wait_for_peered # wait_for_clean??

    # We now have old osds=0,...,4 and
    # new OSDs 5,6. Flapping an old osd leads to degraded objects.
    # flap osd.0 while rebalancing
    ceph osd unset norebalance
    sleep 10 # let rebalancing progress a bit
    ceph osd down 0
    wait_for_osd up 0
    wait_for_peered

    # now there should be zero undersized or degraded pgs
    # I don't recall if I saw degraded PGs or only degraded objects.
    ceph pg debug degraded_pgs_exist | grep -q FALSE
}

This is the situation where I observed non-recoverable object degradation. A similar test could be added for replicated pools.

Actions #13

Updated by Dan van der Ster almost 3 years ago

Last week we had a power outage affecting all OSD machines in a 14.2.20 cluster. A small percentage of PGs didn't activate after the machines all booted: we had 40 undersized+degraded+remapped+backfilling+peered and 7 undersized+degraded+remapped+backfill_wait+peered.

All of those PGs activated after we issued `ceph pg repeer $pgid` on them.
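When many PGs are stuck like this, a loop over the stuck PGs saves typing; the parsing below assumes the plain-text output of `ceph pg dump_stuck` (first column is the PG id):

# re-peer every PG currently reported as stuck undersized
for pgid in $(ceph pg dump_stuck undersized 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {print $1}'); do
    echo "repeering ${pgid}"
    ceph pg repeer "${pgid}"
done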

Actions #14

Updated by Neha Ojha about 2 years ago

  • Status changed from New to Need More Info

Is this issue reproducible in Octopus or later?

Actions #15

Updated by Malcolm Haak about 2 years ago

Neha Ojha wrote:

Is this issue reproducible in Octopus or later?

Yes. I hit it last night. It's minced one of my EC pools quite nicely.

Actions #16

Updated by Malcolm Haak about 2 years ago

Sorry, I should add some context/data.

ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)

15 OSDs
Pool is 4+2 EC

I was doing a rebalance. During this, OSDs started crashing. Systemd restarted them, but they would just crash again. I got a few different errors: some were asserts in PG_log, others were something about permissions.
Anyway, I restarted the whole cluster and it came back up and stopped crashing.

But by this point I've got

[WRN] OBJECT_UNFOUND: 2880/9433453 objects unfound (0.031%)
    pg 9.95 has 19 unfound objects
    pg 9.92 has 32 unfound objects
    pg 9.84 has 35 unfound objects
    pg 9.8e has 35 unfound objects
    pg 9.88 has 23 unfound objects
    pg 9.33 has 1 unfound objects
    pg 9.3c has 17 unfound objects
    pg 9.3d has 132 unfound objects
    pg 9.3e has 80 unfound objects
    pg 9.3f has 35 unfound objects
    pg 9.39 has 235 unfound objects
    pg 9.24 has 190 unfound objects
    pg 9.20 has 31 unfound objects
    pg 9.22 has 1 unfound objects
    pg 9.2d has 10 unfound objects
    pg 9.28 has 215 unfound objects
    pg 9.29 has 1 unfound objects
    pg 9.2 has 118 unfound objects
    pg 9.1a has 121 unfound objects
    pg 9.18 has 23 unfound objects
    pg 9.1f has 65 unfound objects
    pg 9.1e has 30 unfound objects
    pg 9.1c has 23 unfound objects
    pg 9.13 has 154 unfound objects
    pg 9.30 has 168 unfound objects
    pg 9.36 has 60 unfound objects
    pg 9.34 has 8 unfound objects
    pg 9.4e has 30 unfound objects
    pg 9.47 has 17 unfound objects
    pg 9.5e has 4 unfound objects
    pg 9.51 has 32 unfound objects
    pg 9.56 has 2 unfound objects
    pg 9.54 has 50 unfound objects
    pg 9.6a has 33 unfound objects
    pg 9.6d has 1 unfound objects
    pg 9.61 has 116 unfound objects
    pg 9.67 has 76 unfound objects
    pg 9.65 has 274 unfound objects
    pg 9.79 has 111 unfound objects
    pg 9.7f has 116 unfound objects
    pg 9.7c has 36 unfound objects
    pg 9.71 has 120 unfound objects

Glad I didn't move the data I cared about into EC.

Actions #17

Updated by Malcolm Haak about 2 years ago

So, in even more fun news: I created the EC pool according to the instructions provided in the documentation.

It set the min_size on a pool with k=6 to 3.

So that's a fun default.

It set the min_size on a pool with k=10 to 3 as well.

I'm hoping this behavior has changed.

Actions #18

Updated by Neha Ojha about 2 years ago

Can you share your ec profile and the output of "ceph osd pool ls detail"?

Actions #19

Updated by Frank Schilder about 2 years ago

Could somebody please set the status back to open and Affected Versions to all?

Actions #20

Updated by Neha Ojha about 2 years ago

  • Backport set to quincy,pacific,octopus

Frank Schilder wrote:

Could somebody please set the status back to open and Affected Versions to all?

The ticket is open and "Need More Info" means that we are waiting for data requested in https://tracker.ceph.com/issues/46847#note-18. I have also updated the backport field appropriately.

Actions #21

Updated by Malcolm Haak about 2 years ago

Neha Ojha wrote:

Frank Schilder wrote:

Could somebody please set the status back to open and Affected Versions to all?

The ticket is open and "Need More Info" means that we are waiting for data requested in https://tracker.ceph.com/issues/46847#note-18. I have also updated the backport field appropriately.

Just had a power outage while the cluster was doing a rebalance. I've now got recovery_unfound PGs. I understand that it might be unhappy about that, but actually losing objects and being unable to recover from where it was seems problematic.

ceph osd pool ls detail

pool 2 'cephfs_data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 400 pgp_num 400 autoscale_mode warn last_change 369876 lfor 14606/14606/282614 flags hashpspool,nearfull stripe_width 0 compression_mode aggressive application cephfs
pool 3 'cephfs_metadata' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 369876 lfor 0/350719/350717 flags hashpspool,nearfull stripe_width 0 application cephfs
pool 5 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 369876 lfor 14610/362720/362718 flags hashpspool,nearfull,selfmanaged_snaps stripe_width 0 compression_mode aggressive application rbd
pool 7 'kube' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 369876 lfor 0/332913/332911 flags hashpspool,nearfull,selfmanaged_snaps stripe_width 0 compression_mode aggressive application rbd
pool 8 'single_fs' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 369876 lfor 0/280013/280011 flags hashpspool,nearfull stripe_width 0 application cephfs
pool 9 'ec_cephfs' erasure profile malsprofile size 8 min_size 6 crush_rule 4 object_hash rjenkins pg_num 150 pgp_num 150 autoscale_mode off last_change 369876 lfor 0/286579/286720 flags hashpspool,ec_overwrites,nearfull stripe_width 24576 application cephfs

this is the ec profile in use

ceph osd erasure-code-profile get malsprofile
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8

Actions #22

Updated by Malcolm Haak about 2 years ago

Oh, also: ceph pg repeer has not totally worked. I have a single object remaining unfound.

  data:
    pools:   6 pools, 678 pgs
    objects: 9.43M objects, 32 TiB
    usage:   52 TiB used, 6.0 TiB / 58 TiB avail
    pgs:     7/50639976 objects degraded (0.000%)
             34578/50639976 objects misplaced (0.068%)
             1/9427326 objects unfound (0.000%)
             673 active+clean
             3   active+clean+inconsistent
             1   active+recovery_unfound+degraded
             1   active+remapped+backfilling

Actions #23

Updated by Radoslaw Zarzynski about 2 years ago

Please notice that min_size is set to k (6), so there is little redundancy guaranteed.
Anyway, are you able to reproduce / do you have any logs from this occurrence?
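For reference, the usual recommendation for EC pools is min_size = k + 1; a sketch using the pool name from the output above:

# check and raise min_size on the 6+2 pool so that writes stop one failure earlier
ceph osd pool get ec_cephfs min_size
ceph osd pool set ec_cephfs min_size 7      # k + 1 for a k=6, m=2 profile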

Actions #24

Updated by Yao Ning almost 2 years ago

We also encounter a similar issue with an EC pool during rebalancing: sometimes (after an OSD overload or a PG peering crash), when the OSD automatically reboots, the PG becomes undersized + degraded and the original PG shard on that OSD disappears during peering.

2022-05-18 09:38:00.162 7fd4860e2700 1 osd.1011 pg_epoch: 89797 pg[4.bc3s2( v 89660'793269 (88142'790192,89660'793269] local-lis/les=86475/86476 n=73239 ec=56866/872 lis/c 86475/83075 les/c/f 86476/83486/61684 86475/89797/84889) [328,869,518,258,288,559,75,281,858,269,699,74,1018,796,462,206,765,340]/[743,526,2147483647,229,676,514,52,288,934,57,598,440,938,99,522,41,180,430]p743(0) r=-1 lpr=89797 pi=[83075,89797)/1 luod=0'0 crt=89660'793269 active+remapped mbc={}] start_peering_interval up [328,869,518,258,288,559,75,281,858,269,699,74,1018,796,462,206,765,340] -> [328,869,518,258,288,559,75,281,858,269,699,74,1018,796,462,206,765,340], acting [743,526,1011,229,676,514,52,288,934,57,598,440,938,99,522,41,180,430] -> [743,526,2147483647,229,676,514,52,288,934,57,598,440,938,99,522,41,180,430], acting_primary 743(0) -> 743, up_primary 328(0) -> 328, role 2 -> -1, features acting 4611087854035861503 upacting 4611087854035861503

Originally the PG shard was on osd.1011, but after osd.1011 suddenly rebooted and the PG re-peered, the acting set says the PG shard on osd.1011 has become unavailable. This is unreasonable, since the pglog must be consecutive across such a short downtime, and the PG should become recovering on this OSD.

So is it possible that, for a PG that is backfilling, a restarted OSD cannot enter a recovering state and instead enters backfilling again, even if only one object was written to this PG during the OSD's downtime?

Actions #25

Updated by Radoslaw Zarzynski almost 2 years ago

Notes from the bug scrub:

1. There is a theoretical way to enter backfill instead of recovery in such a scenario.
2. To exclude it, logs from osd.743 for the same period (the transition to backfill) would be very useful.

Actions #26

Updated by Frank Schilder over 1 year ago

The PR https://github.com/ceph/ceph/pull/40849 for adding the test was marked stale. I left a comment, and it would be great if someone could have a look at it. I am actually not convinced that it is necessary to wait for a fix of a warning message before adding this test to the suite. All versions will fail anyway until the real bug is found and fixed, so I don't see a good reason for this hold-up, in particular given the severity of this long-standing issue.

In addition, my recent testing shows that there is a new, related bug #56995 that was introduced with nautilus or octopus. It is present in octopus 15.2.16/17 (and almost certainly all later versions) and creates a similarly dangerous PG degradation that leads to a random loss of placement information and an immediate service outage for no apparent reason. This problem is not present in mimic 13.2.10.

I'm also noticing that no bug scrubs seem to have happened for quite a while now. I submitted a couple of tracker items and, usually, they would at least get assigned to a project within a week. This hasn't happened for months now and I'm wondering what the reason might be.
