
Bug #39555

backfill_toofull while OSDs are not full (Unnecessary HEALTH_ERR)

Added by Rene Diepstraten 4 months ago. Updated about 8 hours ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Backfill/Recovery
Target version:
-
Start date:
05/01/2019
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous, mimic, nautilus
Regression:
No
Severity:
2 - major
Reviewed:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:

Description

This week I ran into an issue where ceph reports HEALTH_ERR because pgs are backfill_toofull.
None of the OSDs are over the threshold of nearfull_ratio.

The problem was as far as I know first reported in the ceph users mailing list in January:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032534.html
But more posts can be found ( http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033569.html for example )

Currently, the cluster I'm working on is busy with a massive rebalance and the fullest osd is at 79% full.
There are still more pgs remapped to that osd but there are also pgs remapped away from this osd.

root@mon01:~# ceph osd df tree | sort -nk11 | tail -1 | awk '{print $11}'
79.33
root@mon01:~# ceph daemon mon.mon01 config get mon_osd_backfillfull_ratio
{
    "mon_osd_backfillfull_ratio": "0.900000" 
}

It appears to me that the monitors count all pgs that are remapped to the osd and, if adding them to the current usage would exceed the
backfillfull_ratio, report the backfill_toofull state. If this is indeed the case, they should in my opinion also take into account the pgs that are remapped away from the osd.
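The hypothesis above can be sketched as a toy model (this is an illustration of the reporter's reading, not actual Ceph code; the function name and the example sizes are invented for the sketch):

```python
# Toy model of the reporter's hypothesis (NOT Ceph internals): the check
# appears to add the size of every PG remapped *to* an OSD to its current
# usage, without subtracting PGs remapped *away*, before comparing the
# result against backfillfull_ratio.

BACKFILLFULL_RATIO = 0.90  # mon_osd_backfillfull_ratio from the output above

def projected_usage(used_bytes, capacity_bytes, incoming_pg_bytes,
                    outgoing_pg_bytes, credit_outgoing=False):
    """Return the projected usage ratio for an OSD during a rebalance."""
    projected = used_bytes + sum(incoming_pg_bytes)
    if credit_outgoing:  # what the reporter argues should also be considered
        projected -= sum(outgoing_pg_bytes)
    return projected / capacity_bytes

# An OSD at 79.33% of ~1 TiB, with PGs moving in both directions.
cap = 1 << 40
used = int(cap * 0.7933)
incoming = [40 << 30, 40 << 30, 40 << 30]  # three 40 GiB inbound PGs
outgoing = [40 << 30, 40 << 30]            # two 40 GiB outbound PGs

naive = projected_usage(used, cap, incoming, outgoing)
credited = projected_usage(used, cap, incoming, outgoing, credit_outgoing=True)
print(naive > BACKFILLFULL_RATIO)     # True  -> backfill_toofull reported
print(credited > BACKFILLFULL_RATIO)  # False -> no error if outgoing counted
```

With these numbers the naive projection lands at roughly 91%, above the 90% threshold, while crediting the outbound PGs keeps it near 83%.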

ceph-osd-df.txt View - ceph osd df (4 KB) Alex Cucu, 05/10/2019 12:05 PM

ceph-df.txt View - ceph df (629 Bytes) Alex Cucu, 05/10/2019 12:05 PM

crushmap.txt View (4.3 KB) Alex Cucu, 05/10/2019 12:10 PM

ceph-osd-tree.txt View (3.56 KB) Alex Cucu, 05/10/2019 12:11 PM

crushmap.txt View (7.44 KB) Rene Diepstraten, 05/21/2019 09:33 PM

ceph-osd-df-tree.txt View (8.77 KB) Rene Diepstraten, 05/21/2019 09:33 PM


Related issues

Related to RADOS - Bug #41255: backfill_toofull seen on cluster where the most full OSD is at 1% New 08/14/2019

History

#1 Updated by Greg Farnum 4 months ago

The OSD can't count PGs being evacuated as if they were gone because something could go wrong. So it's stuck seeing if the would-be-backfilling PG size plus what's already on disk is too large.

Do you think you're seeing behavior different from that?

#2 Updated by Wido den Hollander 4 months ago

Greg Farnum wrote:

The OSD can't count PGs being evacuated as if they were gone because something could go wrong. So it's stuck seeing if the would-be-backfilling PG size plus what's already on disk is too large.

Do you think you're seeing behavior different from that?

I assume this is new in Mimic, as this wouldn't happen with Luminous.

So you need to wait for other PGs to be evacuated from the OSD before it starts any other backfill.

#3 Updated by Rene Diepstraten 4 months ago

This may indeed be the case, but I'd expect that until pgs are evacuated, the state would be backfill_wait, not backfill_toofull.
The pg simply needs to wait until enough space is expected to be free to start backfilling.
With any larger rebalance on a cluster that is somewhat fuller, this will trigger HEALTH_ERR, thus causing quite a lot of monitoring systems to go off.
I have never seen this in luminous and I think this behaviour is confusing and undesirable.
If this is intended behaviour, this issue might need to be changed from bug to improvement.

#4 Updated by David Zafman 3 months ago

The backfill_toofull state is like backfill_wait except that it indicates the reason backfill cannot proceed right now. In the scenario where OSDs are lost and the PGs remap to OSDs that don't have enough capacity, the cluster might be stuck with PGs in backfill_toofull. This is why backfill_toofull should cause some kind of health condition. Unfortunately, we can't take into account PGs that are mapped away, because the order of backfill is unpredictable. It is also possible that a backfill_wait for a PG mapped away turns to backfill_toofull due to space utilization on that PG's target OSDs. Out of an abundance of caution the code errs on the side of triggering backfill_toofull, because there is a real danger if an OSD truly runs out of space.

I would say you are arguing here that a backfill_toofull or recovery_toofull should be a HEALTH_WARN instead of a HEALTH_ERR.

We didn't think of the case where this is a transient condition triggering HEALTH_ERR. But the cluster could be deadlocked until new disks are brought online or OSDs brought back, so HEALTH_ERR seems appropriate in that case.
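The ordering argument in comment #4 can be made concrete with toy numbers (an illustration of the reasoning, not Ceph code; all sizes are invented):

```python
# Worked example of the ordering argument: even if inbound and outbound PG
# sizes balance out exactly, the *order* in which backfills run decides
# whether the OSD transiently exceeds backfillfull_ratio. Crediting
# outgoing PGs up front assumes the outbound move happens first, which
# is not guaranteed.

RATIO = 0.90
cap = 100
used = 85
inbound_pg = 10   # PG backfilling onto this OSD
outbound_pg = 10  # PG backfilling away from this OSD

# Outbound completes first: usage dips before the inbound PG arrives.
peak_out_first = max(used - outbound_pg + inbound_pg, used)
# Inbound lands first: usage peaks before the outbound PG is removed.
peak_in_first = used + inbound_pg

print(peak_out_first / cap > RATIO)  # False: peaks at 85%
print(peak_in_first / cap > RATIO)   # True: peaks at 95%
```

Net usage is unchanged either way, but the unfavourable ordering would push the OSD past the threshold mid-flight, which is why the conservative check refuses to credit outbound PGs.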

#5 Updated by David Zafman 3 months ago

  • Assignee set to David Zafman

#6 Updated by Alex Cucu 3 months ago

I'm having this issue on Nautilus (14.2.1) but there is no way OSDs can be full as I'm not using more than 6.8% raw storage and there is no OSD with usage above 15%.

It happened after I added the 4th node.

#7 Updated by Erik Lindahl 3 months ago

Just to chime in: We too have seen this (on 13.2.5) on OSDs that are only 10-20% full.

It always (magically) clears a while later, so it's obviously something waiting that eventually gets a green light, but based on the behaviour there seems to be something wrong with how the threshold is calculated.

#8 Updated by David Zafman 3 months ago

Erik:
New code that estimates and reserves the final backfill space requirement is not present in v13.2.5. It isn't clear if what you are seeing in Mimic is related.

Alex:
  1. Is osd_max_backfills configured to something other than the default value of 1?
  2. Was the backfill full ratio changed using "ceph osd set-backfillfull-ratio ##"? What is the current value?
    You can't use conf variable mon_osd_backfillfull_ratio to adjust the backfillfull_ratio.
  3. Are only replicated or erasure coded pools seeing the backfill_toofull?
  4. Are you seeing scrub errors with large difference in "bytes"?
    cluster [ERR] #.# scrub : stat mismatch, got 10000/10000 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 7569408/6832128 bytes
    If bytes isn't fixed using "repair" an incorrect value would be used to determine backfill space requirements of a pg.

#9 Updated by David Zafman 3 months ago

Seeing a "ceph osd df" like Alex provided is helpful in determining what is going on. Looking at it repeatedly while there are PGs backfilling/backfill_toofull may provide clues.

#10 Updated by Rene Diepstraten 3 months ago

Here I can reproduce the issue with the ssd class.
We're in the process of reinstalling/redeploying (one host with all osds marked out, one reinstalled host with new osds added).
The ssds are used for .rgw.buckets.index (mainly omap keys). When adding one osd and marking out two osds, the issue occurs.
It quickly solves itself during rebalancing.
Added crushmap and ceph osd df tree as well.

#11 Updated by David Zafman 3 months ago

Rene, you have what looks more like an expected situation. With some OSDs showing as high as 72% utilization, a big rearrangement could easily attempt a backfill that can't fit until another PG is moved elsewhere.

Check over my list in https://tracker.ceph.com/issues/39555#note-8

#12 Updated by Rene Diepstraten 3 months ago

I know that, but the issue doesn't occur with those osds. The issue occurs with the ssds (checked with `ceph pg ls backfill_toofull`).
The backfill_toofull pgs are in the pool mapped to the class ssd osds, which are not full at all.

#13 Updated by David Zafman 3 months ago

  • Pull request ID set to 28204

#14 Updated by David Zafman 3 months ago

The pull request https://github.com/ceph/ceph/pull/28204 generates a warning with a better message.

health: HEALTH_WARN
Low space blocking backfill (add storage if this doesn't resolve itself): 4 pgs backfill_toofull

#15 Updated by David Zafman 3 months ago

  • Backport set to luminous, mimic, nautilus

#16 Updated by David Zafman 3 months ago

  • Status changed from New to In Progress

#17 Updated by Rene Diepstraten 3 months ago

Thanks for the PR.
The problem itself seems to be caused as follows:
- A backfill starts to a set of osds
- One of the up osds is marked down
- The backfill is set to state backfill_toofull

2019-05-25 08:38:27.040147 mon.mon02 mon.0 [2a05:xx::b20b]:6789/0 796775 : cluster [WRN] Health check update: 56277851/400915587 objects misplaced (14.037%) (OBJECT_MISPLACED)
2019-05-25 08:38:32.040856 mon.mon02 mon.0 [2a05:xx::b20b]:6789/0 796776 : cluster [WRN] Health check update: 56273116/400915587 objects misplaced (14.036%) (OBJECT_MISPLACED)
2019-05-25 08:38:33.961543 osd.326 osd.326 [2a05:xx::f130]:6809/151061 1558 : cluster [WRN] Monitor daemon marked osd.326 down, but it is still running
2019-05-25 08:38:35.404995 mon.mon02 mon.0 [2a05:xx::b20b]:6789/0 796782 : cluster [WRN] Health check failed: Degraded data redundancy: 55546/400915587 objects degraded (0.014%), 3 pgs degraded (PG_DEGRADED)
2019-05-25 08:38:37.041530 mon.mon02 mon.0 [2a05:xx::b20b]:6789/0 796783 : cluster [WRN] Health check update: 56088614/400915587 objects misplaced (13.990%) (OBJECT_MISPLACED)
2019-05-25 08:38:37.410998 mon.mon02 mon.0 [2a05:xx::b20b]:6789/0 796784 : cluster [ERR] Health check failed: Degraded data redundancy (low space): 1 pg backfill_toofull (PG_DEGRADED_FULL)

I've got multiple of these in the logs over the past incidents. Sometimes because I marked them down myself, sometimes because an osd is too busy and marked down.

Under luminous I've never seen this issue.
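The sequence in comment #17 can be sketched as a toy reconstruction (an assumption about the mechanism, not Ceph code; the function and all sizes are invented):

```python
# Toy reconstruction of the reported sequence: a PG sits in backfill_wait
# against one target OSD; when an up OSD is marked down, CRUSH picks a
# replacement target and the space check is re-run against it. A fuller
# replacement flips the PG to backfill_toofull even though no OSD in the
# cluster is anywhere near full.

RATIO = 0.90

def backfill_state(target_used, target_cap, pg_bytes):
    projected = (target_used + pg_bytes) / target_cap
    return "backfill_toofull" if projected > RATIO else "backfill_wait"

pg_bytes = 120
original_target = {"used": 700, "cap": 1000}     # 70% full: room to spare
replacement_target = {"used": 820, "cap": 1000}  # 82% full: 120 tips it over

print(backfill_state(original_target["used"],
                     original_target["cap"], pg_bytes))     # backfill_wait
print(backfill_state(replacement_target["used"],
                     replacement_target["cap"], pg_bytes))  # backfill_toofull
```

This would also be consistent with the upmap workaround below: forcing a re-peer against a different target makes the check pass again.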
I've also checked whether or not I could clear the backfill_toofull by using upmap to map one of the osds in the pg to another, and then removing the upmap entry again and this works:

pg     primary  acting       up           state
8.1db  55       [19,333,319] [55,19,43]   active+remapped+backfill_wait+backfill_toofull

Upmap to another osd:

ceph osd pg-upmap-items 8.1db 55 56

Wait until after peering and remove upmap entry again:

ceph osd rm-pg-upmap-items 8.1db

Now it's okay again:

pg     primary  acting       up           state
8.1db  55       [19,333,319] [55,19,43]   active+remapped+backfill_wait           

This strengthens my belief that this is actually a bug, not behaviour as described earlier by Greg Farnum.

#18 Updated by Rene Diepstraten 3 months ago

Note: I've only seen this in a relatively busy and full environment with quite a few backfills going on.

#19 Updated by Maks Kowalik 2 months ago

In our case this issue surfaces on a relatively empty cluster (top used OSDs are around 35%) during a small rebalance (introducing a new empty node with 4 OSDs).
Worth mentioning is that the placement groups reporting backfill_toofull status are not related to the new OSDs (the new IDs are 35, 36, 37, 66).

ceph pg ls backfill_toofull
PG     OBJECTS DEGRADED MISPLACED UNFOUND BYTES       LOG  STATE                                          STATE_STAMP                VERSION      REPORTED      UP                                         ACTING                                     SCRUB_STAMP                DEEP_SCRUB_STAMP           
14.6a     8842        0     26526       0 37025931264 3095 active+remapped+backfill_wait+backfill_toofull 2019-06-13 13:28:41.928664 72408'752455 72408:2271825    [83,135,20,111,139,53,52,145,151,32]p83    [83,34,20,111,139,53,152,145,151,55]p83 2019-06-12 11:15:08.155162 2019-06-05 14:09:19.500308 
14.e8     8644        0     34576       0 36186157091 3016 active+remapped+backfill_wait+backfill_toofull 2019-06-13 13:28:44.951664 72404'747983 72408:2304265 [138,143,54,72,154,149,111,134,146,32]p138 [138,143,111,140,154,32,47,134,146,59]p138 2019-06-12 15:28:55.513072 2019-06-12 15:28:55.513072 
14.196    8797        0     17594       0 36825878528 3024 active+remapped+backfill_wait+backfill_toofull 2019-06-13 13:28:44.751238 72404'751189 72407:2373715   [74,150,135,152,142,62,59,147,109,19]p74   [74,150,135,152,142,53,59,147,109,21]p74 2019-06-13 11:24:00.739785 2019-06-12 07:11:31.756319

ceph osd df|sort -nk 11|tail -5
 27   ssd 1.78139  1.00000 1.8 TiB 633 GiB 1.2 TiB 34.70 1.37 109 
  2   ssd 1.78139  1.00000 1.8 TiB 646 GiB 1.2 TiB 35.40 1.40 114 
 65   ssd 1.86209  1.00000 1.9 TiB 677 GiB 1.2 TiB 35.48 1.40 118 
 28   ssd 1.78139  1.00000 1.8 TiB 655 GiB 1.1 TiB 35.89 1.41 116 
 16   ssd 1.78139  1.00000 1.8 TiB 656 GiB 1.1 TiB 35.94 1.42 120

#20 Updated by Joao Eduardo Luis about 1 month ago

The pull request provided with the fix has been merged (https://github.com/ceph/ceph/pull/28204). Does anyone still see this happening since the merge?

@david: is this something we can close?

#21 Updated by Nathan Cutler about 1 month ago

  • Status changed from In Progress to Pending Backport

#22 Updated by Nathan Cutler about 1 month ago

  • Status changed from Pending Backport to In Progress

oops, reverting - I had not seen Joao's question

#23 Updated by Vikhyat Umrao 1 day ago

  • Related to Bug #41255: backfill_toofull seen on cluster where the most full OSD is at 1% added

#24 Updated by Vikhyat Umrao 1 day ago

Joao Eduardo Luis wrote:

The pull request provided with the fix has been merged (https://github.com/ceph/ceph/pull/28204). Does anyone still see this happening since the merge?

Looks like we have a new report: https://tracker.ceph.com/issues/41255 . I think HEALTH_ERR is one of the issues, but the main concern is that backfill_toofull should not be seen on a cluster where the most full OSD is at most 1% usage.

#25 Updated by David Zafman about 8 hours ago

  • Status changed from In Progress to Pending Backport
