
Bug #39555

backfill_toofull while OSDs are not full (Unnecessary HEALTH_ERR)

Added by Rene Diepstraten 4 months ago. Updated about 8 hours ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Backfill/Recovery
Target version:
-
Start date:
05/01/2019
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous, mimic, nautilus
Regression:
No
Severity:
2 - major
Reviewed:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:

Description

This week I ran into an issue where ceph reports HEALTH_ERR because pgs are backfill_toofull.
None of the OSDs are over the threshold of nearfull_ratio.

The problem was as far as I know first reported in the ceph users mailing list in January:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032534.html
But more posts can be found ( http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033569.html for example )

Currently, the cluster I'm working on is busy with a massive rebalance and the fullest osd is at 79% full.
There are still more pgs remapped to that osd but there are also pgs remapped away from this osd.

root@mon01:~# ceph osd df tree | sort -nk11 | tail -1 | awk '{print $11}'
79.33
root@mon01:~# ceph daemon mon.mon01 config get mon_osd_backfillfull_ratio
{
    "mon_osd_backfillfull_ratio": "0.900000" 
}

It appears to me that the monitors count all pgs that are remapped to the osd and, if adding them to the current usage would exceed the
backfillfull_ratio, report the backfill_toofull state. If this is indeed the case, they should in my opinion also take into account the pgs that are remapped away from the osd.
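The hypothesis above can be sketched as a toy model (this is an illustration of the reporter's reading, not actual Ceph code; the function name and the example sizes are invented for the sketch):

```python
# Toy model of the reporter's hypothesis (NOT Ceph internals): the check
# appears to add the size of every PG remapped *to* an OSD to its current
# usage, without subtracting PGs remapped *away*, before comparing the
# result against backfillfull_ratio.

BACKFILLFULL_RATIO = 0.90  # mon_osd_backfillfull_ratio from the output above

def projected_usage(used_bytes, capacity_bytes, incoming_pg_bytes,
                    outgoing_pg_bytes, credit_outgoing=False):
    """Return the projected usage ratio for an OSD during a rebalance."""
    projected = used_bytes + sum(incoming_pg_bytes)
    if credit_outgoing:  # what the reporter argues should also be considered
        projected -= sum(outgoing_pg_bytes)
    return projected / capacity_bytes

# An OSD at 79.33% of ~1 TiB, with PGs moving in both directions.
cap = 1 << 40
used = int(cap * 0.7933)
incoming = [40 << 30, 40 << 30, 40 << 30]  # three 40 GiB inbound PGs
outgoing = [40 << 30, 40 << 30]            # two 40 GiB outbound PGs

naive = projected_usage(used, cap, incoming, outgoing)
credited = projected_usage(used, cap, incoming, outgoing, credit_outgoing=True)
print(naive > BACKFILLFULL_RATIO)     # True  -> backfill_toofull reported
print(credited > BACKFILLFULL_RATIO)  # False -> no error if outgoing counted
```

With these numbers the naive projection lands at roughly 91%, above the 90% threshold, while crediting the outbound PGs keeps it near 83%.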

ceph-osd-df.txt View - ceph osd df (4 KB) Alex Cucu, 05/10/2019 12:05 PM

ceph-df.txt View - ceph df (629 Bytes) Alex Cucu, 05/10/2019 12:05 PM

crushmap.txt View (4.3 KB) Alex Cucu, 05/10/2019 12:10 PM

ceph-osd-tree.txt View (3.56 KB) Alex Cucu, 05/10/2019 12:11 PM

crushmap.txt View (7.44 KB) Rene Diepstraten, 05/21/2019 09:33 PM

ceph-osd-df-tree.txt View (8.77 KB) Rene Diepstraten, 05/21/2019 09:33 PM


Related issues

Related to RADOS - Bug #41255: backfill_toofull seen on cluster where the most full OSD is at 1% New 08/14/2019

History

#1 Updated by Greg Farnum 4 months ago

The OSD can't count PGs being evacuated as if they were gone because something could go wrong. So it's stuck seeing if the would-be-backfilling PG size plus what's already on disk is too large.

Do you think you're seeing behavior different from that?

#2 Updated by Wido den Hollander 4 months ago

Greg Farnum wrote:

The OSD can't count PGs being evacuated as if they were gone because something could go wrong. So it's stuck seeing if the would-be-backfilling PG size plus what's already on disk is too large.

Do you think you're seeing behavior different from that?

I assume this is new in Mimic, as this wouldn't happen with Luminous.

So you need to wait for other PGs to be evacuated from the OSD before it starts any other backfill.

#3 Updated by Rene Diepstraten 4 months ago

This may indeed be the case, but I'd expect that until pgs are evacuated, the state would be backfill_wait, not backfill_toofull.
The pg simply needs to wait until enough space is expected to be free to start backfilling.
With any larger rebalance on a cluster that is somewhat fuller, this will trigger HEALTH_ERR, thus causing quite a lot of monitoring systems to go off.
I have never seen this in luminous and I think this behaviour is confusing and undesirable.
If this is intended behaviour, this issue might need to be changed from bug to improvement.

#4 Updated by David Zafman 3 months ago

The backfill_toofull state is like backfill_wait except that it indicates the reason backfill cannot proceed right now. In the scenario where OSDs are lost and the PGs remap to OSDs that don't have enough capacity, the cluster might be stuck with PGs in backfill_toofull. This is why backfill_toofull should cause some kind of health condition. Unfortunately, we can't take into account PGs that are mapped away, because the order of backfill is unpredictable. It is also possible that a backfill_wait for a PG mapped away turns to backfill_toofull due to space utilization on that PG's target OSDs. Out of an abundance of caution the code errs on the side of triggering backfill_toofull, because there is a real danger if an OSD truly runs out of space.

I would say you are arguing here that a backfill_toofull or recovery_toofull should be a HEALTH_WARN instead of a HEALTH_ERR.

We didn't think of the case where this is a transient condition triggering HEALTH_ERR. But the cluster could be deadlocked until new disks are brought online or OSDs brought back, so HEALTH_ERR seems appropriate in that case.
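The ordering argument in comment #4 can be made concrete with toy numbers (an illustration of the reasoning, not Ceph code; all sizes are invented):

```python
# Worked example of the ordering argument: even if inbound and outbound PG
# sizes balance out exactly, the *order* in which backfills run decides
# whether the OSD transiently exceeds backfillfull_ratio. Crediting
# outgoing PGs up front assumes the outbound move happens first, which
# is not guaranteed.

RATIO = 0.90
cap = 100
used = 85
inbound_pg = 10   # PG backfilling onto this OSD
outbound_pg = 10  # PG backfilling away from this OSD

# Outbound completes first: usage dips before the inbound PG arrives.
peak_out_first = max(used - outbound_pg + inbound_pg, used)
# Inbound lands first: usage peaks before the outbound PG is removed.
peak_in_first = used + inbound_pg

print(peak_out_first / cap > RATIO)  # False: peaks at 85%
print(peak_in_first / cap > RATIO)   # True: peaks at 95%
```

Net usage is unchanged either way, but the unfavourable ordering would push the OSD past the threshold mid-flight, which is why the conservative check refuses to credit outbound PGs.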

#5 Updated by David Zafman 3 months ago

  • Assignee set to David Zafman

#6 Updated by Alex Cucu 3 months ago

I'm having this issue on Nautilus (14.2.1) but there is no way OSDs can be full as I'm not using more than 6.8% raw storage and there is no OSD with usage above 15%.

It happened after I added the 4th node.

#7 Updated by Erik Lindahl 3 months ago

Just to chime in: We too have seen this (on 13.2.5) on OSDs that are only 10-20% full.

It always (magically) clears a while later, so it's obviously something waiting that eventually gets a green light, but based on the behaviour there seems to be something wrong with how the threshold is calculated.

#8 Updated by David Zafman 3 months ago

Erik:
New code that estimates and reserves the final backfill space requirement is not present in v13.2.5. It isn't clear if what you are seeing in Mimic is related.

Alex:
  1. Is osd_max_backfills configured to something other than the default value of 1?
  2. Was the backfill full ratio changed using "ceph osd set-backfillfull-ratio ##"? What is the current value?
    You can't use conf variable mon_osd_backfillfull_ratio to adjust the backfillfull_ratio.
  3. Are only replicated or erasure coded pools seeing the backfill_toofull?
  4. Are you seeing scrub errors with large difference in "bytes"?
    cluster [ERR] #.# scrub : stat mismatch, got 10000/10000 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 7569408/6832128 bytes
    If bytes isn't fixed using "repair" an incorrect value would be used to determine backfill space requirements of a pg.

#9 Updated by David Zafman 3 months ago

Seeing a "ceph osd df" like Alex provided is helpful in determining what is going on. Looking at it repeatedly while there are PGs backfilling/backfill_toofull may provide clues.

#10 Updated by Rene Diepstraten 3 months ago

Here I can reproduce the issue with the ssd class.
We're in the process of reinstalling/redeploying (one host with all osds marked out, one reinstalled host with new osds added).
The ssds are used for .rgw.buckets.index (mainly omap keys). When adding one osd and marking out two osds, the issue occurs.
It quickly solves itself during rebalancing.
Added crushmap and ceph osd df tree as well.

#11 Updated by David Zafman 3 months ago

Rene, you have what looks more like an expected situation. With some OSDs showing as high as 72% utilization, a big rearrangement could easily attempt a backfill that can't fit until another PG is moved elsewhere.

Check over my list in https://tracker.ceph.com/issues/39555#note-8

#12 Updated by Rene Diepstraten 3 months ago

I know that, but the issue doesn't occur with those osds. The issue occurs with the ssds (checked with `ceph pg ls backfill_toofull`).
The backfill_toofull pgs are in the pool mapped to the class ssd osds, which are not full at all.

#13 Updated by David Zafman 3 months ago

  • Pull request ID set to 28204

#14 Updated by David Zafman 3 months ago

The pull request https://github.com/ceph/ceph/pull/28204 generates a warning with a better message.

health: HEALTH_WARN
Low space blocking backfill (add storage if this doesn't resolve itself): 4 pgs backfill_toofull

#15 Updated by David Zafman 3 months ago

  • Backport set to luminous, mimic, nautilus

#16 Updated by David Zafman 3 months ago

  • Status changed from New to In Progress

#17 Updated by Rene Diepstraten 3 months ago

Thanks for the PR.
The problem itself seems to be caused as follows:
- A backfill starts to a set of osds
- One of the up osds is marked down
- The backfill is set to state backfill_toofull

2019-05-25 08:38:27.040147 mon.mon02 mon.0 [2a05:xx::b20b]:6789/0 796775 : cluster [WRN] Health check update: 56277851/400915587 objects misplaced (14.037%) (OBJECT_MISPLACED)
2019-05-25 08:38:32.040856 mon.mon02 mon.0 [2a05:xx::b20b]:6789/0 796776 : cluster [WRN] Health check update: 56273116/400915587 objects misplaced (14.036%) (OBJECT_MISPLACED)
2019-05-25 08:38:33.961543 osd.326 osd.326 [2a05:xx::f130]:6809/151061 1558 : cluster [WRN] Monitor daemon marked osd.326 down, but it is still running
2019-05-25 08:38:35.404995 mon.mon02 mon.0 [2a05:xx::b20b]:6789/0 796782 : cluster [WRN] Health check failed: Degraded data redundancy: 55546/400915587 objects degraded (0.014%), 3 pgs degraded (PG_DEGRADED)
2019-05-25 08:38:37.041530 mon.mon02 mon.0 [2a05:xx::b20b]:6789/0 796783 : cluster [WRN] Health check update: 56088614/400915587 objects misplaced (13.990%) (OBJECT_MISPLACED)
2019-05-25 08:38:37.410998 mon.mon02 mon.0 [2a05:xx::b20b]:6789/0 796784 : cluster [ERR] Health check failed: Degraded data redundancy (low space): 1 pg backfill_toofull (PG_DEGRADED_FULL)

I've got multiple of these in the logs over the past incidents. Sometimes because I marked them down myself, sometimes because an osd is too busy and marked down.

Under luminous I've never seen this issue.
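The sequence in comment #17 can be sketched as a toy reconstruction (an assumption about the mechanism, not Ceph code; the function and all sizes are invented):

```python
# Toy reconstruction of the reported sequence: a PG sits in backfill_wait
# against one target OSD; when an up OSD is marked down, CRUSH picks a
# replacement target and the space check is re-run against it. A fuller
# replacement flips the PG to backfill_toofull even though no OSD in the
# cluster is anywhere near full.

RATIO = 0.90

def backfill_state(target_used, target_cap, pg_bytes):
    projected = (target_used + pg_bytes) / target_cap
    return "backfill_toofull" if projected > RATIO else "backfill_wait"

pg_bytes = 120
original_target = {"used": 700, "cap": 1000}     # 70% full: room to spare
replacement_target = {"used": 820, "cap": 1000}  # 82% full: 120 tips it over

print(backfill_state(original_target["used"],
                     original_target["cap"], pg_bytes))     # backfill_wait
print(backfill_state(replacement_target["used"],
                     replacement_target["cap"], pg_bytes))  # backfill_toofull
```

This would also be consistent with the upmap workaround below: forcing a re-peer against a different target makes the check pass again.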
I've also checked whether or not I could clear the backfill_toofull by using upmap to map one of the osds in the pg to another, and then removing the upmap entry again and this works:

pg     primary  acting       up           state
8.1db  55       [19,333,319] [55,19,43]   active+remapped+backfill_wait+backfill_toofull

Upmap to another osd:

ceph osd pg-upmap-items 8.1db 55 56

Wait until after peering and remove upmap entry again:

ceph osd rm-pg-upmap-items 8.1db

Now it's okay again:

pg     primary  acting       up           state
8.1db  55       [19,333,319] [55,19,43]   active+remapped+backfill_wait           

This strengthens my belief that this is actually a bug, not behaviour as described earlier by Greg Farnum.

#18 Updated by Rene Diepstraten 3 months ago

Note: I've only seen this in a relatively busy and full environment with quite a few backfills going on.

#19 Updated by Maks Kowalik 2 months ago

In our case this issue surfaces on a relatively empty cluster (top used OSDs are around 35%) during a small rebalance (introducing a new empty node with 4 OSDs).
Worth mentioning is that the placement groups reporting backfill_toofull status are not related to the new OSDs (the new IDs are 35, 36, 37, 66).

ceph pg ls backfill_toofull
PG     OBJECTS DEGRADED MISPLACED UNFOUND BYTES       LOG  STATE                                          STATE_STAMP                VERSION      REPORTED      UP                                         ACTING                                     SCRUB_STAMP                DEEP_SCRUB_STAMP           
14.6a     8842        0     26526       0 37025931264 3095 active+remapped+backfill_wait+backfill_toofull 2019-06-13 13:28:41.928664 72408'752455 72408:2271825    [83,135,20,111,139,53,52,145,151,32]p83    [83,34,20,111,139,53,152,145,151,55]p83 2019-06-12 11:15:08.155162 2019-06-05 14:09:19.500308 
14.e8     8644        0     34576       0 36186157091 3016 active+remapped+backfill_wait+backfill_toofull 2019-06-13 13:28:44.951664 72404'747983 72408:2304265 [138,143,54,72,154,149,111,134,146,32]p138 [138,143,111,140,154,32,47,134,146,59]p138 2019-06-12 15:28:55.513072 2019-06-12 15:28:55.513072 
14.196    8797        0     17594       0 36825878528 3024 active+remapped+backfill_wait+backfill_toofull 2019-06-13 13:28:44.751238 72404'751189 72407:2373715   [74,150,135,152,142,62,59,147,109,19]p74   [74,150,135,152,142,53,59,147,109,21]p74 2019-06-13 11:24:00.739785 2019-06-12 07:11:31.756319

ceph osd df|sort -nk 11|tail -5
 27   ssd 1.78139  1.00000 1.8 TiB 633 GiB 1.2 TiB 34.70 1.37 109 
  2   ssd 1.78139  1.00000 1.8 TiB 646 GiB 1.2 TiB 35.40 1.40 114 
 65   ssd 1.86209  1.00000 1.9 TiB 677 GiB 1.2 TiB 35.48 1.40 118 
 28   ssd 1.78139  1.00000 1.8 TiB 655 GiB 1.1 TiB 35.89 1.41 116 
 16   ssd 1.78139  1.00000 1.8 TiB 656 GiB 1.1 TiB 35.94 1.42 120

#20 Updated by Joao Eduardo Luis about 1 month ago

The pull request provided with the fix has been merged (https://github.com/ceph/ceph/pull/28204). Does anyone still see this happening since the merge?

@david: is this something we can close?

#21 Updated by Nathan Cutler about 1 month ago

  • Status changed from In Progress to Pending Backport

#22 Updated by Nathan Cutler about 1 month ago

  • Status changed from Pending Backport to In Progress

oops, reverting - I had not seen Joao's question

#23 Updated by Vikhyat Umrao 1 day ago

  • Related to Bug #41255: backfill_toofull seen on cluster where the most full OSD is at 1% added

#24 Updated by Vikhyat Umrao 1 day ago

Joao Eduardo Luis wrote:

The pull request provided with the fix has been merged (https://github.com/ceph/ceph/pull/28204). Does anyone still see this happening since the merge?

Looks like we have a new report: https://tracker.ceph.com/issues/41255 . I think HEALTH_ERR is one of the issues, but the main concern is that backfill_toofull should not be seen on a cluster where the most full OSD is at most 1% usage.

#25 Updated by David Zafman about 8 hours ago

  • Status changed from In Progress to Pending Backport
