Bug #37875: osdmaps aren't being cleaned up automatically on healthy cluster - RADOS - Ceph

Actions

Copy link

Bug #37875

closed

osdmaps aren't being cleaned up automatically on healthy cluster

Added by Bryan Stillwell over 5 years ago. Updated almost 4 years ago.

Status:

Duplicate

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

Community (user)

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

Ceph - v12.2.8

ceph-qa-suite:

Component(RADOS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

After doing an expansion from ~1,500 OSDs to ~1,900 OSDs on a Luminous 12.2.8 cluster using FileStore, I've noticed that osdmaps aren't being trimmed automatically:

find /var/lib/ceph/osd/ceph-1719/current/meta -name 'osdmap*' | wc -l
42515

With an average size of ~700KB, this adds up to around 30GB/OSD:

du -sh /var/lib/ceph/osd/ceph-1719/current/meta
30G /var/lib/ceph/osd/ceph-1719/current/meta

I remember something like this happening during the hammer days (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013060.html), and the same workaround works for trimming them slowly:

for i in {1653..1719}; do ceph osd crush reweight osd.$i 4.00001; sleep 4; ceph osd crush reweight osd.$i 4; sleep 4; done

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Dan van der Ster over 5 years ago

I just updated a cluster from v12.2.8 to 12.2.10.
At the beginning we had:

"oldest_map": 281368,
    "newest_map": 295558,

During the upgrade, the oldest_map was being trimmed by ~30-60 each time newest_map incremented.
Once oldest_map was within ~500 of newest_map, the oldest_map stopped being trimmed at all.

So post upgrade we have:

"oldest_map": 295441,
    "newest_map": 296138,

Next I restarted all ceph-mon's and started some osdmap churn (e.g. ceph osd set pool min_size test 2).
After this, the oldest_map is getting trimmed again!!!

My conclusion: the ceph-mon has a bug related to paxos (min/max) options, which are ~500 by default, and after the oldest - newest < 500, there are no more trimming.

BTW, I guess this is the cause of mon stores growing unbounded until mon's are restarted. See this thread: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029423.html

Actions

Copy link

Updated by Dan van der Ster over 5 years ago

Hrm.. actually, after enabling debug_paxos=10 on the mon leader, I see that there's a hysteresis between 500 and 750:

2019-01-15 15:36:34.358681 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295636..296378) maybe_trim trim_to 295878 would only trim 242 < paxos_service_trim_min 250
2019-01-15 15:36:39.359499 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295636..296382) maybe_trim trim_to 295882 would only trim 246 < paxos_service_trim_min 250
2019-01-15 15:36:44.360434 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295636..296385) maybe_trim trim_to 295885 would only trim 249 < paxos_service_trim_min 250
2019-01-15 15:36:49.361568 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295636..296389) maybe_trim trimming to 295889, 253 states
2019-01-15 15:36:49.361588 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295636..296389) trim from 295636 to 295889
2019-01-15 15:36:54.505648 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295889..296391) maybe_trim trim_to 295891 would only trim 2 < paxos_service_trim_min 250
2019-01-15 15:36:59.506462 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295889..296395) maybe_trim trim_to 295895 would only trim 6 < paxos_service_trim_min 250

So (at least in 12.2.10) the oldest_map should stay within 750 of the newest. I'll watch this awhile.

Actions

Copy link

Updated by Greg Farnum about 5 years ago

Project changed from Ceph to RADOS

Sounds like Dan's is behaving as expected, but if there's any more info about Bryan's let us know.

Actions

Copy link

Updated by Joachim Kraftmayer about 5 years ago

We have seen the same behavior with luminous and +2500 osds in late 2017. Analyzed with Joao then, but he could not reprodue it. The problem has occurred with Luminous at least three times in 2018. Solved it by the restart of the leader.

Actions

Copy link

Updated by Dan van der Ster over 4 years ago

Still ongoing here, with mimic too. On one 13.2.6 cluster we have this, for example:

ceph daemon osd.2 status                                                                                                                                                                                    Mon Sep 23 16:26:32 2019

{
    "cluster_fsid": "eecca9ab-161c-474c-9521-0e5118612dbb",
    "osd_fsid": "968683e6-cc0c-495b-adb6-f9810fb90740",
    "whoami": 2,
    "state": "active",
    "oldest_map": 388403,
    "newest_map": 457449,
    "num_pgs": 190
}

Does master or nautilus have some more proactive trimming of old osdmaps?

Actions

Copy link

Updated by Dan van der Ster over 4 years ago

I may have found more about what's causing this.

I have a cluster with ~1600 uncleaned maps.

# ceph report | jq .osdmap_first_committed
112300
# ceph report | jq .osdmap_last_committed
113938

The osdmap maybe_trim() and get_trim_to() claim that the last clean epoch is 112327:

2019-11-14 16:00:47.994194 7ff909b55700 10 mon.cephgabe0@0(leader).paxosservice(osdmap 112300..113936) maybe_trim trim_to 112327 would only trim 27 < paxos_service_trim_min 250
2019-11-14 16:02:13.089240 7ff909b55700 10 mon.cephgabe0@0(leader).osd e113936  min_last_epoch_clean 112327

I found that this incorrect min clean epoch is coming from osd.706:

# ceph pg dump -f json-pretty 
..
  "osd_epochs": [
    {
      "osd": 707,
      "epoch": 113936
    },
    {
      "osd": 706,
      "epoch": 112327
    },
    {
      "osd": 705,
      "epoch": 113936
    },
...

Indeed osd.706 had failed quite some time ago (back in July). The cluster had recovered from that ages ago and is currently HEALTH_OK. But osd.706 is still down.

# ceph osd tree down
ID   CLASS WEIGHT     TYPE NAME                        STATUS REWEIGHT PRI-AFF 
 -26       1965.44482 root default                                             
-175       1965.44482     room 0513-R-0050                                     
 -24        163.70998         rack RJ55                                        
-169         54.56999             host p05151113760120                         
 706   hdd    2.72800                 osd.706            down        0 1.00000

I tried destroying the osd but it didn't remove that osd epoch from osd_epochs:

# ceph osd destroy 706 --yes-i-really-mean-it
destroyed osd.706
# ceph osd tree down
ID   CLASS WEIGHT     TYPE NAME                        STATUS    REWEIGHT PRI-AFF 
 -26       1965.44482 root default                                                
-175       1965.44482     room 0513-R-0050                                        
 -24        163.70998         rack RJ55                                           
-169         54.56999             host p05151113760120                            
 706   hdd    2.72800                 osd.706          destroyed        0 1.00000 

# ceph pg dump -f json-pretty | grep 112327 -B1
            "osd": 706,
            "epoch": 112327

Now I restarted all the ceph-mons, then we have min_last_epoch_clean 0:

2019-11-14 16:29:45.270437 7f0ecadf2700 10 mon.cephgabe0@0(leader).osd e113938  min_last_epoch_clean 0

After 5 minutes it jumps to a reasonable clean number:

2019-11-14 16:29:50.275094 7f0ecadf2700 10 mon.cephgabe0@0(leader).osd e113938  min_last_epoch_clean 113937

And now the maps trim:

# ceph report | jq .osdmap_first_committed
113300
# ceph report | jq .osdmap_last_committed
113938

So I suspect that old failed osds are keeping the min_last_epoch_clean back at the osdmap epoch when that osd failed, preventing any osdmaps from getting trimmed while that old osd still exists/is down.

Any ideas how to fix this? Maybe OSDMonitor::get_min_last_epoch_clean() should return the current osdmap epoch if all PGs are active+clean?

Actions

Copy link

Updated by Dan van der Ster over 4 years ago

I think this is when this started:

commit b5dd112ce220e97ec9406779ab8c20841580e5f8
Author: Kefu Chai <kchai@redhat.com>
Date:   Thu Apr 13 19:08:45 2017 +0800

    mon/OSDMonitor: use lec in OSDBeacon to trim osdmap

    Signed-off-by: Kefu Chai <kchai@redhat.com>

Actions

Copy link

Copied to Bug #47290: osdmaps aren't being cleaned up automatically on healthy cluster added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » RADOS

Custom queries

Bug #37875

osdmaps aren't being cleaned up automatically on healthy cluster

Updated by Dan van der Ster over 5 years ago

Updated by Dan van der Ster over 5 years ago

Updated by Greg Farnum about 5 years ago

Updated by Joachim Kraftmayer about 5 years ago

Updated by Dan van der Ster over 4 years ago

Updated by Dan van der Ster over 4 years ago

Updated by Dan van der Ster over 4 years ago

Updated by Brad Hubbard about 4 years ago

Updated by Greg Farnum about 4 years ago

Updated by Nikola Ciprich about 4 years ago

Updated by Bryan Stillwell about 4 years ago

Updated by Nathan Cutler almost 4 years ago

Updated by Nathan Cutler almost 4 years ago

Updated by Nathan Cutler almost 4 years ago

Updated by Nathan Cutler almost 4 years ago

Updated by Kefu Chai over 3 years ago