Bug #37875
osdmaps aren't being cleaned up automatically on healthy cluster
Description
After doing an expansion from ~1,500 OSDs to ~1,900 OSDs on a Luminous 12.2.8 cluster using FileStore, I've noticed that osdmaps aren't being trimmed automatically:
# find /var/lib/ceph/osd/ceph-1719/current/meta -name 'osdmap*' | wc -l
42515
With an average size of ~700KB, this adds up to around 30GB/OSD:
# du -sh /var/lib/ceph/osd/ceph-1719/current/meta
30G  /var/lib/ceph/osd/ceph-1719/current/meta
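Quick sanity check on those numbers (back-of-the-envelope shell arithmetic only; the ~1,900 OSD count, the 42,515 map count and the ~700 KB average are the figures quoted above):

# ~42,515 maps * ~700 KB per map, per OSD and extrapolated to ~1,900 OSDs
echo "$(( 42515 * 700 / 1024 / 1024 )) GB per OSD"
echo "$(( 42515 * 700 * 1900 / 1024 / 1024 / 1024 )) TB cluster-wide"
# prints roughly 28 GB per OSD and ~52 TB across the whole cluster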
I remember something like this happening during the hammer days (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013060.html), and the same workaround works for trimming them slowly:
for i in {1653..1719}; do ceph osd crush reweight osd.$i 4.00001; sleep 4; ceph osd crush reweight osd.$i 4; sleep 4; done
History
#1 Updated by Dan van der Ster over 4 years ago
I just updated a cluster from v12.2.8 to 12.2.10.
At the beginning we had:
"oldest_map": 281368,
"newest_map": 295558,
During the upgrade, the oldest_map was being trimmed by ~30-60 each time newest_map incremented.
Once oldest_map was within ~500 of newest_map, the oldest_map stopped being trimmed at all.
So post upgrade we have:
"oldest_map": 295441,
"newest_map": 296138,
Next I restarted all the ceph-mons and started some osdmap churn (e.g. ceph osd pool set test min_size 2).
After this, the oldest_map is getting trimmed again!!!
My conclusion: ceph-mon has a bug related to the paxos trim (min/max) options, which default to ~500; once newest_map - oldest_map < 500, no trimming happens at all.
BTW, I guess this is the cause of mon stores growing unbounded until the mons are restarted. See this thread: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029423.html
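For anyone who wants to watch the same thing on their own cluster, a minimal sketch of how I'd inspect the trim settings and trim decisions on the leader (deriving the mon id from the hostname and using the admin socket are assumptions; adjust to your deployment):

# dump the trim-related options from the leader's admin socket
for opt in paxos_service_trim_min paxos_service_trim_max mon_min_osdmap_epochs; do
    ceph daemon mon.$(hostname -s) config get $opt
done
# temporarily raise paxos debugging so the maybe_trim decisions show up in the mon log
ceph daemon mon.$(hostname -s) config set debug_paxos 10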
#2 Updated by Dan van der Ster over 4 years ago
Hrm.. actually, after enabling debug_paxos=10 on the mon leader, I see that there's a hysteresis between 500 and 750:
2019-01-15 15:36:34.358681 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295636..296378) maybe_trim trim_to 295878 would only trim 242 < paxos_service_trim_min 250
2019-01-15 15:36:39.359499 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295636..296382) maybe_trim trim_to 295882 would only trim 246 < paxos_service_trim_min 250
2019-01-15 15:36:44.360434 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295636..296385) maybe_trim trim_to 295885 would only trim 249 < paxos_service_trim_min 250
2019-01-15 15:36:49.361568 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295636..296389) maybe_trim trimming to 295889, 253 states
2019-01-15 15:36:49.361588 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295636..296389) trim from 295636 to 295889
2019-01-15 15:36:54.505648 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295889..296391) maybe_trim trim_to 295891 would only trim 2 < paxos_service_trim_min 250
2019-01-15 15:36:59.506462 7f7ea9f11700 10 mon.cepherin-mon-7cb9b591e1@0(leader).paxosservice(osdmap 295889..296395) maybe_trim trim_to 295895 would only trim 6 < paxos_service_trim_min 250
So (at least in 12.2.10) the oldest_map should stay within 750 of the newest: the mon keeps at least mon_min_osdmap_epochs (500 by default) maps and only trims once at least paxos_service_trim_min (250) epochs are eligible, hence the 500-750 band. I'll watch this for a while.
#3 Updated by Greg Farnum over 4 years ago
- Project changed from Ceph to RADOS
Sounds like Dan's cluster is behaving as expected, but if there's any more info about Bryan's, let us know.
#4 Updated by Joachim Kraftmayer over 4 years ago
We saw the same behavior with Luminous and 2,500+ OSDs in late 2017. We analyzed it with Joao at the time, but he could not reproduce it. The problem occurred with Luminous at least three times in 2018. We worked around it by restarting the mon leader.
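For reference, a rough sketch of that workaround (the systemd unit name and the use of jq are assumptions; adjust to your deployment):

# find which mon is currently the leader...
ceph quorum_status | jq -r .quorum_leader_name
# ...then restart only that daemon, on the leader's host
systemctl restart ceph-mon@$(hostname -s)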
#5 Updated by Dan van der Ster about 4 years ago
Still ongoing here, with mimic too. On one 13.2.6 cluster we have this, for example:
ceph daemon osd.2 status
Mon Sep 23 16:26:32 2019
{
    "cluster_fsid": "eecca9ab-161c-474c-9521-0e5118612dbb",
    "osd_fsid": "968683e6-cc0c-495b-adb6-f9810fb90740",
    "whoami": 2,
    "state": "active",
    "oldest_map": 388403,
    "newest_map": 457449,
    "num_pgs": 190
}
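A small sketch for checking that gap on every OSD of a host at once, via the admin sockets (the socket glob and the jq usage are assumptions):

# print "osd.<id>  <number of retained osdmaps>" for each local OSD
for sock in /var/run/ceph/ceph-osd.*.asok; do
    ceph daemon $sock status | jq -r '"osd.\(.whoami)  \(.newest_map - .oldest_map)"'
done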
Does master or nautilus have some more proactive trimming of old osdmaps?
#6 Updated by Dan van der Ster almost 4 years ago
I may have found more about what's causing this.
I have a cluster with ~1600 uncleaned maps.
# ceph report | jq .osdmap_first_committed
112300
# ceph report | jq .osdmap_last_committed
113938
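The size of the retained range can also be computed in one go (jq assumed; ceph report prints its checksum to stderr, hence the redirect):

ceph report 2>/dev/null | jq '.osdmap_last_committed - .osdmap_first_committed'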
The osdmap maybe_trim() and get_trim_to() claim that the last clean epoch is 112327:
2019-11-14 16:00:47.994194 7ff909b55700 10 mon.cephgabe0@0(leader).paxosservice(osdmap 112300..113936) maybe_trim trim_to 112327 would only trim 27 < paxos_service_trim_min 250
2019-11-14 16:02:13.089240 7ff909b55700 10 mon.cephgabe0@0(leader).osd e113936 min_last_epoch_clean 112327
I found that this incorrect min clean epoch is coming from osd.706:
# ceph pg dump -f json-pretty
..
    "osd_epochs": [
        {
            "osd": 707,
            "epoch": 113936
        },
        {
            "osd": 706,
            "epoch": 112327
        },
        {
            "osd": 705,
            "epoch": 113936
        },
...
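A quick way to pull the laggard out directly (sketch only; the exact location of osd_epochs in the JSON differs between releases, so the path below is an assumption):

# report the OSD with the lowest epoch, i.e. the one holding min_last_epoch_clean back
ceph pg dump -f json 2>/dev/null | jq '.osd_epochs | min_by(.epoch)'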
Indeed osd.706 had failed quite some time ago (back in July). The cluster had recovered from that ages ago and is currently HEALTH_OK. But osd.706 is still down.
# ceph osd tree down
ID   CLASS WEIGHT     TYPE NAME                         STATUS REWEIGHT PRI-AFF
 -26       1965.44482 root default
-175       1965.44482     room 0513-R-0050
 -24        163.70998         rack RJ55
-169         54.56999             host p05151113760120
 706   hdd    2.72800                 osd.706             down        0 1.00000
I tried destroying the osd but it didn't remove that osd epoch from osd_epochs:
# ceph osd destroy 706 --yes-i-really-mean-it
destroyed osd.706
# ceph osd tree down
ID   CLASS WEIGHT     TYPE NAME                         STATUS    REWEIGHT PRI-AFF
 -26       1965.44482 root default
-175       1965.44482     room 0513-R-0050
 -24        163.70998         rack RJ55
-169         54.56999             host p05151113760120
 706   hdd    2.72800                 osd.706             destroyed        0 1.00000
# ceph pg dump -f json-pretty | grep 112327 -B1
            "osd": 706,
            "epoch": 112327
Next I restarted all the ceph-mons; right after the restart we have min_last_epoch_clean 0:
2019-11-14 16:29:45.270437 7f0ecadf2700 10 mon.cephgabe0@0(leader).osd e113938 min_last_epoch_clean 0
After 5 minutes it jumps to a reasonable clean number:
2019-11-14 16:29:50.275094 7f0ecadf2700 10 mon.cephgabe0@0(leader).osd e113938 min_last_epoch_clean 113937
And now the maps trim:
# ceph report | jq .osdmap_first_committed
113300
# ceph report | jq .osdmap_last_committed
113938
So I suspect that an old failed OSD pins min_last_epoch_clean at the osdmap epoch at which it failed, preventing any osdmaps from being trimmed for as long as that OSD remains in the map and down.
Any ideas how to fix this? Maybe OSDMonitor::get_min_last_epoch_clean() should return the current osdmap epoch if all PGs are active+clean?
#7 Updated by Dan van der Ster almost 4 years ago
I think this is the commit where this behavior started:
commit b5dd112ce220e97ec9406779ab8c20841580e5f8
Author: Kefu Chai <kchai@redhat.com>
Date:   Thu Apr 13 19:08:45 2017 +0800

    mon/OSDMonitor: use lec in OSDBeacon to trim osdmap

    Signed-off-by: Kefu Chai <kchai@redhat.com>
#8 Updated by Brad Hubbard over 3 years ago
https://github.com/ceph/ceph/pull/19076 is a possible solution to this issue.
#9 Updated by Greg Farnum over 3 years ago
- Duplicated by Bug #44419: ops stuck on "wait for new map" for no apparent reason added
#10 Updated by Nikola Ciprich over 3 years ago
Hi Greg, are you sure this ticket is a duplicate? In my case the cluster is healthy with no down OSDs, so it shouldn't be that problem.
#11 Updated by Bryan Stillwell over 3 years ago
This also affects Nautilus v14.2.6. On one of our clusters it caused multiple performance problems when we were rebooting our OSD nodes to get on the latest Linux kernel. I believe this was because every OSD in the cluster was temporarily overloaded trying to clean up 1,000+ osdmaps in parallel on each node reboot until the number of osdmaps dropped down to ~500.
#12 Updated by Nathan Cutler over 3 years ago
- Status changed from New to Duplicate
#13 Updated by Nathan Cutler over 3 years ago
- Duplicates Bug #45400: mon/OSDMonitor: maps not trimmed if osds are down added
#14 Updated by Nathan Cutler over 3 years ago
- Duplicated by deleted (Bug #44419: ops stuck on "wait for new map" for no apparent reason)
#15 Updated by Nathan Cutler over 3 years ago
nautilus backport tracked by https://tracker.ceph.com/issues/45402
#16 Updated by Kefu Chai about 3 years ago
- Copied to Bug #47290: osdmaps aren't being cleaned up automatically on healthy cluster added