Bug #15879
Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?)
Description
This is a spin-off of #13990, opened to track the issue separately.
We recently upgraded from Firefly 0.80.9 to Hammer 0.94.3. Since our upgrade three weeks ago, we have accumulated over 250GB of old OSD maps in the meta directory of each OSD. The OSD does not appear to be deleting old OSD maps. We generate substantially more OSD maps than most clusters because we delete RBD snapshots regularly. We have looked at the code around OSD map deletion in OSD.cc, and it doesn't look like anything in the OSD code ever moves the lower_bound of the map_cache forward to expire maps.
We have attempted lowering the map_cache_size on the OSDs, but this did not cause maps to expire and get deleted from disk. We have also tried restarting OSDs, restarting entire OSD nodes, and even marking OSDs out. Nothing seems to get the OSD to reset its map cache lower bound. Pretty soon we'll have to start taking OSDs completely out of the cluster, zapping the disks, and re-adding them to the cluster. We have over 100,000 OSD maps stored on each OSD and are using about 10% of our raw disk space to store these maps, so it's quickly becoming a serious space issue for us.
I am attaching an OSD debug 20 log that shows an iteration of handle_osd_map where, according to the superblock write, the OSD clearly holds far more than 500 epochs (oldest_map = 1106541, newest_map = 1248379), yet it never enters the loop that removes old osdmap epochs. We've set our map_cache_size back to the default (500), since lowering it to 250 didn't seem to kick-start any sort of cleanup.
If you need any other logs, or have ideas about how we could get the OSDs to start trimming OSD maps, it would be much appreciated.
Related issues
History
#1 Updated by Kefu Chai almost 8 years ago
- Copied from Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) added
#2 Updated by Kefu Chai almost 8 years ago
#3 Updated by Kefu Chai almost 8 years ago
- Status changed from Fix Under Review to Duplicate
This ticket was opened so we could have a 1:1 mapping between the backport PR and its master tracker issue.
Closing for now, as I think we'd better stick to #13990 and have a single backport PR for it.