Bug #15879: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?)

Added by Kefu Chai almost 8 years ago. Updated almost 8 years ago.

Status: Duplicate
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: hammer,jewel
Regression: Yes
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is a spin-off of #13990, created to track the following issue:

We recently upgraded from Firefly 0.80.9 to Hammer 0.94.3. Since our upgrade three weeks ago we have accumulated over 250 GB of old OSD maps in the meta directory of each OSD. The OSD does not appear to be deleting old OSD maps. We generate substantially more OSD maps than most clusters because we delete RBD snapshots regularly. We have looked at the code around OSD map deletion in OSD.cc, and it doesn't look like anything in the OSD code ever moves the lower bound of the map_cache forward to expire maps.
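
For reference, below is a minimal C++ sketch of the kind of bounded trim the reporter expected to see: advance the superblock's oldest_map toward newest_map minus the cache size, and delete the on-disk epochs that fall below the new lower bound. This is an illustration only, not the actual OSD.cc code; the Superblock, MapStore, and trim_old_maps names here are hypothetical stand-ins.

    // Illustrative sketch only, NOT the actual OSD.cc code. It shows the kind
    // of bounded trim the reporter expected handle_osd_map to perform: advance
    // oldest_map toward (newest_map - cache_size) and delete the on-disk
    // epochs that fall below the new lower bound.
    #include <cstdint>
    #include <iostream>

    using epoch_t = uint32_t;

    struct Superblock {
      epoch_t oldest_map;   // oldest full map still kept on disk
      epoch_t newest_map;   // most recently received map epoch
    };

    // Hypothetical stand-in for the OSD's object store.
    struct MapStore {
      void remove_map(epoch_t e) {
        std::cout << "removing osdmap epoch " << e << "\n";
      }
      void write_superblock(const Superblock& sb) {
        std::cout << "superblock: oldest_map=" << sb.oldest_map
                  << " newest_map=" << sb.newest_map << "\n";
      }
    };

    // Trim at most max_per_pass epochs per call so a huge backlog is worked
    // off incrementally rather than in one giant transaction.
    void trim_old_maps(Superblock& sb, MapStore& store,
                       epoch_t cache_size, epoch_t max_per_pass = 30) {
      if (sb.newest_map <= cache_size)
        return;                                    // nothing old enough to trim
      epoch_t bound = sb.newest_map - cache_size;  // keep the last cache_size epochs
      epoch_t trimmed = 0;
      while (sb.oldest_map < bound && trimmed < max_per_pass) {
        store.remove_map(sb.oldest_map);
        ++sb.oldest_map;                           // advance the lower bound
        ++trimmed;
      }
      store.write_superblock(sb);                  // persist the new oldest_map
    }

    int main() {
      Superblock sb{1106541, 1248379};             // epochs from the attached log
      MapStore store;
      trim_old_maps(sb, store, 500);               // default-sized cache
      return 0;
    }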

We have tried lowering the map_cache_size on the OSDs, but this did not cause maps to expire and be deleted from disk. We have tried restarting OSDs, restarting entire OSD nodes, and even marking OSDs out. Nothing seems to get the OSD to reset its map cache lower bound. Pretty soon we'll have to start taking OSDs completely out of the cluster, zapping the disks, and re-adding them to the cluster. We have over 100,000 OSD maps stored on each OSD, and we're using about 10% of our raw disk space to store them, so it's quickly becoming a serious space issue for us.

I am attaching an OSD log at debug level 20 that shows an iteration of handle_osd_map where the OSD clearly has far more than 500 epochs on disk (oldest_map = 1106541, newest_map = 1248379) according to the superblock write, yet it never enters the loop that removes old osdmap epochs. We have set our map_cache_size back to the default (500), since lowering it to 250 didn't seem to kick-start any sort of cleanup.
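
As a quick sanity check on those numbers, the gap between the two epochs in that superblock is far beyond any 250 or 500 epoch cache bound, so the trim branch sketched above should clearly be taken. The small program below is just that arithmetic written out; the epoch values are copied from the log snippet and the cache size is the default mentioned above.

    // Arithmetic check on the epochs reported in the attached log.
    #include <cstdint>
    #include <iostream>

    int main() {
      const uint32_t oldest_map = 1106541;   // from the superblock in the log
      const uint32_t newest_map = 1248379;
      const uint32_t cache_size = 500;       // default map cache size

      uint32_t retained = newest_map - oldest_map;               // 141838 epochs on disk
      uint32_t excess   = retained > cache_size ? retained - cache_size : 0;

      std::cout << "retained epochs: " << retained
                << ", excess over cache: " << excess << "\n";
      return 0;
    }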

If you need any other logs, or have any ideas on how we could get the OSDs to start trimming OSD maps, it would be very much appreciated.

grep.py (1.89 KB) Kefu Chai, 01/05/2016 11:11 AM

ceph.log.bz2 - Log after 'ceph-deploy mon create-initial' (10.6 KB) Steve Taylor, 01/08/2016 05:46 PM

ceph-osd.28.log.bz2 - Log from osd.28 showing new crash (124 KB) Steve Taylor, 01/13/2016 10:56 PM

ceph-osd.28.log.bz2 (196 KB) Steve Taylor, 01/14/2016 07:18 PM

ceph-mon.mon-eng-05-03.log.bz2 - Mon log showing osd map trimming (435 KB) Steve Taylor, 01/19/2016 05:42 PM

ceph-osd.0.log.bz2 - Osd log showing osd map trimming (818 KB) Steve Taylor, 01/19/2016 05:42 PM

ceph-osd.0.log.1.gz - osd.0 log (91.9 KB) Steve Taylor, 02/10/2016 03:11 PM

newest-minus-oldest.png (14 KB) Kefu Chai, 03/21/2016 08:21 AM

oldest.png (31.7 KB) Kefu Chai, 03/21/2016 08:21 AM


Related issues

Copied from Ceph - Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) Resolved 12/05/2015

History

#1 Updated by Kefu Chai almost 8 years ago

  • Copied from Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) added

#3 Updated by Kefu Chai almost 8 years ago

  • Status changed from Fix Under Review to Duplicate

This issue was opened so we could have a 1:1 mapping between the backport PR and its master tracker issue.

But I am closing it for now, as I think we'd better stick to #13990 and have a single backport PR for it.
