Bug #15879 (closed)

Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?)

Added by Kefu Chai almost 8 years ago. Updated almost 8 years ago.

Status: Duplicate
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: hammer,jewel
Regression: Yes
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is a spin-off of #13990 to track the following issue:

We recently upgraded from firefly 0.80.9 to Hammer 0.94.3. Since our upgrade 3 weeks ago we have accumulated over 250GB of old OSD maps in the meta directory of each OSD. The OSD does not appear to be deleting old OSD maps. We generate substantially more OSD maps than most clusters because we delete RBD snapshots regularly. We have looked at the code around OSD map deletion in OSD.cc and it doesn't look like anything in the OSD code ever moves the lower_bound of the map_cache forward to expire maps.
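
To make the suspected failure mode concrete, below is a minimal standalone C++ sketch (not Ceph source; trim_upper_bound, mon_oldest_map, and cache_lower_bound are illustrative names) of how the trim decision appears to work: epochs are only removable below the smaller of the monitor's oldest required map and the map cache's lower bound, so if the lower bound never advances, nothing ever gets deleted.

    // Standalone illustration (not Ceph code): the OSD may only delete map
    // epochs strictly older than both the monitor's oldest required epoch
    // and the map cache's lower bound. If the cache lower bound never
    // advances, the trim loop has nothing to do no matter how many epochs
    // accumulate on disk.
    #include <algorithm>
    #include <cstdint>
    #include <iostream>

    using epoch_t = uint64_t;

    // First epoch that must be kept; everything below it is trimmable.
    epoch_t trim_upper_bound(epoch_t mon_oldest_map, epoch_t cache_lower_bound) {
        return std::min(mon_oldest_map, cache_lower_bound);
    }

    int main() {
        // oldest/newest values taken from the attached log; mon_oldest_map is
        // a hypothetical illustration, not a value observed on this cluster.
        epoch_t superblock_oldest = 1106541;  // oldest epoch stored on this OSD
        epoch_t superblock_newest = 1248379;  // newest epoch stored on this OSD
        epoch_t mon_oldest_map    = 1247800;  // hypothetical monitor floor
        epoch_t cache_lower_bound = 1106541;  // never advanced past the oldest epoch

        epoch_t bound = trim_upper_bound(mon_oldest_map, cache_lower_bound);
        std::cout << "epochs stored:    " << (superblock_newest - superblock_oldest) << "\n"
                  << "epochs trimmable: "
                  << (bound > superblock_oldest ? bound - superblock_oldest : 0) << "\n";
        // Trimmable comes out as 0: the cache lower bound still sits at the
        // oldest stored epoch, so the removal loop has nothing to delete.
        return 0;
    }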

We have attempted lowering the map_cache_size on the OSDs, but this did not cause maps to expire and be deleted from disk. We have tried restarting OSDs, restarting entire OSD nodes, and even marking OSDs out. Nothing seems to get the OSD to reset its map cache lower bound. Pretty soon we'll have to start taking OSDs completely out of the cluster, zapping the disks, and re-adding them to the cluster. We have over 100,000 OSD maps stored on each OSD and we're using about 10% of our raw disk space to store these maps, so it's quickly becoming a serious space issue for us.

I am attaching an OSD log at debug level 20 that shows an iteration of handle_osd_map where the OSD clearly has far more than 500 epochs (oldest_map = 1106541, newest_map = 1248379) according to the superblock write, but it never enters the loop that removes old osdmap epochs. We've set our map_cache_size back to the default (500), since lowering it to 250 didn't seem to kick-start any sort of cleanup.
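
For scale, and assuming both values come from the same superblock write shown in the log: 1248379 - 1106541 = 141838 stored epochs on that one OSD, more than 280 times the default map_cache_size of 500.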

If you need any other logs, or have ideas on how we could get the OSDs to start trimming OSD maps, that would be very much appreciated.


Files

grep.py (1.89 KB) - Kefu Chai, 01/05/2016 11:11 AM
ceph.log.bz2 (10.6 KB) - Log after 'ceph-deploy mon create-initial' - Steve Taylor, 01/08/2016 05:46 PM
ceph-osd.28.log.bz2 (124 KB) - Log from osd.28 showing new crash - Steve Taylor, 01/13/2016 10:56 PM
ceph-osd.28.log.bz2 (196 KB) - Steve Taylor, 01/14/2016 07:18 PM
ceph-mon.mon-eng-05-03.log.bz2 (435 KB) - Mon log showing osd map trimming - Steve Taylor, 01/19/2016 05:42 PM
ceph-osd.0.log.bz2 (818 KB) - Osd log showing osd map trimming - Steve Taylor, 01/19/2016 05:42 PM
ceph-osd.0.log.1.gz (91.9 KB) - osd.0 log - Steve Taylor, 02/10/2016 03:11 PM
newest-minus-oldest.png (14 KB) - Kefu Chai, 03/21/2016 08:21 AM
oldest.png (31.7 KB) - Kefu Chai, 03/21/2016 08:21 AM

Related issues (1): 0 open, 1 closed

Copied from Ceph - Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) - Resolved - Kefu Chai - 12/05/2015

Actions #1

Updated by Kefu Chai almost 8 years ago

  • Copied from Bug #13990: Hammer (0.94.3) OSD does not delete old OSD Maps in a timely fashion (maybe at all?) added
Actions #3

Updated by Kefu Chai almost 8 years ago

  • Status changed from Fix Under Review to Duplicate

This issue was opened so we could have a 1:1 mapping between the backport PR and its master tracker issue.

But I'm closing it for now, as I think we'd better stick to #13990 and have a single backport PR for it.
