Bug #16177 (Closed): leveldb horrendously slow

Added by Adam Tygart almost 8 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component (RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently ran into an issue using cephfs where loading pg data on osd start (on some lightning-fast ssds) took long enough for the osd to hit its suicide timeout. This caused cascading suicide timeouts across more than half of my ssds (used exclusively for metadata within cephfs).
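
For context, the suicide timeouts involved are ordinary config options, so raising them in ceph.conf can buy the osds more time at startup; it is only a stopgap, not a fix for the slow leveldb. A minimal sketch, assuming the Jewel-era option names (defaults of 150s and 180s respectively):

    [osd]
    # assumption: these are the thread suicide-timeout options being hit;
    # raising them only delays the suicide, it does not address the slowness
    osd_op_thread_suicide_timeout = 600
    filestore_op_thread_suicide_timeout = 600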

The mailing list suggested using ceph-objectstore-tool to export the pgs, delete them off the osd and re-import them. Preferably not in that order, but I digress.
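
For reference, a rough sketch of that workflow (the paths, osd id, and pgid are taken from this report and are illustrative only; exact flags vary a little between releases, and the osd must be stopped first):

    # export the pg from the stopped osd's object store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
        --journal-path /var/lib/ceph/osd/ceph-16/journal \
        --pgid 32.10c --op export --file /root/osd.16.32.10c.export

    # remove the suspect copy of the pg from that osd
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
        --journal-path /var/lib/ceph/osd/ceph-16/journal \
        --pgid 32.10c --op remove

    # re-import it (on this or another osd), then start the osd again
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
        --journal-path /var/lib/ceph/osd/ceph-16/journal \
        --op import --file /root/osd.16.32.10c.export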

Below are a complete archive of one osd (only 2GB of data compressed, 8GB uncompressed) and the export of a pg that took just over 3 days to extract (200MB of data).
http://people.cis.ksu.edu/~mozes/ceph-osd.16.tar.gz
http://people.cis.ksu.edu/~mozes/osd.16.32.10c.out

During the extraction, ceph-objectstore-tool pegged a single cpu for the duration. Something has to be wrong here; I cannot imagine an algorithm performing this poorly on a data structure that should fit entirely in main memory. We're now at almost day 5 of extracting slow (damaged?) pgs off 8 ssds.
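
In case it helps narrow down where the time is going, the running extraction can be sampled with standard Linux perf while it pegs the cpu; this is just a generic profiling sketch, nothing ceph-specific is assumed beyond the process name:

    # sample the running extraction to see whether the time is spent in
    # leveldb iteration/compaction or elsewhere
    perf top -p "$(pidof ceph-objectstore-tool)"

    # or record ~30 seconds and inspect offline
    perf record -g -p "$(pidof ceph-objectstore-tool)" -- sleep 30
    perf report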

My understanding is that leveldb can start to misbehave on objects with 300+ million keys (based on mailing list reports), but I don't believe I've got any objects with more than 1 million keys. What's more, those 300+ million keys reportedly took only about 8 hours to extract.
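
For what it's worth, a rough way to sanity-check per-object omap key counts on a stopped osd is to feed the objects listed by --op list back into list-omap; the exact object-spec JSON emitted by --op list varies by release, so treat this as a sketch rather than a verified procedure:

    # list objects in the slow pg and count omap keys per object,
    # printing the largest counts first
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
        --journal-path /var/lib/ceph/osd/ceph-16/journal \
        --pgid 32.10c --op list |
    while IFS= read -r obj; do
        keys=$(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
            --journal-path /var/lib/ceph/osd/ceph-16/journal \
            "$obj" list-omap | wc -l)
        echo "$keys $obj"
    done | sort -rn | head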

If I can provide any other information, I'd be glad to. More detail is available in the mailing list archives (http://thread.gmane.org/gmane.comp.file-systems.ceph.user/30016).
