Feature #47295

Optimize ceph-volume inventory to reduce runtime

Added by Paul Cuzner 15 days ago. Updated 11 days ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Target version:
% Done:

0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

The inventory process currently relies on repeated subprocess invocations, which are expensive. On my test system (16 drives), the inventory command issued over 160 subprocess calls and took 7 seconds to complete.

The goal of this feature is to optimise how the data is gathered, reducing that overhead and, in turn, the runtime seen by the user/caller.


Related issues

Related to ceph-volume - Bug #37490: ceph-volume lvm list is O(n^2) Resolved 11/30/2018

History

#1 Updated by Paul Cuzner 15 days ago

There are a couple of things that impact the runtime that I need some background on.

For every block device:
  • we run ceph-bluestore-tool, but bluestore is configured on LVs, so every command fails anyway
  • we query for the first LV, but during an inventory we pass the physical device rather than the vg/lv, so this also returns nothing and soaks up time

I've batched up some of the lsblk and pvs commands and skipped the two scenarios above; on my test system this brings the runtime of an inventory down from >7s to ~3s (16 devices).
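The batching idea can be sketched roughly like this — a minimal illustration, not the actual ceph-volume code (`parse_lsblk_json` and `batched_lsblk` are hypothetical helper names): instead of shelling out to lsblk once per device, a single call with JSON output reports every device at once, and the result is indexed for cheap lookups.

```python
import json
import subprocess


def parse_lsblk_json(raw):
    """Index `lsblk -J` output by device name for O(1) lookups."""
    data = json.loads(raw)
    return {dev["name"]: dev for dev in data.get("blockdevices", [])}


def batched_lsblk():
    """One lsblk call for every device, instead of one call per device.

    Hypothetical sketch: -J emits JSON, -b gives sizes in bytes, and -o
    selects only the columns the inventory needs.
    """
    out = subprocess.run(
        ["lsblk", "-J", "-b", "-o", "NAME,TYPE,SIZE,ROTA"],
        capture_output=True, text=True, check=True,
    )
    return parse_lsblk_json(out.stdout)
```

With 16 drives this replaces 16 lsblk invocations with one; the same pattern applies to pvs, which can also report all PVs in a single call.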

Can anyone comment on the above?

#2 Updated by Jan Fajerski 15 days ago

  • Related to Bug #37490: ceph-volume lvm list is O(n^2) added

#3 Updated by Jan Fajerski 15 days ago

Paul Cuzner wrote:

There are a couple of things that impact the runtime that I need some background on.

For every block device:
  • we run ceph-bluestore-tool, but bluestore is configured on LVs, so every command fails anyway

This was introduced with the new raw mode, which can deploy OSDs on raw block devices. To identify these we call ceph-bluestore-tool.

  • we query for the first LV, but during an inventory we pass the physical device rather than the vg/lv, so this also returns nothing and soaks up time

I think this is due to the fairly new, distinct availability notions. Look for available_lvm and available_raw in util/device.py.

I've batched up some of the lsblk and pvs commands and skipped the two scenarios above; on my test system this brings the runtime of an inventory down from >7s to ~3s (16 devices).

Can anyone comment on the above?

We started work to improve this already a while ago, see the related issue.
It comes down to the Device class in util/device.py. This class is widely used and has been extended for various purposes, so there is a lot of bloat. I would love a major refactor of this class, but due to time constraints and the complexity of the task it is still on the back burner. I'm pretty sure we could also optimize the way we dispatch to the subprocess module.

tl;dr: This is part of the significant tech debt in ceph-volume. I don't think there is a quick fix, since this class is used everywhere, but a major rewrite of it would probably pay off.
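On the subprocess-dispatch point: one hedged sketch (not ceph-volume's actual code; `run_cached` is a hypothetical helper) is to memoise identical command invocations within a single inventory pass, so repeated queries for the same data only pay the fork/exec cost once:

```python
import functools
import subprocess


@functools.lru_cache(maxsize=None)
def run_cached(*argv):
    """Run a command once and reuse its output for identical invocations.

    Illustrative only. This is safe within a single inventory pass, where
    device state is assumed not to change between calls; a long-lived
    process would need cache invalidation.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout
```

The trade-off is staleness: the cache must be scoped to one run (or cleared with `run_cached.cache_clear()`) rather than held across the daemon's lifetime.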

#4 Updated by Paul Cuzner 12 days ago

Agreed, a rewrite is probably the better long-term goal; but ultimately, if the code relies on lvs/pvs/vgs/blkid/lsblk and ceph-bluestore-tool, it's going to be problematic anyway.

The simplest and lowest-risk way to reduce inventory runtime is to multi-thread the Device object creation. In my tests this cuts the runtime in half with 4 threads (more than 4 doesn't yield further gains, so I suspect contention somewhere... perhaps in lvm).
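A minimal sketch of that threading approach, assuming Python's concurrent.futures (`build_devices` and `device_factory` are illustrative names, with the factory standing in for ceph-volume's util.device.Device class; the pool size of 4 matches the observation above):

```python
from concurrent.futures import ThreadPoolExecutor


def build_devices(paths, device_factory, max_workers=4):
    """Create device objects for many paths in parallel.

    Illustrative sketch: constructing a Device is expensive because it
    shells out to several CLI tools, so the work is I/O-bound and the
    calls can overlap while each thread waits on its subprocesses.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so the inventory listing is stable.
        return list(pool.map(device_factory, paths))
```

Because the per-device cost is dominated by waiting on subprocesses rather than Python computation, the GIL is not the bottleneck here, which is consistent with threads (rather than processes) yielding the observed speedup.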

I think this would be worthwhile as an interim step.

#6 Updated by Nathan Cutler 11 days ago

  • Status changed from New to Fix Under Review
  • Assignee set to Paul Cuzner
  • Pull request ID set to 37013
