Bug #37502

lvm batch potentially creates multi-pv volume groups

Added by Jan Fajerski 9 months ago. Updated 20 days ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
Start date: 12/03/2018
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Both the bluestore and the filestore MixedStrategy create one volume group when multiple free SSDs are detected. This can lead to scenarios where a single bad SSD takes down significantly more OSDs than necessary.

Consider a machine with 2 SSDs and 10 spinners. A batch call with all drives will create one vg spanning both SSDs and place the wal/db volumes on this vg. If one SSD goes bad, the single vg becomes inaccessible and in turn all OSDs on the machine go down.

The better implementation would be to create one vg per pv/device.
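To illustrate the difference (a sketch only; device and vg names here are hypothetical, not the exact names batch generates):

# current batch behaviour: one vg spanning both SSDs
vgcreate ceph-block-dbs /dev/sdb /dev/sdc    # losing either PV makes every db/wal LV unavailable

# proposed: one vg per device, so a bad SSD only affects the LVs on that device
vgcreate ceph-block-dbs-0 /dev/sdb
vgcreate ceph-block-dbs-1 /dev/sdc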

History

#1 Updated by Jan Fajerski 9 months ago

I guess the advantage of the current implementation is that a single vg is easier to manage?

#2 Updated by Alfredo Deza 9 months ago

That is correct. We did try to implement this with one VG per backing device and it was incredibly difficult. Using a single VG allows a far simpler implementation (though still quite complex).

#3 Updated by Martin Weiss 5 months ago

So in case we have a ratio of e.g. 24:4 for a spinner vs. NVMe setup, is it expected that a single NVMe failure takes the whole OSD host out of business?
In that case, is it possible to fall back to the previous non-LVM deployment method?

#4 Updated by Alfredo Deza 5 months ago

Martin, you can create the LVs in any way that is preferable for you, and then pass those on to ceph-volume (no batch):

ceph-volume lvm create --data /path/to/data-lv --block.db /path/to/data-db

The caveat being that it will involve more work to sort that out (and whatever failure domain you need).
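For example (a sketch only; device names, vg/lv names and sizes are hypothetical, and --block.db is the bluestore form of the option):

pvcreate /dev/nvme0n1
vgcreate ceph-db-0 /dev/nvme0n1
lvcreate -n db-osd0 -L 60G ceph-db-0
ceph-volume lvm create --bluestore --data /dev/sda --block.db ceph-db-0/db-osd0

Since each vg sits on a single SSD, a failed SSD only takes down the OSDs whose db LVs live in that vg.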

#5 Updated by Martin Weiss 5 months ago

Alfredo Deza wrote:

Martin, you can create the LVs in any way that is preferable for you, and then pass those on to ceph-volume (no batch):

[...]

The caveat being that it will involve more work to sort that out (and whatever failure domain you need).

Thanks for the quick reply!

So if I understand this right, we must not use batch mode in production environments, as it creates a single-point-of-failure VG, and because of that we cannot make use of the LVM management that ceph-volume handles?

Instead, we should step through a self-created process for VG and LV creation and then run "ceph-volume lvm create" sequentially?
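A sketch of such a sequential workflow for the 24:4 example above (hypothetical device names; one vg per NVMe, six db LVs each):

for i in 0 1 2 3; do
    vgcreate ceph-db-$i /dev/nvme${i}n1
    for j in 0 1 2 3 4 5; do
        lvcreate -n db-$i-$j -L 60G ceph-db-$i
    done
done
# then one "ceph-volume lvm create" per spinner, pairing it with a db LV:
# ceph-volume lvm create --bluestore --data /dev/sdX --block.db ceph-db-N/db-N-M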

#6 Updated by Jan Fajerski 4 months ago

  • Description updated (diff)

#7 Updated by Daniel Oliveira 21 days ago

I started looking into this last week and am already testing some changes we could propose in order to create one VG per OSD. Are there any further ideas/thoughts on this that we should take into consideration now, while still testing/proposing these changes?

#8 Updated by Jan Fajerski 20 days ago

Andrew and I chatted about this at some point. Andrew had the idea to push the lv creation into the code of the create subcommand. This would then also create the pv and vg if they are not already present. This seems like a good approach if you don't already have other plans.
