Feature #57785

fragmentation score in metrics

Added by Kevin Fox 4 months ago. Updated 3 days ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Currently the bluestore fragmentation score does not seem to be exported in metrics. Due to the issue described in https://tracker.ceph.com/issues/57672, it would be really nice to have that metric available so it can be acted upon by a metrics/alerting system.

History

#1 Updated by Vikhyat Umrao 4 months ago

Yaarit/Laura - can we do something in telemetry perf channels?

#2 Updated by Laura Flores 4 months ago

Looks like we can get the fragmentation score via an admin socket command:

$ sudo ceph tell osd.0 bluestore allocator score block
{
    "fragmentation_rating": 0.1482800165209559
}
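
For scripting, the rating can be pulled straight out of that JSON, e.g. with jq (the jq pipeline here is just an illustration, not something the command itself provides):

$ sudo ceph tell osd.0 bluestore allocator score block | jq -r '.fragmentation_rating'
0.1482800165209559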

#3 Updated by Laura Flores 3 months ago

Hey Kevin (and Vikhyat),

I have a few questions regarding the fragmentation score:

1. Where are all the places that the fragmentation score is available? I see that it is available via the above admin socket command, but is it available elsewhere, say, in the osdmap?
3. Are there any privacy concerns related to collecting this metric?
4. What specific questions would we like to answer in collecting the fragmentation score?
5. Is there any additional information related to the fragmentation score that would be helpful to collect?
6. Telemetry scrapes metrics once every 24 hours (unless the user increases/decreases this frequency). Is a 24-hour snapshot of the fragmentation score enough to answer the questions you have surrounding the metric?

#4 Updated by Kevin Fox 3 months ago

I'm just a user so I can't answer some of the questions. I'll fill in what I know though.

1. Not sure.
3. No privacy concerns that I know of.
4. Some background is in this thread: https://tracker.ceph.com/issues/57672. Short answer: a fragmentation score that gets too high, on HDDs, or sometimes on SSDs when a separate metadata drive isn't used, can cause performance issues, or can flat out break the cluster in ways that are difficult to repair.
5. Not sure. Perhaps a metric stating whether the metadata is separate from the raw data.
6. That one's tricky. I managed to fragment 500GB of drive space to an unusable level in around 6 days on one cluster. So daily would probably catch it for most clusters, but that is on the edge of being too infrequent.

Thanks,
Kevin

#5 Updated by Yaarit Hatuka 3 months ago

Kevin Fox wrote:

Currently the bluestore fragmentation score does not seem to be exported in metrics. Due to the issue described in https://tracker.ceph.com/issues/57672, it would be really nice to have that metric available so it can be acted upon by a metrics/alerting system.

Hi Kevin,
Are you referring to exposing the bluestore fragmentation score in the perf counters?
Thanks,
Yaarit

#6 Updated by Kevin Fox 3 months ago

Ultimately, I'd like it in Prometheus, so I can set up alerts if it gets too high.
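
For illustration, the kind of rule I have in mind would look roughly like this. The metric name ceph_bluestore_fragmentation_score, the 0.8 threshold, and the rules file path are all made up here, since no such metric exists yet:

# Hypothetical sketch only: assumes a future metric named ceph_bluestore_fragmentation_score.
# Adjust the rules path to wherever your Prometheus loads rule files from.
cat > /etc/prometheus/rules.d/bluestore_fragmentation.yml <<'EOF'
groups:
  - name: bluestore-fragmentation
    rules:
      - alert: BluestoreFragmentationHigh
        expr: ceph_bluestore_fragmentation_score > 0.8
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "BlueStore fragmentation score above 0.8 on {{ $labels.ceph_daemon }}"
EOF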

#7 Updated by Vikhyat Umrao 3 months ago

Laura - sorry I missed the update. Can you please ping Adam and Igor?

#8 Updated by Laura Flores 3 months ago

@Vikhyat, no worries. Based on Kevin's comment, I think this metric might be better suited for Prometheus than Telemetry.

#9 Updated by Laura Flores 2 months ago

@Kevin I have asked Paul Cuzner to take a look at this tracker and offer his opinion, as he has done a lot of work for Prometheus. He may be able to decide if the bluestore fragmentation score is a suitable addition to Prometheus.

#10 Updated by Kevin Fox 2 months ago

We've had to hack a script together to monitor one of our clusters, and it has already been useful in catching an issue:
https://tracker.ceph.com/issues/58022
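
For anyone curious, a rough sketch of what such a check could look like (this is not the actual script; the 0.8 threshold and the jq dependency are just placeholders):

#!/bin/bash
# Rough illustrative sketch, not the script mentioned above.
# Warn on any OSD whose fragmentation score exceeds a placeholder threshold.
THRESHOLD=0.8
for osd in $(ceph osd ls); do
    score=$(ceph tell "osd.$osd" bluestore allocator score block | jq -r '.fragmentation_rating')
    if awk -v s="$score" -v t="$THRESHOLD" 'BEGIN { exit !(s > t) }'; then
        echo "WARNING: osd.$osd fragmentation_rating $score exceeds $THRESHOLD"
    fi
done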

#11 Updated by Laura Flores 2 months ago

Thanks for sharing this, Kevin. We discussed this Tracker more in the Telemetry huddle, and we are curious whether you would find it helpful to have a fragmentation threshold tied to a health warning, i.e. if the fragmentation score on any OSD exceeds 0.8, you would get a health warning about it, perhaps with some details on how to resolve it.

We would need to talk about the logistics of this with Adam Kupczyk, but what are your thoughts?

#12 Updated by Kevin Fox 2 months ago

A Ceph warning for it would also be quite useful, I think.
https://access.redhat.com/documentation/fr-fr/red_hat_ceph_storage/5/html/administration_guide/osd-bluestore#what-is-the-bluestore-fragmentation-tool_admin
has a definition of what the score values mean.

Not sure where the warning threshold should be, though. Somewhere in the 0.7 to 0.9 range, probably.

Our cluster fell apart at ~0.93, but that was due to a bug, so I'm not sure how close to 0.9 it should be.

Fragmentation on SSDs, without the bug, doesn't seem to be as big a deal, though. So maybe also a way to turn the warning off when fragmentation gets high but you know it's not a problem? Or a separate fragmentation warning level for HDDs vs. SSDs?

#13 Updated by Laura Flores 2 months ago

Thanks, Kevin. Let me talk this over with Adam and Paul, and we will decide a course of action.

#14 Updated by Kevin Fox 2 months ago

❤️

#15 Updated by Laura Flores 2 months ago

We have a meeting scheduled for next week to discuss this topic.

#16 Updated by Yaarit Hatuka 2 months ago

After syncing with Adam Kupczyk today: 

In the shorter term, we will make the fragmentation score, both for bluefs and bluestore, available as perf counters, which will be accessible to both the prometheus and telemetry mgr modules. This will also allow us to generate a health warning based on these counters.

In the longer term, after this PR (os/bluestore: enable 4K allocation unit for BlueFS - https://github.com/ceph/ceph/pull/48854) is merged, we will only need to look at the bluestore fragmentation score (as bluefs will be aligned with it).

As for the fragmentation score calculation cadence: if the score is below a certain threshold (probably 0.5), we can calculate it once a day; otherwise, we can calculate it hourly. The calculation should not affect performance.
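
To make that cadence concrete, a client-side approximation could look like the sketch below (the thresholds and the single-OSD target are placeholders; the real implementation would live inside the OSD):

# Illustrative polling sketch of the proposed cadence, not the planned in-OSD implementation.
while true; do
    score=$(ceph tell osd.0 bluestore allocator score block | jq -r '.fragmentation_rating')
    if awk -v s="$score" 'BEGIN { exit !(s < 0.5) }'; then
        sleep 86400   # below 0.5: re-check once a day
    else
        sleep 3600    # 0.5 or above: re-check hourly
    fi
done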

#17 Updated by Kevin Fox 2 months ago

❤️

#18 Updated by Paul Cuzner 2 months ago

I think having the metric available opens the door for monitoring escalation for prometheus and less frequently used (?) modules like influx and telegraf. I'm not sure about setting a WARN-level healthcheck for it, though, especially on large clusters where you'd likely end up just muting it.

Regardless, I think the bigger question is automating or streamlining the workflow once this condition has been triggered. For example, as an admin, once I receive the alert/healthcheck... what do I do? Is the quickest thing to redeploy the OSD? If so, perhaps we should look at a ceph orch osd redeploy or something to keep the workflow simple?

#19 Updated by Kevin Fox about 2 months ago

I didn't know it was a problem until I tripped across it. The warning, I think, does more help than harm. Having a documentation item that says "if you see this warning, look for X, Y, and Z; maybe it's not a big problem for you, and you can turn it off like X" reduces the impact for those it doesn't really affect, and it's a one-time thing. But the warning would still help those who don't even know to look for a potential problem.

+1 to having a ceph orch osd redeploy.

#20 Updated by Kevin Fox 4 days ago

Any updates on this?

Thanks,
Kevin

#21 Updated by Yaarit Hatuka 4 days ago

Hi Kevin,
We will implement the aligned fragmentation score after we merge https://github.com/ceph/ceph/pull/48854.

#22 Updated by Kevin Fox 3 days ago

Ok. Thanks.
