Feature #57785: fragmentation score in metrics - bluestore - Ceph

Actions

Copy link

Feature #57785

open

fragmentation score in metrics

Added by Kevin Fox over 1 year ago. Updated about 1 month ago.

Status:

New

Priority:

Normal

Assignee:

Target version:

% Done:

Source:

Tags:

Backport:

Reviewed:

Affected Versions:

Pull request ID:

Description

Currently the bluestore fragmentation score does not seem to be exported in metrics. Due to the issue described in https://tracker.ceph.com/issues/57672, it would be really nice to have that metric available so it can be acted upon by a metrics/alerting system.

Actions

Copy link

Updated by Vikhyat Umrao over 1 year ago

Yaarit/Laura - can we do something in telemetry perf channels?

Actions

Copy link

Updated by Laura Flores over 1 year ago

Looks like we can get the fragmentation score via an admin socket command:

$ sudo ceph tell osd.0 bluestore allocator score block
{
    "fragmentation_rating": 0.1482800165209559
}

Actions

Copy link

Updated by Laura Flores over 1 year ago

Hey Kevin (and Vikhyat),

I have a few questions regarding the fragmentation score:

1. Where are all the places that the fragmentation score is available? I see that it is available via the above admin socket command, but is it available elsewhere, say, in the osdmap?
3. Are there any privacy concerns related to collecting this metric?
4. What specific questions would we like to answer in collecting the fragmentation score?
5. Is there any additional information related to the fragmentation score that would be helpful to collect?
6. Telemetry scrapes metrics once every 24 hours (unless the user increases/decreases this frequency). Is a 24 hour snapshot of the fragmentation score enough to answer the questions you have surrounding the metric?

Actions

Copy link

Updated by Kevin Fox over 1 year ago

I'm just a user so I can't answer some of the questions. I'll fill in what I know though.

1. Not sure
3. No privacy concern I know of.
4. Some background is in this thread: https://tracker.ceph.com/issues/57672 Short answer, is fragmentation score being too high, in cases of hdds, or sometimes on ssds when a metadata drive isn't used, can cause performance issues, or can be flat out break the cluster in difficult to repair ways.
5. Not sure. Perhaps a metric stating if the metadata is separate from the raw data.
6. That one's a tricky one. I managed to fragment 500GB of drive space to an unusable level in around 6 days on one cluster. So, daily would probably catch it for most clusters. But on the edge of being too infrequent.

Thanks,
Kevin

Actions

Copy link

Updated by Yaarit Hatuka over 1 year ago

Kevin Fox wrote:

Currently the bluestore fragmentation score does not seem to be exported in metrics. Due to the issue described in https://tracker.ceph.com/issues/57672, it would be really nice to have that metric available so it can be acted upon by a metrics/alerting system.

Hi Kevin,
Are you referring to exposing the bluestore fragmentation score in the perf counters?
Thanks,
Yaarit

Actions

Copy link

Updated by Kevin Fox over 1 year ago

Ultimately, I'd like it in prometheus, so I can setup alerts if it gets too high.

Actions

Copy link

Updated by Vikhyat Umrao over 1 year ago

Laura - sorry I missed the update. Can you please ping Adam and Igor?

Actions

Copy link

Updated by Laura Flores over 1 year ago

@Vikhyat Umrao, no worries. Based on Kevin's comment, I think this metric might be better suited for Prometheus than Telemetry.

Actions

Copy link

Updated by Laura Flores over 1 year ago

@Kevin H I have asked Paul Cuzner to take a look at this tracker and offer his opinion, as he has done a lot of work for Prometheus. He may be able to decide if the bluestore fragmentation score is a suitable addition to Prometheus.

Actions

Copy link

#10

Updated by Kevin Fox over 1 year ago

We've had to hack a script together to monitor one of our clusters, and it has been useful to catch an issue:
https://tracker.ceph.com/issues/58022

Actions

Copy link

#11

Updated by Laura Flores over 1 year ago

Thanks for sharing this, Kevin. We discussed this Tracker more in the Telemetry huddle, and we are curious if you would find it helpful to have this fragmentation threshold as a health warning, i.e. if the fragmentation score on any OSD exceeds 0.8, you would get a health warning about it, perhaps with some details on how to resolve it.

We would need to talk about the logistics of this with Adam Kupczyk, but what are your thoughts?

Actions

Copy link

#12

Updated by Kevin Fox over 1 year ago

A ceph warning for it would also be quite useful I think.
https://access.redhat.com/documentation/fr-fr/red_hat_ceph_storage/5/html/administration_guide/osd-bluestore#what-is-the-bluestore-fragmentation-tool_admin
has a definition for score meaning.

not sure where the warning should be though. somewhere in the .7 to .9 range probably.

Our cluster fell apart at ~.93 but was due to a bug. so not sure how close to .9 it should be.

Though fragmentation on ssds, without the bug, doesn't seem to be as big a deal. so maybe a way to turn it off too when fragmentation gets high but you know its not a problem? Or have a separate fragmentation level warning for hdd vs ssd?

Actions

Copy link

#13

Updated by Laura Flores over 1 year ago

Thanks, Kevin. Let me talk this over with Adam and Paul, and we will decide a course of action.

Actions

Copy link

#14

Updated by Kevin Fox over 1 year ago

❤️

Actions

Copy link

#15

Updated by Laura Flores over 1 year ago

We have a meeting scheduled for next week to discuss this topic.

Actions

Copy link

#16

Updated by Yaarit Hatuka over 1 year ago

After syncing with Adam Kupczyk today:

In the shorter term we will make the fragmentation score, both for bluefs and bluestore, available as perf counters, which will be available both to prometheus and telemetry mgr modules. This will also allow us to generate a health warning, based on these counters.

In the longer term, after this PR (os/bluestore: enable 4K allocation unit for BlueFS - https://github.com/ceph/ceph/pull/48854) is merged, we will only need to look at the bluestore fragmentation score (as bluefs will be aligned with it).

As for fragmentation score calculation cadence: in case the score is below a certain threshold (probably 0.5), we can calculate a score once a day. Otherwise, we can calculate it hourly. Calculation should not affect performance.

Actions

Copy link

#17

Updated by Kevin Fox over 1 year ago

❤️

Actions

Copy link

#18

Updated by Paul Cuzner over 1 year ago

I think having the metric available opens the door for monitoring escalation for prometheus and less frequently used (?) modules like influx and telegraf. I'm not sure about setting a WARN level healthcheck for it though, especially on large clusters where you'd like end up just muting it.

Regardless, I think the bigger question is automating or streamlineing the workflow once this condition has been triggered. For example, as an admin once I receive the alert/healthcheck...what do i do? Is the quickest thing to redeploy the OSD? If so, perhaps we should look at a ceph orch osd redeploy or something to keep the workflow simple?

Actions

Copy link

#19

Updated by Kevin Fox over 1 year ago

I didn't know it was a problem until I tripped across it. The warning I think does more help then harm. Having a documentation item that says, if you see this warning, look for X Y and Z and maybe its not a big problem for you, you can turn it off like X, reduces the impact for those it doesn't really affect, and its a 1 time thing. But the warning would still help those that don't know to even look for a potential problem?

+1 to having an ceph orch osd redeploy.

Actions

Copy link

#20