Bug #44323 (open)

mgr/telemetry: very crashy clusters can break telemetry

Added by Lars Marowsky-Brée about 4 years ago. Updated about 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: telemetry module
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a large production cluster where things weren't going too smoothly, we noticed telemetry data was no longer sent successfully. (Despite telemetry being enabled properly.)

We're still looking at it in more detail, but I suspect that mgr/telemetry/module.py:gather_crashinfo() is the problem.

This always retrieves all crashes; it neither restricts them to those since the last successful telemetry upload nor limits them to, say, the last 5 per node/daemon or some total of N per cluster.

I believe it should do all of those:

- only retrieve/look at crashes since the last send (or since telemetry was enabled)
(On the plus side, this would also greatly reduce the storage on the telemetry server)
- limit to the last 3 per daemon
- limit to 10 per cluster
- add an "N crashes were omitted" flag if either of these limits triggers

If daemons are crashing frequently, it's quite likely it's always the same problem, and there's no benefit in sending hundreds of crashes with the same signature. A rough sketch of what I mean is below.
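Purely as an illustration (not the actual gather_crashinfo() code), something along these lines. The field names ('timestamp', 'entity_name', 'stack_sig') and the specific limits are assumptions:

    from collections import defaultdict

    # Hypothetical limits; the exact numbers are up for discussion.
    MAX_PER_DAEMON = 3
    MAX_PER_CLUSTER = 10

    def select_crashes(crashes, last_upload_ts):
        """Pick the subset of crash reports worth sending.

        Assumes `crashes` is a list of dicts with 'timestamp' (ISO 8601
        string), 'entity_name' (e.g. 'osd.12') and 'stack_sig' keys; the
        real crash module output may use different field names.
        """
        # Only consider crashes newer than the last successful send
        # (ISO 8601 strings with a uniform format compare correctly).
        recent = [c for c in crashes if c['timestamp'] > last_upload_ts]
        recent.sort(key=lambda c: c['timestamp'], reverse=True)

        selected = []
        per_daemon = defaultdict(int)
        for c in recent:
            if len(selected) >= MAX_PER_CLUSTER:
                break
            if per_daemon[c['entity_name']] >= MAX_PER_DAEMON:
                continue
            per_daemon[c['entity_name']] += 1
            selected.append(c)

        # "N crashes were omitted" flag for the receiving end.
        return {'crashes': selected, 'num_omitted': len(recent) - len(selected)}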

Thoughts?

#1

Updated by Lars Marowsky-Brée about 4 years ago

Indeed, disabling the crashes channel allowed the cluster in question to successfully report the rest of the telemetry data.

#2

Updated by Yaarit Hatuka about 4 years ago

  • Backport set to nautilus

Lars Marowsky-Brée wrote:

> On a large production cluster where things weren't going too smoothly, we noticed telemetry data was no longer sent successfully. (Despite telemetry being enabled properly.)

> We're still looking at it in more detail, but I suspect that mgr/telemetry/module.py:gather_crashinfo() is the problem.

> This always retrieves all crashes; it neither restricts them to those since the last successful telemetry upload nor limits them to, say, the last 5 per node/daemon or some total of N per cluster.

> I believe it should do all of those:
>
> - only retrieve/look at crashes since the last send (or since telemetry was enabled)
> (On the plus side, this would also greatly reduce the storage on the telemetry server)

+1 We should fix this + backport to Nautilus.

> - limit to the last 3 per daemon
> - limit to 10 per cluster
> - add an "N crashes were omitted" flag if either of these limits triggers

Sounds like the right thing to do when crash data would otherwise prevent the rest of the telemetry from being sent successfully. Not sure what the best cutoff would be.
We might want to keep at least the signatures of the N omitted crashes for future reference?
Maybe try to send those N crashes at a later date, whenever possible?

> If daemons are crashing frequently, it's quite likely it's always the same problem, and there's no benefit in sending hundreds of crashes with the same signature.

+1 Maybe include a counter of these identical crashes so we have a better understanding of their frequency?

> Thoughts?
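Regarding the counter idea above, a minimal sketch of collapsing crashes that share a stack signature into one entry with an occurrence count; the 'stack_sig' field name is an assumption:

    from collections import Counter

    def dedup_by_signature(crashes):
        """Collapse crashes with identical stack signatures into a single
        representative entry plus an occurrence count.

        Assumes each crash dict carries a 'stack_sig' key; the exact key
        name may differ in the real crash reports.
        """
        counts = Counter(c['stack_sig'] for c in crashes)
        deduped = {}
        for c in crashes:
            sig = c['stack_sig']
            if sig not in deduped:
                entry = dict(c)
                entry['count'] = counts[sig]  # how often this signature occurred
                deduped[sig] = entry
        return list(deduped.values())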

#3

Updated by Yaarit Hatuka about 4 years ago

Lars, do you know how many crashes were retrieved altogether at once, and how many of them were unique?
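In case it helps to gather those numbers, a rough sketch using the crash module CLI; it assumes "ceph crash ls -f json" lists entries with a 'crash_id' field and that "ceph crash info <id> -f json" includes a 'stack_sig' field, which may not hold on every release:

    import json
    import subprocess
    from collections import Counter

    def count_crash_signatures():
        """Report total vs. unique crashes on the local cluster."""
        out = subprocess.check_output(['ceph', 'crash', 'ls', '-f', 'json'])
        entries = json.loads(out)

        sigs = Counter()
        for entry in entries:
            info = subprocess.check_output(
                ['ceph', 'crash', 'info', entry['crash_id'], '-f', 'json'])
            sigs[json.loads(info).get('stack_sig', 'unknown')] += 1

        print(f'{sum(sigs.values())} crashes total, {len(sigs)} unique signatures')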
