Bug #44323

open

mgr/telemetry: very crashy clusters can break telemetry

Added by Lars Marowsky-Brée about 4 years ago. Updated about 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
telemetry module
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a large production cluster where things weren't going too smoothly, we noticed telemetry data was no longer being sent successfully, despite telemetry being enabled properly.

We're still looking at it in more detail, but I suspect that mgr/telemetry/module.py:gather_crashinfo() is the problem.

It always retrieves all crashes, rather than only those since the last successful telemetry upload, and does not limit them to, say, the last 5 per node/daemon or N in total.

I believe it should do all of the following:

- only retrieve/look at crashes since the last send (or since telemetry was enabled)
(On the plus side, this would also greatly reduce the storage on the telemetry server)
- limit to the last 3 per daemon
- limit to 10 per cluster
- Add an "N crashes were omitted" flag if either of the two limits triggers

If daemons crash frequently, it's quite likely always the same problem, and there's no benefit in sending hundreds of reports with the same signature.
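
For illustration, a minimal sketch of the proposed filtering, not the actual gather_crashinfo() implementation. The field names ('timestamp', 'entity_name'), the caps, and the helper name limit_crashes() are assumptions for the sake of the example:

    from collections import defaultdict

    MAX_PER_DAEMON = 3      # hypothetical cap, per the proposal above
    MAX_PER_CLUSTER = 10    # hypothetical cap, per the proposal above

    def limit_crashes(crashes, last_upload):
        """Drop crashes older than the last successful upload, then cap the
        remainder per daemon and per cluster. Returns the kept crashes and
        the number omitted by the caps."""
        # Assumes each crash is a dict with 'timestamp' and 'entity_name'
        # fields, and that timestamps are strings in a lexicographically
        # sortable format, comparable to last_upload.
        recent = [c for c in crashes if c['timestamp'] > last_upload]
        recent.sort(key=lambda c: c['timestamp'], reverse=True)

        per_daemon = defaultdict(int)
        kept = []
        for c in recent:
            if len(kept) >= MAX_PER_CLUSTER:
                break
            if per_daemon[c['entity_name']] >= MAX_PER_DAEMON:
                continue
            per_daemon[c['entity_name']] += 1
            kept.append(c)

        omitted = len(recent) - len(kept)
        return kept, omitted

The omitted count could then be reported alongside the kept crashes so the telemetry server knows reports were deliberately dropped.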

Thoughts?
