Bug #44323
open
mgr/telemetry: very crashy clusters can break telemetry
0%
Description
On a large production cluster where things weren't going too smoothly, we noticed that telemetry data was no longer being sent successfully, even though telemetry was enabled properly.
We're still looking at it in more detail, but I suspect that mgr/telemetry/module.py:gather_crashinfo() is the problem.
It always retrieves all crashes. It neither restricts them to those since the last successful telemetry upload nor caps them at, say, the last 5 per node/daemon or N in total.
I believe it should do all of the following:
- only retrieve/look at crashes since the last send (or since telemetry was enabled); on the plus side, this would also greatly reduce storage use on the telemetry server
- limit to the last 3 per daemon
- limit to 10 per cluster
- add an "N crashes were omitted" flag if either of the two limits triggers
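To make the proposal concrete, here is a rough sketch of the selection logic. It is not the existing gather_crashinfo() code. The function name select_crashes, the constants, and the assumption that each crash is a dict with an "entity_name" key and an already-parsed datetime under "timestamp" are all hypothetical and used only for illustration.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Hypothetical limits taken from the proposal above.
MAX_PER_DAEMON = 3
MAX_PER_CLUSTER = 10


def select_crashes(
    crashes: List[Dict],
    last_send: Optional[datetime],
    max_per_daemon: int = MAX_PER_DAEMON,
    max_per_cluster: int = MAX_PER_CLUSTER,
) -> Tuple[List[Dict], int]:
    """Return (crashes to report, number of crashes omitted by the limits).

    Assumes each crash is a dict with 'entity_name' (daemon name) and
    'timestamp' (a datetime). last_send is None if nothing was sent yet.
    """
    # Only consider crashes newer than the last successful send.
    recent = [
        c for c in crashes
        if last_send is None or c["timestamp"] > last_send
    ]
    # Newest first, so the limits keep the most recent crashes.
    recent.sort(key=lambda c: c["timestamp"], reverse=True)

    per_daemon = defaultdict(int)
    selected = []
    for crash in recent:
        if len(selected) >= max_per_cluster:
            break  # cluster-wide limit reached
        daemon = crash["entity_name"]
        if per_daemon[daemon] >= max_per_daemon:
            continue  # per-daemon limit reached, try other daemons
        per_daemon[daemon] += 1
        selected.append(crash)

    # Count only crashes dropped by the limits, not those older than last_send.
    omitted = len(recent) - len(selected)
    return selected, omitted
```

The returned count could back the "N crashes were omitted" flag in the report. Whether omitted crashes should later be resent or simply dropped is left open here.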
If daemons are crashing more often than that, it's quite likely always the same problem, and there's no benefit in sending hundreds of reports with the same signature.
Thoughts?