This ticket is already marked Urgent, but that clearly hasn't had any effect. I'm writing this comment to raise awareness of the gap. End-to-end test coverage for perf counters -> exporter daemon -> Prometheus metrics is desperately needed because this pipeline keeps getting broken.
In April, 17.2.6 went out with a regression reported by a user in https://github.com/ceph/ceph/pull/50718. The immediate workaround was to disable exporter deployment in Rook (https://github.com/rook/rook/pull/12077) and, later, to revert the corresponding changes in cephadm (https://github.com/ceph/ceph/pull/51053).
Pere addressed the root cause in the exporter daemon in https://github.com/ceph/ceph/pull/51069. That PR was approved based on manual tests and the commitment to fulfill this ticket; see https://github.com/ceph/ceph/pull/51069#pullrequestreview-1399729291.
About a month later, in May, https://github.com/ceph/ceph/pull/50749 got merged into quincy and completely wrecked the exporter daemon. Because there are no automated tests, this went unnoticed until yesterday -- just a day before the planned 17.2.7 release date. The revert happened in https://github.com/ceph/ceph/pull/54169, and the builds now need to be respun.
Despite the exporter being partially disabled in quincy, it's still something that is shipped. Further, and much more importantly, I don't think the situation with regard to automated testing is noticeably better in reef or main, which means that the exporter is pretty much bound to break again in a release where it's enabled by default.
Avan, what can we do to actually prioritize this ticket?