Bug #45796
Ceph mons sporadically report slow ops
Status: open
Description
We recently upgraded our cluster from 10.2.6 to 14.2.9 and are in the process of a rolling rebuild of many of the OSDs.
We have started seeing the cluster go into HEALTH_WARN sporadically due to slow ops on the mons.
Looking into the log, the slow operations are always osd_pgtemp ops. Following one of them further, I can see the op in question coming from an OSD; there seems to be a race over which of several OSDs will take the PG. The OSD in question then loses this race and marks itself as "Stray".
My belief is that the op on the mon is no longer needed or valid because the epoch has moved forward, and so the slow op on the monitor should be discarded or otherwise cleaned up.
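One way to check this against a dump of a monitor's in-flight ops (for example, the JSON from `ceph daemon mon.<id> ops`) is sketched below. This is only a rough illustration, not a fix: it assumes the dump has a top-level `ops` list whose entries carry `description` and `age` fields, and that each osd_pgtemp description begins with `osd_pgtemp(e<epoch>`. The function name `stale_pgtemp_ops` and the sample data are hypothetical, and the current osdmap epoch has to be supplied by hand.

```python
import re

# Assumption: an osd_pgtemp op is described as "osd_pgtemp(e<epoch> {...} v<version>)".
PGTEMP_RE = re.compile(r"^osd_pgtemp\(e(\d+)\b")


def stale_pgtemp_ops(dump, current_epoch):
    """Return (epoch, age, description) for each osd_pgtemp op whose
    epoch is older than current_epoch.

    `dump` is the parsed JSON of the monitor's in-flight ops; the
    "ops", "description" and "age" keys are assumed field names.
    """
    stale = []
    for op in dump.get("ops", []):
        desc = op.get("description", "")
        match = PGTEMP_RE.match(desc)
        if match is None:
            continue  # not an osd_pgtemp op
        epoch = int(match.group(1))
        if epoch < current_epoch:
            stale.append((epoch, op.get("age"), desc))
    return stale


# Hypothetical sample data, made up for illustration (not from our cluster).
sample_dump = {
    "ops": [
        {"description": "osd_pgtemp(e1234 {1.2a=[3,7]} v1234)", "age": 5400.5},
        {"description": "osd_pgtemp(e1300 {1.2b=[4,8]} v1300)", "age": 2.1},
        {"description": "log(1 entries from seq 10)", "age": 0.3},
    ],
    "num_ops": 3,
}

for epoch, age, desc in stale_pgtemp_ops(sample_dump, current_epoch=1300):
    print(f"stale: epoch={epoch} age={age}s {desc}")
```

If the ops reported as slow are consistently the ones listed here, that would support the idea that they refer to an epoch the cluster has already moved past.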
I have attached the ceph-versions output for our cluster, a dump of the monitors' in-flight ops, and a snippet from the OSD log showing the request to the mon. If I can gather any more diagnostic details, please let me know.
Files