rgw: expose the number of *stuck threads* via admin socket
On our Ceph cluster, we have hit a few cases where rgw returned only HTTP 500 because all worker threads were stuck on something. I am wondering whether we could expose the number of stuck workers via the admin socket, so that a watchdog daemon could restart radosgw once it detects that all workers are stuck, to improve system availability.
After looking at the perf dump from rgw, the closest counter is 'qlen', which reflects the length of the work queue. While it is close, I think it would be more robust and accurate to expose something that reports stuck threads directly.
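To make the watchdog idea concrete, here is a rough sketch of what such a daemon could look like. It is only an illustration under assumptions: the counter name `stuck_threads` is hypothetical (rgw does not expose it today; it stands in for whatever this request would add), and the admin socket path, systemd unit name, and worker count are placeholders that the operator would supply.

```python
import json
import subprocess
import time

# Hypothetical counter name: NOT an existing rgw perf counter. It stands in
# for whatever the admin socket would expose if this request is implemented.
STUCK_COUNTER = "stuck_threads"


def read_perf_dump(asok_path):
    """Return the parsed `perf dump` output from the given admin socket."""
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", asok_path, "perf", "dump"]
    )
    return json.loads(out)


def find_counter(perf, name):
    """Search every section of the perf dump for `name`; None if absent."""
    for section in perf.values():
        if isinstance(section, dict) and name in section:
            return section[name]
    return None


def all_workers_stuck(perf, num_workers):
    """True only if the counter is present and covers every worker thread."""
    stuck = find_counter(perf, STUCK_COUNTER)
    return stuck is not None and num_workers > 0 and stuck >= num_workers


def watch(asok_path, unit, num_workers, interval=30):
    """Poll the admin socket and restart the unit when all workers are stuck.

    `unit` is an assumed systemd unit name for the radosgw instance.
    """
    while True:
        try:
            perf = read_perf_dump(asok_path)
        except (subprocess.CalledProcessError, ValueError):
            perf = {}  # socket unreachable or bad JSON: do nothing this round
        if all_workers_stuck(perf, num_workers):
            subprocess.call(["systemctl", "restart", unit])
        time.sleep(interval)
```

Here `num_workers` would be the size of the rgw worker pool on that host. The sketch deliberately does nothing when the counter is missing, so it stays inert against an rgw that does not expose it.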