Feature #12666
rgw: expose the number of *stuck threads* via admin socket
% Done: 0%
Description
In our Ceph cluster, we have seen rgw return only HTTP 500 a couple of times because all worker threads were stuck on something. I am wondering if we could expose the number of stuck workers via the admin socket, so that a watchdog daemon can restart radosgw once it detects that all workers are stuck, improving system availability.
After looking at the perf dump from rgw, the closest counter is 'qlen', which reflects the length of the work queue. While it is close, I think it is more robust and accurate to expose something directly for stuck threads.
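To make the watchdog idea concrete, here is a rough sketch of the check such a daemon could run. It assumes the counter shows up in the `perf dump` output under a section named `rgw` with the name `stuck_workers`; both names are placeholders I made up for illustration, not existing counters, and the real section name depends on how the rgw instance is named. The worker count would have to come from the rgw thread pool configuration.

```python
import json
import subprocess


def read_perf_dump(asok_path):
    """Run `perf dump` against the given admin socket and parse the JSON."""
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", asok_path, "perf", "dump"]
    )
    return json.loads(out)


def all_workers_stuck(perf_dump, total_workers,
                      section="rgw", counter="stuck_workers"):
    """Return True when every worker thread is reported as stuck.

    `section` and `counter` are hypothetical names; adjust them to
    whatever the perf counter ends up being called.
    """
    stuck = perf_dump.get(section, {}).get(counter, 0)
    # With no workers configured there is nothing to judge, so do not
    # report a stuck daemon in that case.
    return total_workers > 0 and stuck >= total_workers
```

A watchdog would call `read_perf_dump()` on the radosgw socket periodically and restart the daemon only when `all_workers_stuck()` stays true across a few consecutive polls, so that a short burst of slow requests does not trigger a restart.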
Thoughts?
Associated revisions
rgw: enable perf counter for unhealthy workers
Fixes: #12666
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
History
#1 Updated by Guang Yang about 8 years ago
#2 Updated by Sage Weil about 8 years ago
- Status changed from New to Resolved
#3 Updated by John Spray over 5 years ago
- Project changed from Ceph to rgw
- Category deleted (22)
Bulk reassign of radosgw category to RGW project.