Feature #12666

rgw: expose the number of *stuck threads* via admin socket

Added by Guang Yang over 8 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

In our Ceph cluster, we have several times seen rgw return only HTTP 500 because all of its worker threads were stuck on something. I am wondering whether we could expose the number of stuck workers via the admin socket, so that a watchdog daemon can restart radosgw once it detects that all workers are stuck, improving system availability.

After looking at the perf dump from rgw, the closest existing counter is 'qlen', which reflects the length of the work queue. While that is close, I think it would be more robust and accurate to expose a counter specifically for stuck threads.
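
To make the watchdog idea concrete, here is a rough Python sketch of what such a daemon might check. It assumes the perf dump would carry a section with a total worker count and a stuck (unhealthy) worker count; the section name `cct` and the keys `total_workers` and `unhealthy_workers` are assumptions for illustration, not something this request specifies. The socket path and restart command are left to the caller.

```python
import json
import subprocess

# Assumed names, for illustration only: adjust to whatever the perf dump exposes.
SECTION = "cct"
TOTAL_KEY = "total_workers"
UNHEALTHY_KEY = "unhealthy_workers"


def all_workers_stuck(perf_dump: dict) -> bool:
    """Return True when every worker reported in the dump is unhealthy."""
    counters = perf_dump.get(SECTION, {})
    total = counters.get(TOTAL_KEY, 0)
    unhealthy = counters.get(UNHEALTHY_KEY, 0)
    # If no workers are reported we cannot conclude anything, so do not restart.
    return total > 0 and unhealthy >= total


def read_perf_dump(asok_path: str) -> dict:
    """Query the admin socket with `ceph --admin-daemon <path> perf dump`."""
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", asok_path, "perf", "dump"]
    )
    return json.loads(out)


def check_and_restart(asok_path: str, restart_cmd: list) -> bool:
    """Run the caller-supplied restart command if all workers are stuck."""
    if all_workers_stuck(read_perf_dump(asok_path)):
        subprocess.check_call(restart_cmd)
        return True
    return False
```

A real watchdog would probably call `check_and_restart` periodically and require several consecutive positive checks before restarting, to avoid reacting to a short-lived stall.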

Thoughts?

Associated revisions

Revision 15a3e866 (diff)
Added by Guang Yang over 8 years ago

rgw: enable perf counter for unhealthy workers

Fixes: #12666
Signed-off-by: Guang Yang <>

History

#2 Updated by Sage Weil over 8 years ago

  • Status changed from New to Resolved

#3 Updated by John Spray about 6 years ago

  • Project changed from Ceph to rgw
  • Category deleted (22)

Bulk reassign of radosgw category to RGW project.
