Radosgw threads hang indefinitely when a PG goes inactive
In the current radosgw behaviour, every read request makes a blocking call to the OSD to fetch the object. When the object belongs to a PG that has gone inactive for any reason (either all OSDs for the object are down, or the number of available OSDs is below min_size), the radosgw thread hangs indefinitely waiting for the PG to become active. With multiple such requests, all radosgw threads are soon exhausted and rgw can no longer serve any client requests, including those targeting active PGs. The result is complete service unavailability.
P.S.: We hit this issue on the luminous version, but it can be reproduced on the master branch as well.
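The thread exhaustion described above can be illustrated with a small simulation. This is only a sketch and does not talk to a Ceph cluster: `POOL_SIZE`, `read_object` and `simulate` are hypothetical names, a Python thread pool stands in for the radosgw threads, and an unset `threading.Event` stands in for a PG that never becomes active.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

POOL_SIZE = 4  # hypothetical stand-in for the number of radosgw threads

# Stays unset for as long as the simulated PG is inactive.
pg_active = threading.Event()


def read_object(pg_is_active):
    """Stand-in for a blocking OSD read (not a real librados call)."""
    if not pg_is_active:
        pg_active.wait()  # no timeout: the worker waits until the PG is active
    return "ok"


def simulate():
    pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
    try:
        # A read against an active PG succeeds while workers are free.
        before = pool.submit(read_object, True).result(timeout=1)
        # POOL_SIZE reads against the inactive PG tie up every worker.
        for _ in range(POOL_SIZE):
            pool.submit(read_object, False)
        # A read against an active PG now sits in the queue behind them.
        healthy = pool.submit(read_object, True)
        try:
            after = healthy.result(timeout=0.5)
        except FutureTimeout:
            after = "starved"
        return before, after
    finally:
        pg_active.set()  # "reactivate" the PG so the simulation can exit
        pool.shutdown(wait=True)


if __name__ == "__main__":
    print(simulate())
```

The second value shows that a request for a healthy PG gets no worker once all of them are waiting on the inactive one.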
#2 Updated by Casey Bodley about 1 month ago
yeah, this needs some higher-level discussion and was raised on the ceph devel mailing list. radosgw calls into librados for osd requests, and librados will block indefinitely until a request can be satisfied. changing radosgw to time out on these requests would be complicated, but i agree that it's worth thinking about
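For illustration only, the effect of a bounded wait can be sketched in the same simulated setting. This is not how radosgw or librados implement anything: `OP_TIMEOUT_S`, `read_with_deadline` and `simulate_with_deadline` are hypothetical names, and the sketch ignores the real complications (cleaning up in-flight requests, choosing an error to return to the client).

```python
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4        # hypothetical worker count
OP_TIMEOUT_S = 0.2   # hypothetical per-request deadline, in seconds

# Never set in this sketch: the simulated PG stays inactive throughout.
pg_active = threading.Event()


def read_with_deadline(pg_is_active):
    """Stand-in for an OSD read that gives up after OP_TIMEOUT_S."""
    if not pg_is_active and not pg_active.wait(timeout=OP_TIMEOUT_S):
        return "timed out"  # the worker is released instead of hanging
    return "ok"


def simulate_with_deadline():
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # Reads against the inactive PG occupy every worker, but only briefly.
        stuck = [pool.submit(read_with_deadline, False) for _ in range(POOL_SIZE)]
        # The read against an active PG runs once a worker times out.
        healthy = pool.submit(read_with_deadline, True)
        return [f.result() for f in stuck], healthy.result()


if __name__ == "__main__":
    print(simulate_with_deadline())
```

With a deadline, requests against the inactive PG fail after a bounded delay and the request for the active PG is eventually served.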