Bug #17910: rgw: threads were stuck by discarded osd_op_reply message - Ceph - Ceph

Actions

Copy link

Bug #17910

closed

rgw: threads were stuck by discarded osd_op_reply message

Added by Osamu KIMURA over 7 years ago. Updated over 7 years ago.

Status:

Duplicate

Priority:

High

Assignee:

Category:

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Yesterday, an our customer got a problem on RGW.
After multiple server errors (500) were responded, the RGW was stuck.

The log shows the requests were executed but the RGW discarded a message from OSD (osd_op_reply). Then the threads were stuck to wait for completion of the operations from OSD until the RGW was restarted.

2016-11-14 01:42:15.187977 7f8514d7a700  1 ====== starting new request req=0x7f8604038160 =====
2016-11-14 01:42:15.188024 7f8514d7a700  2 req 28187547:0.000048::GET /XXX/YYY/CurrentVersion.json::initializing for trans_id = tx000000000000001ae1b9b-00582897e7-612bb3-default
2016-11-14 01:42:15.188075 7f8514d7a700  2 req 28187547:0.000098:s3:GET /XXX/YYY/CurrentVersion.json::getting op
2016-11-14 01:42:15.188080 7f8514d7a700  2 req 28187547:0.000104:s3:GET /XXX/YYY/CurrentVersion.json:get_obj:authorizing
2016-11-14 01:42:15.188189 7f8514d7a700  2 req 28187547:0.000213:s3:GET /XXX/YYY/CurrentVersion.json:get_obj:reading permissions
2016-11-14 01:42:15.211395 7f84f6c01700  0 -- 10.10.25.10:0/1953002366 >> 10.10.25.103:6808/38623 pipe(0x7f86300c4710 sd=81 :33387 s=2 pgs=111745 cs=1 l=1 c=0x7f8630033e10).reader got old message 2 <= 4294967295 0x7f8618049770 osd_op_reply(4404528996 default.2705689.3756_YYY/CurrentVersion.json [getxattrs,stat,read 0~524288] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6, discarding
...
2016-11-14 01:52:18.206100 7f86435b3700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f8514d7a700' had timed out after 600
...
2016-11-14 01:52:23.206291 7f86435b3700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f8514d7a700' had timed out after 600
... (snip) ...
2016-11-14 03:34:43.637868 7f86435b3700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f8514d7a700' had timed out after 600
... (many threads were stuck) ...
2016-11-14 03:34:44.732696 7f85cbfff700  1 handle_sigterm
2016-11-14 03:34:44.732765 7f85cbfff700  1 handle_sigterm set alarm for 120
2016-11-14 03:34:44.732782 7f86450fe820 -1 shutting down
2016-11-14 03:34:45.068648 7f85cbfff700  1 handle_sigterm
2016-11-14 03:34:48.157919 7fe773620820  0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 32352

Apache error.log

[Mon Nov 14 01:43:45 2016] [error] [client xxx.xx.xx.xx] FastCGI: comm with server "/var/www/html/s3gw.fcgi" aborted: idle timeout (90 sec)
[Mon Nov 14 01:43:45 2016] [error] [client xxx.xx.xx.xx] FastCGI: incomplete headers (0 bytes) received from server "/var/www/html/s3gw.fcgi"

I'm not sure if it was problem on RGW or OSD.
All the discarded messages were for the same OSD node (10.10.25.103), but there are no suspicious log on the OSD node. In addition, no error occurs after the RGW was restarted.

We plan to propose to set "rgw op thread suicide timeout" for tentative solution, but I would like to know root cause.

Environment:
RGW: v0.94.6 on CentOS 6.7 (Apache + FastCGI)
Ceph cluster (OSD): v0.94.3.3

Related issues 1 (0 open — 1 closed)