Bug #17910
closedrgw: threads were stuck by discarded osd_op_reply message
0%
Description
Yesterday, an our customer got a problem on RGW.
After multiple server errors (500) were responded, the RGW was stuck.
The log shows the requests were executed but the RGW discarded a message from OSD (osd_op_reply). Then the threads were stuck to wait for completion of the operations from OSD until the RGW was restarted.
2016-11-14 01:42:15.187977 7f8514d7a700 1 ====== starting new request req=0x7f8604038160 ===== 2016-11-14 01:42:15.188024 7f8514d7a700 2 req 28187547:0.000048::GET /XXX/YYY/CurrentVersion.json::initializing for trans_id = tx000000000000001ae1b9b-00582897e7-612bb3-default 2016-11-14 01:42:15.188075 7f8514d7a700 2 req 28187547:0.000098:s3:GET /XXX/YYY/CurrentVersion.json::getting op 2016-11-14 01:42:15.188080 7f8514d7a700 2 req 28187547:0.000104:s3:GET /XXX/YYY/CurrentVersion.json:get_obj:authorizing 2016-11-14 01:42:15.188189 7f8514d7a700 2 req 28187547:0.000213:s3:GET /XXX/YYY/CurrentVersion.json:get_obj:reading permissions 2016-11-14 01:42:15.211395 7f84f6c01700 0 -- 10.10.25.10:0/1953002366 >> 10.10.25.103:6808/38623 pipe(0x7f86300c4710 sd=81 :33387 s=2 pgs=111745 cs=1 l=1 c=0x7f8630033e10).reader got old message 2 <= 4294967295 0x7f8618049770 osd_op_reply(4404528996 default.2705689.3756_YYY/CurrentVersion.json [getxattrs,stat,read 0~524288] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6, discarding ... 2016-11-14 01:52:18.206100 7f86435b3700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f8514d7a700' had timed out after 600 ... 2016-11-14 01:52:23.206291 7f86435b3700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f8514d7a700' had timed out after 600 ... (snip) ... 2016-11-14 03:34:43.637868 7f86435b3700 1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f8514d7a700' had timed out after 600 ... (many threads were stuck) ... 2016-11-14 03:34:44.732696 7f85cbfff700 1 handle_sigterm 2016-11-14 03:34:44.732765 7f85cbfff700 1 handle_sigterm set alarm for 120 2016-11-14 03:34:44.732782 7f86450fe820 -1 shutting down 2016-11-14 03:34:45.068648 7f85cbfff700 1 handle_sigterm 2016-11-14 03:34:48.157919 7fe773620820 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 32352
Apache error.log
[Mon Nov 14 01:43:45 2016] [error] [client xxx.xx.xx.xx] FastCGI: comm with server "/var/www/html/s3gw.fcgi" aborted: idle timeout (90 sec) [Mon Nov 14 01:43:45 2016] [error] [client xxx.xx.xx.xx] FastCGI: incomplete headers (0 bytes) received from server "/var/www/html/s3gw.fcgi"
I'm not sure if it was problem on RGW or OSD.
All the discarded messages were for the same OSD node (10.10.25.103), but there are no suspicious log on the OSD node. In addition, no error occurs after the RGW was restarted.
We plan to propose to set "rgw op thread suicide timeout" for tentative solution, but I would like to know root cause.
Environment:
RGW: v0.94.6 on CentOS 6.7 (Apache + FastCGI)
Ceph cluster (OSD): v0.94.3.3
Updated by Yehuda Sadeh over 7 years ago
- Project changed from rgw to Ceph
- Status changed from New to Duplicate
Updated by Yehuda Sadeh over 7 years ago
- Related to Bug #16122: msg: upper 32-bits of message sequence get lost added