Bug #22895

radosgw stops working after all rgw `RGWAsyncRadosProcessor` threads had timed out

Added by Amine Liu over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Reads and writes run continuously, and this error occurs every day after logrotate's `killall -q -1 ...`.

net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
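For reference, the HUP in question comes from Ceph's packaged logrotate rule. A sketch of what `/etc/logrotate.d/ceph` typically contains (exact options vary by release; the `killall` list below matches the one visible in the log further down):

```
/var/log/ceph/*.log {
    rotate 7
    daily
    compress
    sharedscripts
    postrotate
        killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw || true
    endscript
    missingok
    notifempty
}
```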

2018-02-02 03:30:01.282755 7ef30f786700 1 civetweb: 0x7f0316fcc000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0211/myobjects8 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.344941 7ef329fbb700 1 ====== starting new request req=0x7ef329fb5190 =====
2018-02-02 03:30:01.376386 7ef329fbb700 1 ====== req done req=0x7ef329fb5190 op status=0 http_status=200 ======
2018-02-02 03:30:01.376962 7ef329fbb700 1 civetweb: 0x7f0316eb1000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0212/myobjects2 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.385675 7ef327fb7700 1 ====== req done req=0x7ef327fb1190 op status=0 http_status=200 ======
2018-02-02 03:30:01.385848 7ef327fb7700 1 civetweb: 0x7f0316ecf000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "PUT /mix4mbrwx0212/myobjects5026 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.394224 7ef327fb7700 1 ====== starting new request req=0x7ef327fb1190 =====
2018-02-02 03:30:01.435574 7ef31bf9f700 1 ====== req done req=0x7ef31bf99190 op status=0 http_status=200 ======
2018-02-02 03:30:01.435653 7ef31bf9f700 1 civetweb: 0x7f0316f4b000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "PUT /mix4mbrwx0212/myobjects6717 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.441087 7ef31bf9f700 1 ====== starting new request req=0x7ef31bf99190 =====
2018-02-02 03:30:01.457514 7ef30f786700 1 ====== starting new request req=0x7ef30f780190 =====
2018-02-02 03:30:01.464434 7ef31bf9f700 1 ====== req done req=0x7ef31bf99190 op status=0 http_status=200 ======
2018-02-02 03:30:01.465004 7ef31bf9f700 1 civetweb: 0x7f0316f4b000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0212/myobjects8 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.474509 7ef30f786700 1 ====== req done req=0x7ef30f780190 op status=0 http_status=200 ======
2018-02-02 03:30:01.475065 7ef30f786700 1 civetweb: 0x7f0316fcc000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0211/myobjects9 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.519088 7ef30c780700 1 ====== starting new request req=0x7ef30c77a190 =====
2018-02-02 03:30:01.544811 7ef30c780700 1 ====== req done req=0x7ef30c77a190 op status=0 http_status=200 ======
2018-02-02 03:30:01.545573 7ef30c780700 1 civetweb: 0x7f0316fea000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0211/myobjects8 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.625288 7ef30f786700 1 ====== starting new request req=0x7ef30f780190 =====
2018-02-02 03:30:01.629175 7ef332fcd700 -1 received signal: Hangup from PID: 1082590 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw UID: 0

2018-02-02 03:30:01.629285 7ef2fef65700 1 rgw realm reloader: Pausing frontends for realm update...
2018-02-02 03:30:01.631614 7ef329fbb700 1 ====== starting new request req=0x7ef329fb5190 =====
2018-02-02 03:30:01.639825 7ef31bf9f700 1 ====== starting new request req=0x7ef31bf99190 =====
2018-02-02 03:30:01.642634 7ef31bf9f700 1 ====== req done req=0x7ef31bf99190 op status=-2 http_status=204 ======
2018-02-02 03:30:01.642726 7ef31bf9f700 1 civetweb: 0x7f0316f4b000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "DELETE /mix4mbrwx0212/myobjects8962 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.643210 7ef30f786700 1 ====== req done req=0x7ef30f780190 op status=0 http_status=200 ======
2018-02-02 03:30:01.644189 7ef30f786700 1 civetweb: 0x7f0316fcc000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0211/myobjects4 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.649796 7ef31bf9f700 1 ====== starting new request req=0x7ef31bf99190 =====
2018-02-02 03:30:01.663238 7ef329fbb700 1 ====== req done req=0x7ef329fb5190 op status=0 http_status=200 ======
2018-02-02 03:30:01.663784 7ef329fbb700 1 civetweb: 0x7f0316eb1000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0212/myobjects2 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.667091 7ef31bf9f700 1 ====== req done req=0x7ef31bf99190 op status=0 http_status=200 ======
2018-02-02 03:30:01.667487 7ef31bf9f700 1 civetweb: 0x7f0316f4b000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0211/myobjects8 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.698675 7ef327fb7700 1 ====== req done req=0x7ef327fb1190 op status=0 http_status=200 ======
2018-02-02 03:30:01.698840 7ef2fef65700 1 rgw realm reloader: Frontends paused
2018-02-02 03:30:01.698819 7ef327fb7700 1 civetweb: 0x7f0316ecf000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "PUT /mix4mbrwx0211/myobjects9767 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:34:16.634502 7f02f314f700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7ef3419f0700' had timed out after 600
2018-02-02 03:34:16.634531 7f02f314f700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7ef349a00700' had timed out after 600
2018-02-02 03:34:21.634628 7f02f314f700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7ef3419f0700' had timed out after 600
2018-02-02 03:34:21.634655 7f02f314f700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7ef349a00700' had timed out after 600
2018-02-02 03:34:26.634797 7f02f314f700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7ef3419f0700' had timed out after 600
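The heartbeat_map messages above mean a RGWAsyncRadosProcessor worker thread has not reported progress within its 600-second grace period. A minimal sketch of that watchdog pattern (illustrative only, not the actual Ceph HeartbeatMap code):

```python
import time

class HeartbeatMap:
    """Each worker calls touch() as it makes progress; a monitor thread
    periodically calls is_healthy() and logs any worker whose last touch
    is older than the grace period."""

    def __init__(self, grace_s):
        self.grace_s = grace_s
        self.last_touch = {}  # thread name -> timestamp of last progress

    def touch(self, name, now=None):
        self.last_touch[name] = time.monotonic() if now is None else now

    def is_healthy(self, name, now=None):
        now = time.monotonic() if now is None else now
        return now - self.last_touch[name] <= self.grace_s
```

A thread that blocks forever (e.g. on a remote call that never returns) stops touching the map, so the monitor keeps printing "had timed out after 600" every few seconds, as seen in the log.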

master-ceph.conf View - the rgw of master zone (1.47 KB) Amine Liu, 02/02/2018 02:04 AM

slave-ceph.conf View - the rgw of slave zone (1.31 KB) Amine Liu, 02/02/2018 02:05 AM

slave-node1-rgw-log.tar.bz2 - after SIGHUP, rgw on all slave nodes crashed. (71.5 KB) Amine Liu, 02/06/2018 03:21 AM

master-node1-rgw-log.tar.bz2 - the master nodes did not crash. (578 KB) Amine Liu, 02/06/2018 03:22 AM


Related issues

Related to rgw - Bug #22775: rgw: multisite: the huge object sync is slow in unstable network environment Fix Under Review 01/23/2018
Related to rgw - Backport #22365: luminous: log rotate causes rgw realm reload Resolved

History

#1 Updated by John Spray over 3 years ago

  • Project changed from Ceph to rgw

#2 Updated by Amine Liu over 3 years ago

2018-02-06 10:11:04.024052 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f7913bc1800:20RGWContinuousLeaseCR: operate()
2018-02-06 10:11:04.024388 7f77ce713700 20 cr:s=0x7f78f9b12d80:op=0x7f791b2c0000:18RGWMetaSyncShardCR: operate()
2018-02-06 10:11:04.024395 7f77ce713700 20 run: stack=0x7f78f9b12d80 is_blocked_by_stack()=0 is_sleeping=1 waiting_for_child()=0
2018-02-06 10:11:04.024560 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f79582cd600:20RGWSimpleRadosLockCR: operate()
2018-02-06 10:11:04.024624 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f79582cd600:20RGWSimpleRadosLockCR: operate()
2018-02-06 10:11:04.024860 7f77ce713700 20 enqueued request req=0x7f79144c6270
2018-02-06 10:11:04.024866 7f77ce713700 20 RGWWQ:
2018-02-06 10:11:04.024867 7f77ce713700 20 req: 0x7f79144c6270
2018-02-06 10:11:04.024873 7f77ce713700 20 run: stack=0x7f79753e3d40 is io blocked
2018-02-06 10:11:04.024973 7f77dcf30700 20 dequeued request req=0x7f79144c6270
2018-02-06 10:11:04.024983 7f77dcf30700 20 RGWWQ: empty
2018-02-06 10:11:04.029662 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f79582cd600:20RGWSimpleRadosLockCR: operate()
2018-02-06 10:11:04.029747 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f79582cd600:20RGWSimpleRadosLockCR: operate() returned r=-16
2018-02-06 10:11:04.029761 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f7913bc1800:20RGWContinuousLeaseCR: operate()
2018-02-06 10:11:04.029764 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f7913bc1800:20RGWContinuousLeaseCR: couldn't lock imu-shxxy.rgw.log:mdlog.sync-status.shard.11:sync_lock: retcode=-16
2018-02-06 10:11:04.029823 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f7913bc1800:20RGWContinuousLeaseCR: operate() returned r=-16
2018-02-06 10:11:04.029838 7f77ce713700 20 stack->operate() returned ret=-16
2018-02-06 10:11:04.029841 7f77ce713700 20 run: stack=0x7f79753e3d40 is done

[root@xxy-2f201r-ceph-s3-c1-03 ~]# rados -p imu-shxxy.rgw.log lock info mdlog.sync-status.shard.11 sync_lock
{"name":"sync_lock","type":"exclusive","tag":"","lockers":[{"name":"client.1210710","cookie":"VtYSbREzuEYKHnG","description":"","expiration":"2018-02-06 10:17:58.081622","addr":"172.18.216.97:0/3841642670"}]}

(the same holder was reported on several repeated runs; on a later run the lock had moved to another client:)

[root@xxy-2f201r-ceph-s3-c1-03 ~]# rados -p imu-shxxy.rgw.log lock info mdlog.sync-status.shard.11 sync_lock
{"name":"sync_lock","type":"exclusive","tag":"","lockers":[{"name":"client.1196288","cookie":"QBM_SjD111OM6jt","description":"","expiration":"2018-02-06 10:19:58.203940","addr":"172.18.216.129:0/4218311897"}]}
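The `lock info` JSON above can be inspected programmatically. A small sketch (Python, not a Ceph tool) that parses the output and reports the current holder and expiry:

```python
import json

def describe_lock(lock_info_json):
    """Parse `rados lock info` JSON and summarize who holds the lock."""
    info = json.loads(lock_info_json)
    holders = [(l["name"], l["expiration"]) for l in info["lockers"]]
    return info["name"], info["type"], holders

# Sample taken verbatim from the output above.
sample = '{"name":"sync_lock","type":"exclusive","tag":"","lockers":[{"name":"client.1210710","cookie":"VtYSbREzuEYKHnG","description":"","expiration":"2018-02-06 10:17:58.081622","addr":"172.18.216.97:0/3841642670"}]}'
```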

#3 Updated by Amine Liu over 3 years ago

Is this a deadlock problem?

#4 Updated by Amine Liu over 3 years ago

I removed the logrotate script, and the next day there were still timeouts on the slave nodes, but no hang.
But when I manually ran the logrotate script, rgw on all slave nodes hung and crashed.

#5 Updated by Amine Liu over 3 years ago

The timeouts may be due to inconsistencies between the two zones, because they did not reappear after I deleted the inconsistent data.

http://tracker.ceph.com/issues/22804

#6 Updated by Casey Bodley over 3 years ago

The hangs in RGWAsyncRadosProcessor could be due to requests sent to another gateway that is also paused for realm reconfiguration. Adding a timeout to the curl requests would probably help here.
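A sketch of that idea (illustrative Python, not the actual rgw change, which would set a timeout on its libcurl requests): bound any blocking cross-gateway call so a peer that is paused for realm reconfiguration cannot stall a worker thread past the heartbeat grace period.

```python
# Illustrative only: wrap a blocking call so it gives up after timeout_s
# seconds instead of hanging the calling worker thread indefinitely.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args), raising FutureTimeout if it exceeds timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args).result(timeout=timeout_s)
    finally:
        # Don't wait for a stuck call to finish before returning control.
        pool.shutdown(wait=False)
```

With a bound like this, a request to a paused peer fails fast and the worker keeps touching the heartbeat map instead of tripping the 600-second timeout.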

#7 Updated by Casey Bodley over 3 years ago

  • Related to Bug #22775: rgw: multisite: the huge object sync is slow in unstable network environment added

#8 Updated by Yehuda Sadeh over 3 years ago

  • Related to Backport #22365: luminous: log rotate causes rgw realm reload added

#9 Updated by Yehuda Sadeh over 3 years ago

Abhishek pointed out that it could be a duplicate of this bug that was fixed in 12.2.3:
http://tracker.ceph.com/issues/22365

#10 Updated by Orit Wasserman about 3 years ago

  • Status changed from New to Triaged

#11 Updated by Amine Liu about 3 years ago

Yehuda Sadeh wrote:

Abhishek pointed out that it could be a duplicate of this bug that was fixed in 12.2.3:
http://tracker.ceph.com/issues/22365

Yes, it is fixed.

#12 Updated by Abhishek Lekshmanan about 3 years ago

  • Status changed from Triaged to Resolved
