Bug #22895

radosgw stops working after all rgw `RGWAsyncRadosProcessor` threads had timed out

Added by Amine Liu over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Reads and writes run continuously, and this error occurs every day after logrotate's `killall -q -1 ...`.

net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
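For reference, the HUP in question comes from Ceph's packaged logrotate rule. A sketch of what `/etc/logrotate.d/ceph` typically contains (exact options vary by release; the `killall` list below matches the one visible in the log further down):

```
/var/log/ceph/*.log {
    rotate 7
    daily
    compress
    sharedscripts
    postrotate
        killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw || true
    endscript
    missingok
    notifempty
}
```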

2018-02-02 03:30:01.282755 7ef30f786700 1 civetweb: 0x7f0316fcc000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0211/myobjects8 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.344941 7ef329fbb700 1 ====== starting new request req=0x7ef329fb5190 =====
2018-02-02 03:30:01.376386 7ef329fbb700 1 ====== req done req=0x7ef329fb5190 op status=0 http_status=200 ======
2018-02-02 03:30:01.376962 7ef329fbb700 1 civetweb: 0x7f0316eb1000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0212/myobjects2 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.385675 7ef327fb7700 1 ====== req done req=0x7ef327fb1190 op status=0 http_status=200 ======
2018-02-02 03:30:01.385848 7ef327fb7700 1 civetweb: 0x7f0316ecf000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "PUT /mix4mbrwx0212/myobjects5026 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.394224 7ef327fb7700 1 ====== starting new request req=0x7ef327fb1190 =====
2018-02-02 03:30:01.435574 7ef31bf9f700 1 ====== req done req=0x7ef31bf99190 op status=0 http_status=200 ======
2018-02-02 03:30:01.435653 7ef31bf9f700 1 civetweb: 0x7f0316f4b000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "PUT /mix4mbrwx0212/myobjects6717 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.441087 7ef31bf9f700 1 ====== starting new request req=0x7ef31bf99190 =====
2018-02-02 03:30:01.457514 7ef30f786700 1 ====== starting new request req=0x7ef30f780190 =====
2018-02-02 03:30:01.464434 7ef31bf9f700 1 ====== req done req=0x7ef31bf99190 op status=0 http_status=200 ======
2018-02-02 03:30:01.465004 7ef31bf9f700 1 civetweb: 0x7f0316f4b000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0212/myobjects8 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.474509 7ef30f786700 1 ====== req done req=0x7ef30f780190 op status=0 http_status=200 ======
2018-02-02 03:30:01.475065 7ef30f786700 1 civetweb: 0x7f0316fcc000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0211/myobjects9 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.519088 7ef30c780700 1 ====== starting new request req=0x7ef30c77a190 =====
2018-02-02 03:30:01.544811 7ef30c780700 1 ====== req done req=0x7ef30c77a190 op status=0 http_status=200 ======
2018-02-02 03:30:01.545573 7ef30c780700 1 civetweb: 0x7f0316fea000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0211/myobjects8 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.625288 7ef30f786700 1 ====== starting new request req=0x7ef30f780190 =====
2018-02-02 03:30:01.629175 7ef332fcd700 -1 received signal: Hangup from PID: 1082590 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw UID: 0

2018-02-02 03:30:01.629285 7ef2fef65700 1 rgw realm reloader: Pausing frontends for realm update...
2018-02-02 03:30:01.631614 7ef329fbb700 1 ====== starting new request req=0x7ef329fb5190 =====
2018-02-02 03:30:01.639825 7ef31bf9f700 1 ====== starting new request req=0x7ef31bf99190 =====
2018-02-02 03:30:01.642634 7ef31bf9f700 1 ====== req done req=0x7ef31bf99190 op status=-2 http_status=204 ======
2018-02-02 03:30:01.642726 7ef31bf9f700 1 civetweb: 0x7f0316f4b000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "DELETE /mix4mbrwx0212/myobjects8962 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.643210 7ef30f786700 1 ====== req done req=0x7ef30f780190 op status=0 http_status=200 ======
2018-02-02 03:30:01.644189 7ef30f786700 1 civetweb: 0x7f0316fcc000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0211/myobjects4 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.649796 7ef31bf9f700 1 ====== starting new request req=0x7ef31bf99190 =====
2018-02-02 03:30:01.663238 7ef329fbb700 1 ====== req done req=0x7ef329fb5190 op status=0 http_status=200 ======
2018-02-02 03:30:01.663784 7ef329fbb700 1 civetweb: 0x7f0316eb1000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0212/myobjects2 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.667091 7ef31bf9f700 1 ====== req done req=0x7ef31bf99190 op status=0 http_status=200 ======
2018-02-02 03:30:01.667487 7ef31bf9f700 1 civetweb: 0x7f0316f4b000: 172.26.99.233 - - [01/Feb/2018:17:01:20 +0800] "GET /mix4mbrwx0211/myobjects8 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:30:01.698675 7ef327fb7700 1 ====== req done req=0x7ef327fb1190 op status=0 http_status=200 ======
2018-02-02 03:30:01.698840 7ef2fef65700 1 rgw realm reloader: Frontends paused
2018-02-02 03:30:01.698819 7ef327fb7700 1 civetweb: 0x7f0316ecf000: 172.26.99.236 - - [01/Feb/2018:17:01:20 +0800] "PUT /mix4mbrwx0211/myobjects9767 HTTP/1.1" 1 0 - aws-sdk-java/1.4.0.1 Linux/3.10.0-327.22.2.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02
2018-02-02 03:34:16.634502 7f02f314f700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7ef3419f0700' had timed out after 600
2018-02-02 03:34:16.634531 7f02f314f700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7ef349a00700' had timed out after 600
2018-02-02 03:34:21.634628 7f02f314f700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7ef3419f0700' had timed out after 600
2018-02-02 03:34:21.634655 7f02f314f700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7ef349a00700' had timed out after 600
2018-02-02 03:34:26.634797 7f02f314f700 1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7ef3419f0700' had timed out after 600
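The heartbeat_map messages above mean a RGWAsyncRadosProcessor worker thread has not reported progress within its 600-second grace period. A minimal sketch of that watchdog pattern (illustrative only, not the actual Ceph HeartbeatMap code):

```python
import time

class HeartbeatMap:
    """Each worker calls touch() as it makes progress; a monitor thread
    periodically calls is_healthy() and logs any worker whose last touch
    is older than the grace period."""

    def __init__(self, grace_s):
        self.grace_s = grace_s
        self.last_touch = {}  # thread name -> timestamp of last progress

    def touch(self, name, now=None):
        self.last_touch[name] = time.monotonic() if now is None else now

    def is_healthy(self, name, now=None):
        now = time.monotonic() if now is None else now
        return now - self.last_touch[name] <= self.grace_s
```

A thread that blocks forever (e.g. on a remote call that never returns) stops touching the map, so the monitor keeps printing "had timed out after 600" every few seconds, as seen in the log.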

master-ceph.conf View - the rgw of master zone (1.47 KB) Amine Liu, 02/02/2018 02:04 AM

slave-ceph.conf View - the rgw of slave zone (1.31 KB) Amine Liu, 02/02/2018 02:05 AM

slave-node1-rgw-log.tar.bz2 - after SIGHUP, rgw on all slave nodes crashed. (71.5 KB) Amine Liu, 02/06/2018 03:21 AM

master-node1-rgw-log.tar.bz2 - the master nodes did not crash. (578 KB) Amine Liu, 02/06/2018 03:22 AM


Related issues

Related to rgw - Bug #22775: rgw: multisite: the huge object sync is slow in unstable network environment Fix Under Review 01/23/2018
Related to rgw - Backport #22365: luminous: log rotate causes rgw realm reload Resolved

History

#1 Updated by John Spray over 3 years ago

  • Project changed from Ceph to rgw

#2 Updated by Amine Liu over 3 years ago

2018-02-06 10:11:04.024052 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f7913bc1800:20RGWContinuousLeaseCR: operate()
2018-02-06 10:11:04.024388 7f77ce713700 20 cr:s=0x7f78f9b12d80:op=0x7f791b2c0000:18RGWMetaSyncShardCR: operate()
2018-02-06 10:11:04.024395 7f77ce713700 20 run: stack=0x7f78f9b12d80 is_blocked_by_stack()=0 is_sleeping=1 waiting_for_child()=0
2018-02-06 10:11:04.024560 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f79582cd600:20RGWSimpleRadosLockCR: operate()
2018-02-06 10:11:04.024624 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f79582cd600:20RGWSimpleRadosLockCR: operate()
2018-02-06 10:11:04.024860 7f77ce713700 20 enqueued request req=0x7f79144c6270
2018-02-06 10:11:04.024866 7f77ce713700 20 RGWWQ:
2018-02-06 10:11:04.024867 7f77ce713700 20 req: 0x7f79144c6270
2018-02-06 10:11:04.024873 7f77ce713700 20 run: stack=0x7f79753e3d40 is io blocked
2018-02-06 10:11:04.024973 7f77dcf30700 20 dequeued request req=0x7f79144c6270
2018-02-06 10:11:04.024983 7f77dcf30700 20 RGWWQ: empty
2018-02-06 10:11:04.029662 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f79582cd600:20RGWSimpleRadosLockCR: operate()
2018-02-06 10:11:04.029747 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f79582cd600:20RGWSimpleRadosLockCR: operate() returned r=-16
2018-02-06 10:11:04.029761 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f7913bc1800:20RGWContinuousLeaseCR: operate()
2018-02-06 10:11:04.029764 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f7913bc1800:20RGWContinuousLeaseCR: couldn't lock imu-shxxy.rgw.log:mdlog.sync-status.shard.11:sync_lock: retcode=-16
2018-02-06 10:11:04.029823 7f77ce713700 20 cr:s=0x7f79753e3d40:op=0x7f7913bc1800:20RGWContinuousLeaseCR: operate() returned r=-16
2018-02-06 10:11:04.029838 7f77ce713700 20 stack->operate() returned ret=-16
2018-02-06 10:11:04.029841 7f77ce713700 20 run: stack=0x7f79753e3d40 is done

[root@xxy-2f201r-ceph-s3-c1-03 ~]# rados -p imu-shxxy.rgw.log lock info mdlog.sync-status.shard.11 sync_lock
{"name":"sync_lock","type":"exclusive","tag":"","lockers":[{"name":"client.1210710","cookie":"VtYSbREzuEYKHnG","description":"","expiration":"2018-02-06 10:17:58.081622","addr":"172.18.216.97:0/3841642670"}]}

(the same holder was reported on several repeated runs; on a later run the lock had moved to another client:)

[root@xxy-2f201r-ceph-s3-c1-03 ~]# rados -p imu-shxxy.rgw.log lock info mdlog.sync-status.shard.11 sync_lock
{"name":"sync_lock","type":"exclusive","tag":"","lockers":[{"name":"client.1196288","cookie":"QBM_SjD111OM6jt","description":"","expiration":"2018-02-06 10:19:58.203940","addr":"172.18.216.129:0/4218311897"}]}
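The `lock info` JSON above can be inspected programmatically. A small sketch (Python, not a Ceph tool) that parses the output and reports the current holder and expiry:

```python
import json

def describe_lock(lock_info_json):
    """Parse `rados lock info` JSON and summarize who holds the lock."""
    info = json.loads(lock_info_json)
    holders = [(l["name"], l["expiration"]) for l in info["lockers"]]
    return info["name"], info["type"], holders

# Sample taken verbatim from the output above.
sample = '{"name":"sync_lock","type":"exclusive","tag":"","lockers":[{"name":"client.1210710","cookie":"VtYSbREzuEYKHnG","description":"","expiration":"2018-02-06 10:17:58.081622","addr":"172.18.216.97:0/3841642670"}]}'
```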

#3 Updated by Amine Liu over 3 years ago

Is this a deadlock problem?

#4 Updated by Amine Liu over 3 years ago

I removed the logrotate script, and the next day there were still timeouts on the slave nodes, but no hang.
But when I manually ran the logrotate script, rgw on all slave nodes hung and crashed.

#5 Updated by Amine Liu over 3 years ago

The timeouts may be due to inconsistencies between the two zones, because they did not reappear after I deleted the inconsistent data.

http://tracker.ceph.com/issues/22804

#6 Updated by Casey Bodley over 3 years ago

The hangs in RGWAsyncRadosProcessor could be due to requests sent to another gateway that is also paused for realm reconfiguration. Adding a timeout to the curl requests would probably help here.
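A sketch of that idea (illustrative Python, not the actual rgw change, which would set a timeout on its libcurl requests): bound any blocking cross-gateway call so a peer that is paused for realm reconfiguration cannot stall a worker thread past the heartbeat grace period.

```python
# Illustrative only: wrap a blocking call so it gives up after timeout_s
# seconds instead of hanging the calling worker thread indefinitely.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args), raising FutureTimeout if it exceeds timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args).result(timeout=timeout_s)
    finally:
        # Don't wait for a stuck call to finish before returning control.
        pool.shutdown(wait=False)
```

With a bound like this, a request to a paused peer fails fast and the worker keeps touching the heartbeat map instead of tripping the 600-second timeout.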

#7 Updated by Casey Bodley over 3 years ago

  • Related to Bug #22775: rgw: multisite: the huge object sync is slow in unstable network environment added

#8 Updated by Yehuda Sadeh over 3 years ago

  • Related to Backport #22365: luminous: log rotate causes rgw realm reload added

#9 Updated by Yehuda Sadeh over 3 years ago

Abhishek pointed out that it could be a duplicate of this bug that was fixed in 12.2.3:
http://tracker.ceph.com/issues/22365

#10 Updated by Orit Wasserman about 3 years ago

  • Status changed from New to Triaged

#11 Updated by Amine Liu about 3 years ago

Yehuda Sadeh wrote:

Abhishek pointed out that it could be a duplicate of this bug that was fixed in 12.2.3:
http://tracker.ceph.com/issues/22365

Yes, it is fixed.

#12 Updated by Abhishek Lekshmanan about 3 years ago

  • Status changed from Triaged to Resolved
