Bug #37677
keystone s3 auth can deadlock rgw
0%
Description
With radosgw v12.2.8 we experience radosgw processes deadlocking when keystone s3 auth is enabled.
The rate of stuck rgw's seems to be related to the client load.
Symptoms of a deadlocked rgw include...
High cpu and loadavg:
# uptime 21:56:40 up 5 days, 10:57, 2 users, load average: 59.35, 37.68, 19.74 # top 232621 ceph 20 0 6671532 1.0g 16820 S 400.0 14.4 2101:47 radosgw
Connections to keystone host yyy stuck in CLOSE_WAIT:
tcp 1 0 xxx:54526 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54520 yyy:443 CLOSE_WAIT 232621/radosgw tcp 0 0 xxx:64260 yyy:443 ESTABLISHED 232621/radosgw tcp 348 0 xxx:63817 yyy:443 CLOSE_WAIT 232621/radosgw tcp 348 0 xxx:64098 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54496 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54522 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54506 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54508 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54524 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54510 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54486 yyy:443 CLOSE_WAIT 232621/radosgw tcp 348 0 xxx:54490 yyy:443 CLOSE_WAIT 232621/radosgw tcp 0 0 xxx:64270 yyy:443 ESTABLISHED 232621/radosgw tcp 1 0 xxx:54514 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54562 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54556 yyy:443 CLOSE_WAIT 232621/radosgw tcp 348 0 xxx:63812 yyy:443 CLOSE_WAIT 232621/radosgw tcp 348 0 xxx:64106 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54516 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54518 yyy:443 CLOSE_WAIT 232621/radosgw tcp 348 0 xxx:63816 yyy:443 CLOSE_WAIT 232621/radosgw tcp 1 0 xxx:54512 yyy:443 CLOSE_WAIT 232621/radosgw tcp 348 0 xxx:63808 yyy:443 CLOSE_WAIT 232621/radosgw tcp 0 0 xxx:64246 yyy:443 ESTABLISHED 232621/radosgw
rgw http client timing out or seeing "end of file":
2018-12-17 20:06:44.140587 7f4769b57700 0 curl_easy_perform returned status 35 error: Encountered end of file 2018-12-17 20:06:44.143957 7f48276fc700 0 curl_easy_perform returned status 35 error: Encountered end of file 2018-12-17 20:06:54.499552 7f4803eb5700 0 curl_easy_perform returned status 28 error: timed out before SSL handshake 2018-12-17 20:06:54.500168 7f4761346700 0 curl_easy_perform returned status 28 error: timed out before SSL handshake
Our rgw config is:
[client.rgw.cephxxx] nss db path = /var/lib/ceph/nss objecter inflight ops = 1024 rgw content length compat = true rgw dns name = s3.xxx rgw dns s3website name = s3-website.xxx rgw enable ops log = false rgw enable static website = true rgw enable usage log = false rgw expose bucket = true rgw frontends = civetweb port=[::]:8080 num_threads=512 error_log_file=/var/log/radosgw/civetweb.error.log rgw keystone admin domain = default rgw keystone admin project = services rgw keystone admin user = keysvcs3 rgw keystone api version = 3 rgw keystone revocation interval = 86400 rgw keystone url = https://yyy rgw lifecycle work time = 00:00-23:59 rgw num rados handles = 1 rgw print continue = false rgw remote addr param = HTTP_X_FORWARDED_FOR rgw resolve cname = true rgw s3 auth order = local, external rgw s3 auth use keystone = true rgw swift url = http://s3.xxx rgw thread pool size = 512 rgw keystone admin password = xxx
When rgw's get stuck like this, will need to `pkill -9 radosgw`.
Platform is Centos 7.6, with libcurl-7.29.0-51.el7.x86_64 and nss-3.36.0-7.el7_5.cern.x86_64.
History
#1 Updated by Dan van der Ster over 5 years ago
I've posted two gcore files of stuck radosgw's:
ceph-post-file: 02a8dac1-a29e-4dd0-91ea-9eee10309b5c
ceph-post-file: e7de70b6-3c00-4131-aaef-f20dcfed184f
In each we see >200 threads locked in RGWHTTPClient::process -> curl_easy_perform (querying keystone) e.g.:
Thread 564 (Thread 0x7f045d8a0700 (LWP 213133)): #0 0x00007f05868604ed in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f058685be01 in _L_lock_1093 () from /lib64/libpthread.so.0 #2 0x00007f058685bda2 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f057c8a1e19 in PR_Lock () from /lib64/libnspr4.so #4 0x00007f057c14c3cc in nssSlot_GetToken () from /lib64/libnss3.so #5 0x00007f057c146821 in nssTrustDomain_FindTrustForCertificate () from /lib64/libnss3.so #6 0x00007f057c14b0d2 in stan_GetCERTCertificate () from /lib64/libnss3.so #7 0x00007f057c141980 in CERT_NewTempCertificate () from /lib64/libnss3.so #8 0x00007f057c643e0f in ssl3_CompleteHandleCertificate () from /lib64/libssl3.so #9 0x00007f057c645ea7 in ssl3_HandleHandshakeMessage () from /lib64/libssl3.so #10 0x00007f057c648927 in ssl3_HandleNonApplicationData () from /lib64/libssl3.so #11 0x00007f057c6498d7 in ssl3_HandleRecord () from /lib64/libssl3.so #12 0x00007f057c64ac73 in ssl3_GatherCompleteHandshake () from /lib64/libssl3.so #13 0x00007f057c651feb in SSL_ForceHandshake () from /lib64/libssl3.so #14 0x00007f058730fb43 in nss_connect_common () from /lib64/libcurl.so.4 #15 0x00007f0587306aee in Curl_ssl_connect_nonblocking () from /lib64/libcurl.so.4 #16 0x00007f05872dd62e in https_connecting () from /lib64/libcurl.so.4 #17 0x00007f05873004d8 in multi_runsingle () from /lib64/libcurl.so.4 #18 0x00007f05873013b1 in curl_multi_perform () from /lib64/libcurl.so.4 #19 0x00007f05872f8623 in curl_easy_perform () from /lib64/libcurl.so.4 #20 0x000055d9cece9dea in RGWHTTPClient::process (this=this@entry=0x7f045d898050, method=method@entry=0x55d9cf002efe "POST", url=0x55d9df723a08 "https://keystone.cxx/v3/s3tokens") at /usr/src/debug/ceph-12.2.8/src/rgw/rgw_http_client.cc:423
#2 Updated by Matt Benjamin over 5 years ago
- Assignee set to Marcus Watts
@marcus, could you have a look at this?
Matt
#3 Updated by Beom-Seok Park about 5 years ago
Any updates?
I have the same problem with CentOS 7.6, ceph v12.2.11, libcurl-7.29.0-51.el7.x86_64 and nss-3.36.0-7.1.el7_6.x86_64
There are a lot of locked threads in RGWHTTPClient::process -> curl_easy_perform
Thread 108 (Thread 0x7f2c32852700 (LWP 100259)): #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007f2c76872e01 in _L_lock_1093 () from /lib64/libpthread.so.0 #2 0x00007f2c76872da2 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55dd80c65340) at ../nptl/pthread_mutex_lock.c:133 #3 0x00007f2c6c8e8e19 in PR_Lock (lock=0x55dd80c65340) at ../../../nspr/pr/src/pthreads/ptsynch.c:171 #4 0x00007f2c77325b30 in nss_find_slot_by_name (slot_name=slot_name@entry=0x55dde45fe100 "PEM Token #0") at nss.c:359 #5 0x00007f2c772ee529 in nss_create_object (obj_class=obj_class@entry=1, filename=filename@entry=0x55ddd7979dd0 "/etc/pki/tls/certs/ca-bundle.crt", cacert=<optimized out>, ssl=0x55dddd61d9e0, ssl=0x55dddd61d9e0) at nss.c:387 #6 0x00007f2c772ee699 in nss_load_cert (ssl=0x55dddd61d9e0, filename=0x55ddd7979dd0 "/etc/pki/tls/certs/ca-bundle.crt", cacert=1) at nss.c:457 #7 0x00007f2c773275bd in nss_load_ca_certificates (sockindex=0, conn=0x55dddd61d800) at nss.c:1243 #8 nss_setup_connect (sockindex=0, conn=0x55dddd61d800) at nss.c:1499 #9 nss_connect_common (conn=conn@entry=0x55dddd61d800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3284984d) at nss.c:1672 #10 0x00007f2c77327d55 in Curl_nss_connect_nonblocking (conn=conn@entry=0x55dddd61d800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3284984d) at nss.c:1725 #11 0x00007f2c7731daee in Curl_ssl_connect_nonblocking (conn=conn@entry=0x55dddd61d800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3284984d) at sslgen.c:229 #12 0x00007f2c772f4ddd in https_connecting (done=0x7f2c3284984d, conn=0x55dddd61d800) at http.c:1352 #13 Curl_http_connect (conn=0x55dddd61d800, done=0x7f2c3284984d) at http.c:1322 #14 0x00007f2c77304735 in Curl_protocol_connect (conn=0x55dddd61d800, protocol_done=protocol_done@entry=0x7f2c3284984d) at url.c:3328 #15 0x00007f2c77317e5b in multi_runsingle (multi=multi@entry=0x55ddae7d4700, now=..., easy=easy@entry=0x55ddd7163500) at multi.c:1159 #16 0x00007f2c773183b1 in curl_multi_perform (multi_handle=multi_handle@entry=0x55ddae7d4700, running_handles=running_handles@entry=0x7f2c32849940) at multi.c:1757 #17 0x00007f2c7730f623 in curl_easy_perform (easy=0x55ddb93a8000) at easy.c:480 #18 0x000055dd7efb0faa in RGWHTTPClient::process (this=this@entry=0x7f2c3284a050, method=method@entry=0x55dd7f2e0f1e "POST", url=0x55ddd7162df8 "https://keystone.server/v3/s3tokens") at /usr/src/debug/ceph-12.2.11/src/rgw/rgw_http_client.cc:424
Thread 110 (Thread 0x7f2c31850700 (LWP 100261)): #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007f2c76872e01 in _L_lock_1093 () from /lib64/libpthread.so.0 #2 0x00007f2c76872da2 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55dd80d57a20) at ../nptl/pthread_mutex_lock.c:133 #3 0x00007f2c6c8e8e19 in PR_Lock (lock=0x55dd80d57a20) at ../../../nspr/pr/src/pthreads/ptsynch.c:171 #4 0x00007f2c6c178059 in PK11_EnterSlotMonitor (slot=slot@entry=0x55dd86fa8380) at pk11slot.c:479 #5 0x00007f2c6c169271 in PK11_DestroyObject (slot=0x55dd86fa8380, object=323518) at pk11obj.c:54 #6 0x00007f2c6c16c039 in PK11_DestroyGenericObject (object=0x55dde7f57a40) at pk11obj.c:1590 #7 0x00007f2c77314eef in Curl_llist_remove (list=list@entry=0x55ddbbe0a040, e=0x55ddffde3040, user=user@entry=0x0) at llist.c:132 #8 0x00007f2c77314f8f in Curl_llist_destroy (list=0x55ddbbe0a040, user=0x0) at llist.c:149 #9 0x00007f2c773267e2 in nss_fail_connect (data=data@entry=0x55dddfbc0000, curlerr=CURLE_SSL_CONNECT_ERROR, connssl=0x55dde0fc69e0) at nss.c:1362 #10 0x00007f2c77326da6 in nss_do_connect (sockindex=0, conn=0x55dde0fc6800) at nss.c:1654 #11 nss_connect_common (conn=conn@entry=0x55dde0fc6800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3184784d) at nss.c:1685 #12 0x00007f2c77327d55 in Curl_nss_connect_nonblocking (conn=conn@entry=0x55dde0fc6800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3184784d) at nss.c:1725 #13 0x00007f2c7731daee in Curl_ssl_connect_nonblocking (conn=conn@entry=0x55dde0fc6800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3184784d) at sslgen.c:229 #14 0x00007f2c772f4ddd in https_connecting (done=0x7f2c3184784d, conn=0x55dde0fc6800) at http.c:1352 #15 Curl_http_connect (conn=0x55dde0fc6800, done=0x7f2c3184784d) at http.c:1322 #16 0x00007f2c77304735 in Curl_protocol_connect (conn=0x55dde0fc6800, protocol_done=protocol_done@entry=0x7f2c3184784d) at url.c:3328 #17 0x00007f2c77317e5b in multi_runsingle (multi=multi@entry=0x55ddd0cc40e0, now=..., easy=easy@entry=0x55dd95a28ea0) at multi.c:1159 #18 0x00007f2c773183b1 in curl_multi_perform (multi_handle=multi_handle@entry=0x55ddd0cc40e0, running_handles=running_handles@entry=0x7f2c31847940) at multi.c:1757 #19 0x00007f2c7730f623 in curl_easy_perform (easy=0x55dddfbc0000) at easy.c:480 #20 0x000055dd7efb0faa in RGWHTTPClient::process (this=this@entry=0x7f2c31848050, method=method@entry=0x55dd7f2e0f1e "POST", url=0x55dd95a29998 "https://keystone.server/v3/s3tokens") at /usr/src/debug/ceph-12.2.11/src/rgw/rgw_http_client.cc:424
.
.
.
Thread 4124 (Thread 0x7f24598a2700 (LWP 104276)): #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007f2c76872e01 in _L_lock_1093 () from /lib64/libpthread.so.0 #2 0x00007f2c76872da2 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55dd80d57a20) at ../nptl/pthread_mutex_lock.c:133 #3 0x00007f2c6c8e8e19 in PR_Lock (lock=0x55dd80d57a20) at ../../../nspr/pr/src/pthreads/ptsynch.c:171 #4 0x00007f2c6c19318e in nssSlot_EnterMonitor (slot=slot@entry=0x55dd86fdc830) at devslot.c:48 #5 0x00007f2c6c19354c in nssSlot_GetToken (slot=0x55dd86fdc830) at devslot.c:288 #6 0x00007f2c6c18d3da in nssTrustDomain_FindCertificatesBySubject (td=td@entry=0x55dd80ac9030, subject=subject@entry=0x55dd8701e8f8, rvOpt=rvOpt@entry=0x0, maximumOpt=maximumOpt@entry=0, arenaOpt=arenaOpt@entry=0x55ddd90d2240) at trustdomain.c:605 #7 0x00007f2c6c18c05e in find_cert_issuer (cc=0x55dd80c60030, td=0x55dd80ac9030, policiesOpt=0x0, usage=0x7f2458896eb0, timeOpt=0x55de11151b10, c=0x55dd8701e888) at certificate.c:382 #8 nssCertificate_BuildChain (c=c@entry=0x55dd8701e888, timeOpt=timeOpt@entry=0x55de11151b10, usage=usage@entry=0x7f2458896f40, policiesOpt=policiesOpt@entry=0x0, rvOpt=rvOpt@entry=0x7f2458896f50, rvLimit=rvLimit@entry=2, arenaOpt=0x0, statusOpt=0x7f2458896f3c, td=0x55dd80ac9030, cc=0x55dd80c60030) at certificate.c:456 #9 0x00007f2c6c18c355 in NSSCertificate_BuildChain (c=c@entry=0x55dd8701e888, timeOpt=timeOpt@entry=0x55de11151b10, usage=usage@entry=0x7f2458896f40, policiesOpt=policiesOpt@entry=0x0, rvOpt=rvOpt@entry=0x7f2458896f50, rvLimit=rvLimit@entry=2, arenaOpt=arenaOpt@entry=0x0, statusOpt=statusOpt@entry=0x7f2458896f3c, td=td@entry=0x55dd80ac9030, cc=0x55dd80c60030) at certificate.c:501 #10 0x00007f2c6c1450b0 in CERT_FindCertIssuer (cert=cert@entry=0x55dd8701c020, validTime=validTime@entry=1552737975413268, usage=usage@entry=certUsageSSLServer) at certvfy.c:250 #11 0x00007f2c6c145659 in cert_VerifyCertChainOld (revoked=0x0, log=0x0, wincx=0x0, t=1552737975413268, certUsage=certUsageSSLServer, sigerror=0x0, checkSig=1, cert=0x55dd8701c020, handle=0x55dd80ac9030) at certvfy.c:689 #12 cert_VerifyCertChain (handle=handle@entry=0x55dd80ac9030, cert=cert@entry=0x55dd8701c020, checkSig=checkSig@entry=1, sigerror=sigerror@entry=0x0, certUsage=certUsage@entry=certUsageSSLServer, t=t@entry=1552737975413268, wincx=wincx@entry=0x0, log=log@entry=0x0, revoked=revoked@entry=0x0) at certvfy.c:949 #13 0x00007f2c6c145f68 in CERT_VerifyCertChain (handle=handle@entry=0x55dd80ac9030, cert=cert@entry=0x55dd8701c020, checkSig=checkSig@entry=1, certUsage=certUsage@entry=certUsageSSLServer, t=t@entry=1552737975413268, wincx=wincx@entry=0x0, log=log@entry=0x0) at certvfy.c:958 #14 0x00007f2c6c146c76 in cert_VerifyCertWithFlags (handle=handle@entry=0x55dd80ac9030, cert=0x55dd8701c020, checkSig=checkSig@entry=1, certUsage=certUsageSSLServer, t=t@entry=1552737975413268, flags=flags@entry=0, wincx=0x0, log=log@entry=0x0) at certvfy.c:1595 #15 0x00007f2c6c146e3a in CERT_VerifyCert (handle=handle@entry=0x55dd80ac9030, cert=<optimized out>, checkSig=checkSig@entry=1, certUsage=<optimized out>, t=t@entry=1552737975413268, wincx=<optimized out>, log=log@entry=0x0) at certvfy.c:1504 #16 0x00007f2c6c69273f in SSL_AuthCertificate (arg=0x55dd80ac9030, fd=<optimized out>, checkSig=1, isServer=0) at sslauth.c:273 #17 0x00007f2c6c68a860 in ssl3_AuthCertificate (ss=ss@entry=0x55dda0369000) at ssl3con.c:10388 #18 0x00007f2c6c68aeb5 in ssl3_CompleteHandleCertificate (ss=ss@entry=0x55dda0369000, b=0x55ddb9434b41 "ico8897b7xz7d\016", b@entry=0x55ddb9434004 "", length=0, length@entry=2877) at ssl3con.c:10343 #19 0x00007f2c6c68cea7 in ssl3_HandleCertificate (length=2877, b=0x55ddb9434004 "", ss=0x55dda0369000) at ssl3con.c:10211 #20 ssl3_HandlePostHelloHandshakeMessage (length=2877, b=0x55ddb9434004 "", ss=0x55dda0369000) at ssl3con.c:11405 #21 ssl3_HandleHandshakeMessage (ss=ss@entry=0x55dda0369000, b=0x55ddb9434004 "", length=2877, endOfRecord=1) at ssl3con.c:11352 #22 0x00007f2c6c68f927 in ssl3_HandleHandshake (origBuf=0x55dda0369288, ss=0x55dda0369000) at ssl3con.c:11533 #23 ssl3_HandleNonApplicationData (ss=ss@entry=0x55dda0369000, rType=<optimized out>, epoch=<optimized out>, seqNum=<optimized out>, databuf=databuf@entry=0x55dda0369288) at ssl3con.c:12047 #24 0x00007f2c6c6908d7 in ssl3_HandleRecord (ss=ss@entry=0x55dda0369000, cText=cText@entry=0x7f24588975a0, databuf=databuf@entry=0x55dda0369288) at ssl3con.c:12318 #25 0x00007f2c6c691c73 in ssl3_GatherCompleteHandshake (ss=ss@entry=0x55dda0369000, flags=flags@entry=0) at ssl3gthr.c:511 #26 0x00007f2c6c698feb in SSL_ForceHandshake (fd=<optimized out>) at sslsecur.c:399 #27 0x00007f2c6c699126 in SSL_ForceHandshakeWithTimeout (fd=<optimized out>, timeout=<optimized out>) at sslsecur.c:428 #28 0x00007f2c77326b43 in nss_do_connect (sockindex=0, conn=0x55ddf098d800) at nss.c:1619 #29 nss_connect_common (conn=conn@entry=0x55ddf098d800, sockindex=sockindex@entry=0, done=done@entry=0x7f245889784d) at nss.c:1685 #30 0x00007f2c77327d55 in Curl_nss_connect_nonblocking (conn=conn@entry=0x55ddf098d800, sockindex=sockindex@entry=0, done=done@entry=0x7f245889784d) at nss.c:1725 #31 0x00007f2c7731daee in Curl_ssl_connect_nonblocking (conn=conn@entry=0x55ddf098d800, sockindex=sockindex@entry=0, done=0x7f245889784d) at sslgen.c:229 #32 0x00007f2c772f462e in https_connecting (conn=0x55ddf098d800, done=<optimized out>) at http.c:1352 #33 0x00007f2c773174d8 in multi_runsingle (multi=multi@entry=0x55ddb6910c40, now=..., easy=easy@entry=0x55ddb0520de0) at multi.c:1204 #34 0x00007f2c773183b1 in curl_multi_perform (multi_handle=multi_handle@entry=0x55ddb6910c40, running_handles=running_handles@entry=0x7f2458897940) at multi.c:1757 #35 0x00007f2c7730f623 in curl_easy_perform (easy=0x55dd8e72e000) at easy.c:480 #36 0x000055dd7efb0faa in RGWHTTPClient::process (this=this@entry=0x7f2458898050, method=method@entry=0x55dd7f2e0f1e "POST", url=0x55ddc76cd2d8 "https://keystone.server/v3/s3tokens") at /usr/src/debug/ceph-12.2.11/src/rgw/rgw_http_client.cc:424
I temporarily changed the keystone url to http. As a result, the problem is gone.