Project

General

Profile

Bug #37677

keystone s3 auth can deadlock rgw

Added by Dan van der Ster almost 2 years ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

With radosgw v12.2.8 we experience radosgw processes deadlocking when keystone s3 auth is enabled.
The rate of stuck rgw's seems to be related to the client load.

Symptoms of a deadlocked rgw include...

High cpu and loadavg:

# uptime
 21:56:40 up 5 days, 10:57,  2 users,  load average: 59.35, 37.68, 19.74

# top
232621 ceph      20   0 6671532   1.0g  16820 S 400.0 14.4   2101:47 radosgw

Connections to keystone host yyy stuck in CLOSE_WAIT:

tcp        1      0 xxx:54526     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54520     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        0      0 xxx:64260     yyy:443       ESTABLISHED 232621/radosgw
tcp      348      0 xxx:63817     yyy:443       CLOSE_WAIT  232621/radosgw
tcp      348      0 xxx:64098     yyy:443       CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54496     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54522     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54506     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54508     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54524     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54510     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54486     yyy:443      CLOSE_WAIT  232621/radosgw
tcp      348      0 xxx:54490     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        0      0 xxx:64270     yyy:443       ESTABLISHED 232621/radosgw
tcp        1      0 xxx:54514     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54562     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54556     yyy:443      CLOSE_WAIT  232621/radosgw
tcp      348      0 xxx:63812     yyy:443       CLOSE_WAIT  232621/radosgw
tcp      348      0 xxx:64106     yyy:443       CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54516     yyy:443      CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54518     yyy:443      CLOSE_WAIT  232621/radosgw
tcp      348      0 xxx:63816     yyy:443       CLOSE_WAIT  232621/radosgw
tcp        1      0 xxx:54512     yyy:443      CLOSE_WAIT  232621/radosgw
tcp      348      0 xxx:63808     yyy:443       CLOSE_WAIT  232621/radosgw
tcp        0      0 xxx:64246     yyy:443       ESTABLISHED 232621/radosgw

rgw http client timing out or seeing "end of file":

2018-12-17 20:06:44.140587 7f4769b57700  0 curl_easy_perform returned status 35 error: Encountered end of file
2018-12-17 20:06:44.143957 7f48276fc700  0 curl_easy_perform returned status 35 error: Encountered end of file
2018-12-17 20:06:54.499552 7f4803eb5700  0 curl_easy_perform returned status 28 error: timed out before SSL handshake
2018-12-17 20:06:54.500168 7f4761346700  0 curl_easy_perform returned status 28 error: timed out before SSL handshake

Our rgw config is:

[client.rgw.cephxxx]
nss db path = /var/lib/ceph/nss
objecter inflight ops = 1024
rgw content length compat = true
rgw dns name = s3.xxx
rgw dns s3website name = s3-website.xxx
rgw enable ops log = false
rgw enable static website = true
rgw enable usage log = false
rgw expose bucket = true
rgw frontends = civetweb port=[::]:8080 num_threads=512 error_log_file=/var/log/radosgw/civetweb.error.log
rgw keystone admin domain = default
rgw keystone admin project = services
rgw keystone admin user = keysvcs3
rgw keystone api version = 3
rgw keystone revocation interval = 86400
rgw keystone url = https://yyy
rgw lifecycle work time = 00:00-23:59
rgw num rados handles = 1
rgw print continue = false
rgw remote addr param = HTTP_X_FORWARDED_FOR
rgw resolve cname = true
rgw s3 auth order = local, external
rgw s3 auth use keystone = true
rgw swift url = http://s3.xxx
rgw thread pool size = 512

rgw keystone admin password = xxx

When rgw's get stuck like this, will need to `pkill -9 radosgw`.

Platform is Centos 7.6, with libcurl-7.29.0-51.el7.x86_64 and nss-3.36.0-7.el7_5.cern.x86_64.

History

#1 Updated by Dan van der Ster almost 2 years ago

I've posted two gcore files of stuck radosgw's:

ceph-post-file: 02a8dac1-a29e-4dd0-91ea-9eee10309b5c
ceph-post-file: e7de70b6-3c00-4131-aaef-f20dcfed184f

In each we see >200 threads locked in RGWHTTPClient::process -> curl_easy_perform (querying keystone) e.g.:

Thread 564 (Thread 0x7f045d8a0700 (LWP 213133)):
#0  0x00007f05868604ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f058685be01 in _L_lock_1093 () from /lib64/libpthread.so.0
#2  0x00007f058685bda2 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f057c8a1e19 in PR_Lock () from /lib64/libnspr4.so
#4  0x00007f057c14c3cc in nssSlot_GetToken () from /lib64/libnss3.so
#5  0x00007f057c146821 in nssTrustDomain_FindTrustForCertificate () from /lib64/libnss3.so
#6  0x00007f057c14b0d2 in stan_GetCERTCertificate () from /lib64/libnss3.so
#7  0x00007f057c141980 in CERT_NewTempCertificate () from /lib64/libnss3.so
#8  0x00007f057c643e0f in ssl3_CompleteHandleCertificate () from /lib64/libssl3.so
#9  0x00007f057c645ea7 in ssl3_HandleHandshakeMessage () from /lib64/libssl3.so
#10 0x00007f057c648927 in ssl3_HandleNonApplicationData () from /lib64/libssl3.so
#11 0x00007f057c6498d7 in ssl3_HandleRecord () from /lib64/libssl3.so
#12 0x00007f057c64ac73 in ssl3_GatherCompleteHandshake () from /lib64/libssl3.so
#13 0x00007f057c651feb in SSL_ForceHandshake () from /lib64/libssl3.so
#14 0x00007f058730fb43 in nss_connect_common () from /lib64/libcurl.so.4
#15 0x00007f0587306aee in Curl_ssl_connect_nonblocking () from /lib64/libcurl.so.4
#16 0x00007f05872dd62e in https_connecting () from /lib64/libcurl.so.4
#17 0x00007f05873004d8 in multi_runsingle () from /lib64/libcurl.so.4
#18 0x00007f05873013b1 in curl_multi_perform () from /lib64/libcurl.so.4
#19 0x00007f05872f8623 in curl_easy_perform () from /lib64/libcurl.so.4
#20 0x000055d9cece9dea in RGWHTTPClient::process (this=this@entry=0x7f045d898050, method=method@entry=0x55d9cf002efe "POST", url=0x55d9df723a08 "https://keystone.cxx/v3/s3tokens") at /usr/src/debug/ceph-12.2.8/src/rgw/rgw_http_client.cc:423

#2 Updated by Matt Benjamin almost 2 years ago

  • Assignee set to Marcus Watts

@marcus, could you have a look at this?

Matt

#3 Updated by Beom-Seok Park over 1 year ago

Any updates?
I have the same problem with CentOS 7.6, ceph v12.2.11, libcurl-7.29.0-51.el7.x86_64 and nss-3.36.0-7.1.el7_6.x86_64
There are a lot of locked threads in RGWHTTPClient::process -> curl_easy_perform

Thread 108 (Thread 0x7f2c32852700 (LWP 100259)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f2c76872e01 in _L_lock_1093 () from /lib64/libpthread.so.0
#2  0x00007f2c76872da2 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55dd80c65340) at ../nptl/pthread_mutex_lock.c:133
#3  0x00007f2c6c8e8e19 in PR_Lock (lock=0x55dd80c65340) at ../../../nspr/pr/src/pthreads/ptsynch.c:171
#4  0x00007f2c77325b30 in nss_find_slot_by_name (slot_name=slot_name@entry=0x55dde45fe100 "PEM Token #0") at nss.c:359
#5  0x00007f2c772ee529 in nss_create_object (obj_class=obj_class@entry=1, filename=filename@entry=0x55ddd7979dd0 "/etc/pki/tls/certs/ca-bundle.crt", cacert=<optimized out>, ssl=0x55dddd61d9e0, ssl=0x55dddd61d9e0) at nss.c:387
#6  0x00007f2c772ee699 in nss_load_cert (ssl=0x55dddd61d9e0, filename=0x55ddd7979dd0 "/etc/pki/tls/certs/ca-bundle.crt", cacert=1) at nss.c:457
#7  0x00007f2c773275bd in nss_load_ca_certificates (sockindex=0, conn=0x55dddd61d800) at nss.c:1243
#8  nss_setup_connect (sockindex=0, conn=0x55dddd61d800) at nss.c:1499
#9  nss_connect_common (conn=conn@entry=0x55dddd61d800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3284984d) at nss.c:1672
#10 0x00007f2c77327d55 in Curl_nss_connect_nonblocking (conn=conn@entry=0x55dddd61d800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3284984d) at nss.c:1725
#11 0x00007f2c7731daee in Curl_ssl_connect_nonblocking (conn=conn@entry=0x55dddd61d800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3284984d) at sslgen.c:229
#12 0x00007f2c772f4ddd in https_connecting (done=0x7f2c3284984d, conn=0x55dddd61d800) at http.c:1352
#13 Curl_http_connect (conn=0x55dddd61d800, done=0x7f2c3284984d) at http.c:1322
#14 0x00007f2c77304735 in Curl_protocol_connect (conn=0x55dddd61d800, protocol_done=protocol_done@entry=0x7f2c3284984d) at url.c:3328
#15 0x00007f2c77317e5b in multi_runsingle (multi=multi@entry=0x55ddae7d4700, now=..., easy=easy@entry=0x55ddd7163500) at multi.c:1159
#16 0x00007f2c773183b1 in curl_multi_perform (multi_handle=multi_handle@entry=0x55ddae7d4700, running_handles=running_handles@entry=0x7f2c32849940) at multi.c:1757
#17 0x00007f2c7730f623 in curl_easy_perform (easy=0x55ddb93a8000) at easy.c:480
#18 0x000055dd7efb0faa in RGWHTTPClient::process (this=this@entry=0x7f2c3284a050, method=method@entry=0x55dd7f2e0f1e "POST", url=0x55ddd7162df8 "https://keystone.server/v3/s3tokens") at /usr/src/debug/ceph-12.2.11/src/rgw/rgw_http_client.cc:424

Thread 110 (Thread 0x7f2c31850700 (LWP 100261)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f2c76872e01 in _L_lock_1093 () from /lib64/libpthread.so.0
#2  0x00007f2c76872da2 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55dd80d57a20) at ../nptl/pthread_mutex_lock.c:133
#3  0x00007f2c6c8e8e19 in PR_Lock (lock=0x55dd80d57a20) at ../../../nspr/pr/src/pthreads/ptsynch.c:171
#4  0x00007f2c6c178059 in PK11_EnterSlotMonitor (slot=slot@entry=0x55dd86fa8380) at pk11slot.c:479
#5  0x00007f2c6c169271 in PK11_DestroyObject (slot=0x55dd86fa8380, object=323518) at pk11obj.c:54
#6  0x00007f2c6c16c039 in PK11_DestroyGenericObject (object=0x55dde7f57a40) at pk11obj.c:1590
#7  0x00007f2c77314eef in Curl_llist_remove (list=list@entry=0x55ddbbe0a040, e=0x55ddffde3040, user=user@entry=0x0) at llist.c:132
#8  0x00007f2c77314f8f in Curl_llist_destroy (list=0x55ddbbe0a040, user=0x0) at llist.c:149
#9  0x00007f2c773267e2 in nss_fail_connect (data=data@entry=0x55dddfbc0000, curlerr=CURLE_SSL_CONNECT_ERROR, connssl=0x55dde0fc69e0) at nss.c:1362
#10 0x00007f2c77326da6 in nss_do_connect (sockindex=0, conn=0x55dde0fc6800) at nss.c:1654
#11 nss_connect_common (conn=conn@entry=0x55dde0fc6800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3184784d) at nss.c:1685
#12 0x00007f2c77327d55 in Curl_nss_connect_nonblocking (conn=conn@entry=0x55dde0fc6800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3184784d) at nss.c:1725
#13 0x00007f2c7731daee in Curl_ssl_connect_nonblocking (conn=conn@entry=0x55dde0fc6800, sockindex=sockindex@entry=0, done=done@entry=0x7f2c3184784d) at sslgen.c:229
#14 0x00007f2c772f4ddd in https_connecting (done=0x7f2c3184784d, conn=0x55dde0fc6800) at http.c:1352
#15 Curl_http_connect (conn=0x55dde0fc6800, done=0x7f2c3184784d) at http.c:1322
#16 0x00007f2c77304735 in Curl_protocol_connect (conn=0x55dde0fc6800, protocol_done=protocol_done@entry=0x7f2c3184784d) at url.c:3328
#17 0x00007f2c77317e5b in multi_runsingle (multi=multi@entry=0x55ddd0cc40e0, now=..., easy=easy@entry=0x55dd95a28ea0) at multi.c:1159
#18 0x00007f2c773183b1 in curl_multi_perform (multi_handle=multi_handle@entry=0x55ddd0cc40e0, running_handles=running_handles@entry=0x7f2c31847940) at multi.c:1757
#19 0x00007f2c7730f623 in curl_easy_perform (easy=0x55dddfbc0000) at easy.c:480
#20 0x000055dd7efb0faa in RGWHTTPClient::process (this=this@entry=0x7f2c31848050, method=method@entry=0x55dd7f2e0f1e "POST", url=0x55dd95a29998 "https://keystone.server/v3/s3tokens") at /usr/src/debug/ceph-12.2.11/src/rgw/rgw_http_client.cc:424

.
.
.
Thread 4124 (Thread 0x7f24598a2700 (LWP 104276)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f2c76872e01 in _L_lock_1093 () from /lib64/libpthread.so.0
#2  0x00007f2c76872da2 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55dd80d57a20) at ../nptl/pthread_mutex_lock.c:133
#3  0x00007f2c6c8e8e19 in PR_Lock (lock=0x55dd80d57a20) at ../../../nspr/pr/src/pthreads/ptsynch.c:171
#4  0x00007f2c6c19318e in nssSlot_EnterMonitor (slot=slot@entry=0x55dd86fdc830) at devslot.c:48
#5  0x00007f2c6c19354c in nssSlot_GetToken (slot=0x55dd86fdc830) at devslot.c:288
#6  0x00007f2c6c18d3da in nssTrustDomain_FindCertificatesBySubject (td=td@entry=0x55dd80ac9030, subject=subject@entry=0x55dd8701e8f8, rvOpt=rvOpt@entry=0x0, maximumOpt=maximumOpt@entry=0, arenaOpt=arenaOpt@entry=0x55ddd90d2240) at trustdomain.c:605
#7  0x00007f2c6c18c05e in find_cert_issuer (cc=0x55dd80c60030, td=0x55dd80ac9030, policiesOpt=0x0, usage=0x7f2458896eb0, timeOpt=0x55de11151b10, c=0x55dd8701e888) at certificate.c:382
#8  nssCertificate_BuildChain (c=c@entry=0x55dd8701e888, timeOpt=timeOpt@entry=0x55de11151b10, usage=usage@entry=0x7f2458896f40, policiesOpt=policiesOpt@entry=0x0, rvOpt=rvOpt@entry=0x7f2458896f50, rvLimit=rvLimit@entry=2, arenaOpt=0x0, statusOpt=0x7f2458896f3c, td=0x55dd80ac9030, cc=0x55dd80c60030) at certificate.c:456
#9  0x00007f2c6c18c355 in NSSCertificate_BuildChain (c=c@entry=0x55dd8701e888, timeOpt=timeOpt@entry=0x55de11151b10, usage=usage@entry=0x7f2458896f40, policiesOpt=policiesOpt@entry=0x0, rvOpt=rvOpt@entry=0x7f2458896f50, rvLimit=rvLimit@entry=2, arenaOpt=arenaOpt@entry=0x0, statusOpt=statusOpt@entry=0x7f2458896f3c, td=td@entry=0x55dd80ac9030, cc=0x55dd80c60030) at certificate.c:501
#10 0x00007f2c6c1450b0 in CERT_FindCertIssuer (cert=cert@entry=0x55dd8701c020, validTime=validTime@entry=1552737975413268, usage=usage@entry=certUsageSSLServer) at certvfy.c:250
#11 0x00007f2c6c145659 in cert_VerifyCertChainOld (revoked=0x0, log=0x0, wincx=0x0, t=1552737975413268, certUsage=certUsageSSLServer, sigerror=0x0, checkSig=1, cert=0x55dd8701c020, handle=0x55dd80ac9030) at certvfy.c:689
#12 cert_VerifyCertChain (handle=handle@entry=0x55dd80ac9030, cert=cert@entry=0x55dd8701c020, checkSig=checkSig@entry=1, sigerror=sigerror@entry=0x0, certUsage=certUsage@entry=certUsageSSLServer, t=t@entry=1552737975413268, wincx=wincx@entry=0x0, log=log@entry=0x0, revoked=revoked@entry=0x0) at certvfy.c:949
#13 0x00007f2c6c145f68 in CERT_VerifyCertChain (handle=handle@entry=0x55dd80ac9030, cert=cert@entry=0x55dd8701c020, checkSig=checkSig@entry=1, certUsage=certUsage@entry=certUsageSSLServer, t=t@entry=1552737975413268, wincx=wincx@entry=0x0, log=log@entry=0x0) at certvfy.c:958
#14 0x00007f2c6c146c76 in cert_VerifyCertWithFlags (handle=handle@entry=0x55dd80ac9030, cert=0x55dd8701c020, checkSig=checkSig@entry=1, certUsage=certUsageSSLServer, t=t@entry=1552737975413268, flags=flags@entry=0, wincx=0x0, log=log@entry=0x0) at certvfy.c:1595
#15 0x00007f2c6c146e3a in CERT_VerifyCert (handle=handle@entry=0x55dd80ac9030, cert=<optimized out>, checkSig=checkSig@entry=1, certUsage=<optimized out>, t=t@entry=1552737975413268, wincx=<optimized out>, log=log@entry=0x0) at certvfy.c:1504
#16 0x00007f2c6c69273f in SSL_AuthCertificate (arg=0x55dd80ac9030, fd=<optimized out>, checkSig=1, isServer=0) at sslauth.c:273
#17 0x00007f2c6c68a860 in ssl3_AuthCertificate (ss=ss@entry=0x55dda0369000) at ssl3con.c:10388
#18 0x00007f2c6c68aeb5 in ssl3_CompleteHandleCertificate (ss=ss@entry=0x55dda0369000, b=0x55ddb9434b41 "ico8897b7xz7d\016", b@entry=0x55ddb9434004 "", length=0, length@entry=2877) at ssl3con.c:10343
#19 0x00007f2c6c68cea7 in ssl3_HandleCertificate (length=2877, b=0x55ddb9434004 "", ss=0x55dda0369000) at ssl3con.c:10211
#20 ssl3_HandlePostHelloHandshakeMessage (length=2877, b=0x55ddb9434004 "", ss=0x55dda0369000) at ssl3con.c:11405
#21 ssl3_HandleHandshakeMessage (ss=ss@entry=0x55dda0369000, b=0x55ddb9434004 "", length=2877, endOfRecord=1) at ssl3con.c:11352
#22 0x00007f2c6c68f927 in ssl3_HandleHandshake (origBuf=0x55dda0369288, ss=0x55dda0369000) at ssl3con.c:11533
#23 ssl3_HandleNonApplicationData (ss=ss@entry=0x55dda0369000, rType=<optimized out>, epoch=<optimized out>, seqNum=<optimized out>, databuf=databuf@entry=0x55dda0369288) at ssl3con.c:12047
#24 0x00007f2c6c6908d7 in ssl3_HandleRecord (ss=ss@entry=0x55dda0369000, cText=cText@entry=0x7f24588975a0, databuf=databuf@entry=0x55dda0369288) at ssl3con.c:12318
#25 0x00007f2c6c691c73 in ssl3_GatherCompleteHandshake (ss=ss@entry=0x55dda0369000, flags=flags@entry=0) at ssl3gthr.c:511
#26 0x00007f2c6c698feb in SSL_ForceHandshake (fd=<optimized out>) at sslsecur.c:399
#27 0x00007f2c6c699126 in SSL_ForceHandshakeWithTimeout (fd=<optimized out>, timeout=<optimized out>) at sslsecur.c:428
#28 0x00007f2c77326b43 in nss_do_connect (sockindex=0, conn=0x55ddf098d800) at nss.c:1619
#29 nss_connect_common (conn=conn@entry=0x55ddf098d800, sockindex=sockindex@entry=0, done=done@entry=0x7f245889784d) at nss.c:1685
#30 0x00007f2c77327d55 in Curl_nss_connect_nonblocking (conn=conn@entry=0x55ddf098d800, sockindex=sockindex@entry=0, done=done@entry=0x7f245889784d) at nss.c:1725
#31 0x00007f2c7731daee in Curl_ssl_connect_nonblocking (conn=conn@entry=0x55ddf098d800, sockindex=sockindex@entry=0, done=0x7f245889784d) at sslgen.c:229
#32 0x00007f2c772f462e in https_connecting (conn=0x55ddf098d800, done=<optimized out>) at http.c:1352
#33 0x00007f2c773174d8 in multi_runsingle (multi=multi@entry=0x55ddb6910c40, now=..., easy=easy@entry=0x55ddb0520de0) at multi.c:1204
#34 0x00007f2c773183b1 in curl_multi_perform (multi_handle=multi_handle@entry=0x55ddb6910c40, running_handles=running_handles@entry=0x7f2458897940) at multi.c:1757
#35 0x00007f2c7730f623 in curl_easy_perform (easy=0x55dd8e72e000) at easy.c:480
#36 0x000055dd7efb0faa in RGWHTTPClient::process (this=this@entry=0x7f2458898050, method=method@entry=0x55dd7f2e0f1e "POST", url=0x55ddc76cd2d8 "https://keystone.server/v3/s3tokens") at /usr/src/debug/ceph-12.2.11/src/rgw/rgw_http_client.cc:424

I temporarily changed the keystone url to http. As a result, the problem is gone.

Also available in: Atom PDF