Bug #62000

open

rgw crashed on latest ceph version 17.2.6 quincy

Added by Oleksii Yermak 10 months ago. Updated about 2 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
07/13/2023
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On the latest Ceph version, 17.2.6 quincy (stable), the radosgw process crashes persistently on every running rgw. I have two rgw instances, and they crashed simultaneously under minimal load on the servers, while the radosgw process constantly consumes ~100%. We run rgw on "CentOS Linux release 8.5.2111" and "AlmaLinux release 8.8 (Sapphire Caracal)", so I do not think this is related to the servers or the operating systems. Example logs:

-34> 2023-07-04T11:07:57.228+0000 7f6eee4e6700  1 ====== starting new request req=0x7f6e8a39c710 =====
   -33> 2023-07-04T11:07:57.228+0000 7f6eee4e6700  2 req 12755806709950828565 0.000000000s initializing for trans_id = tx00000b105ba9e9a29d015-0064a3fd8d-5c7a0-eu-west-1
   -32> 2023-07-04T11:07:57.228+0000 7f6eee4e6700  2 req 12755806709950828565 0.000000000s getting op 0
   -31> 2023-07-04T11:07:57.228+0000 7f6eee4e6700  2 req 12755806709950828565 0.000000000s s3:get_obj verifying requester
   -30> 2023-07-04T11:07:57.229+0000 7f6eee4e6700  2 req 12755806709950828565 0.000999974s s3:get_obj normalizing buckets and tenants
   -29> 2023-07-04T11:07:57.229+0000 7f6eee4e6700  2 req 12755806709950828565 0.000999974s s3:get_obj init permissions
   -28> 2023-07-04T11:07:57.229+0000 7f6eee4e6700  2 req 12755806709950828565 0.000999974s s3:get_obj recalculating target
   -27> 2023-07-04T11:07:57.229+0000 7f6eee4e6700  2 req 12755806709950828565 0.000999974s s3:get_obj reading permissions
   -26> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  0 req 12755806709950828565 0.002999922s s3:get_obj WARNING: couldn't find acl header for object, generating default
   -25> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj init op
   -24> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj verifying op mask
   -23> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj verifying op permissions
   -22> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=6016-5, acct_name=owncloud-prod, subuser=, perm_mask=15, is_admin=0) mask=49
   -21> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for uid=6016-5
   -20> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Found permission: 15
   -19> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for group=1 mask=49
   -18> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Permissions for group not found
   -17> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for group=2 mask=49
   -16> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Permissions for group not found
   -15> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=6016-5, acct_name=owncloud-prod, subuser=, perm_mask=15, is_admin=0), owner=6016-5, perm=1
   -14> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj verifying op params
   -13> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj pre-executing
   -12> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj check rate limiting
   -11> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj executing
   -10> 2023-07-04T11:07:57.236+0000 7f6eee4e6700 -1 *** Caught signal (Aborted) **
 in thread 7f6eee4e6700 thread_name:radosgw

 ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f70c9e8fcf0]
 2: gsignal()
 3: abort()
 4: /lib64/libstdc++.so.6(+0x9009b) [0x7f70c8e7b09b]
 5: /lib64/libstdc++.so.6(+0x9653c) [0x7f70c8e8153c]
 6: /lib64/libstdc++.so.6(+0x95559) [0x7f70c8e80559]
 7: __gxx_personality_v0()
 8: /lib64/libgcc_s.so.1(+0x10b03) [0x7f70c885fb03]
 9: _Unwind_Resume()
 10: /lib64/libradosgw.so.2(+0x538c5b) [0x7f70cc373c5b]
 11: /lib64/libradosgw.so.2(+0x63048a) [0x7f70cc46b48a]
 12: /lib64/libstdc++.so.6(+0xc2b13) [0x7f70c8eadb13]
 13: /lib64/libpthread.so.0(+0x81ca) [0x7f70c9e851ca]
 14: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -9> 2023-07-04T11:07:57.306+0000 7f706c7e2700  5 req 6774614862144446470 0.150996074s s3:put_obj NOTICE: call to do_aws4_auth_completion
    -8> 2023-07-04T11:07:57.306+0000 7f706c7e2700  5 req 6774614862144446470 0.150996074s s3:put_obj NOTICE: call to do_aws4_auth_completion
    -7> 2023-07-04T11:07:57.314+0000 7f709f26f700  5 RGW-SYNC:data:sync:shard[120]: failed to take lease
    -6> 2023-07-04T11:07:57.347+0000 7f6f795fc700  2 req 6774614862144446470 0.191995010s s3:put_obj completing
    -5> 2023-07-04T11:07:57.348+0000 7f6f795fc700  2 req 6774614862144446470 0.192994997s s3:put_obj op status=0
    -4> 2023-07-04T11:07:57.348+0000 7f6f795fc700  2 req 6774614862144446470 0.192994997s s3:put_obj http status=200
    -3> 2023-07-04T11:07:57.348+0000 7f6f795fc700  1 ====== req done req=0x7f6e8a41d710 op status=0 http_status=200 latency=0.192994997s ======
    -2> 2023-07-04T11:07:57.348+0000 7f6f795fc700  1 beast: 0x7f6e8a41d710: [IPv6 address] - 6016-5 [04/Jul/2023:11:07:57.155 +0000] "PUT /owncloud-prod/urn%3Aoid%3A2376416?partNumber=11&uploadId=2~sX-2sT0iBoilw73U4ziIIXNCeOPgniT HTTP/1.1" 200 5242880 - "aws-sdk-php/3.134.8 Guzzle/5.3.1 curl/7.29.0 PHP/7.4.24" - latency=0.192994997s
    -1> 2023-07-04T11:07:57.392+0000 7f70a4a7a700 10 monclient: tick
     0> 2023-07-04T11:07:57.765+0000 7f709f26f700  5 RGW-SYNC:data:sync:shard[119]: failed to take lease

On the second rgw:

-34> 2023-07-04T11:07:57.228+0000 7f6eee4e6700  1 ====== starting new request req=0x7f6e8a39c710 =====
   -33> 2023-07-04T11:07:57.228+0000 7f6eee4e6700  2 req 12755806709950828565 0.000000000s initializing for trans_id = tx00000b105ba9e9a29d015-0064a3fd8d-5c7a0-eu-west-1
   -32> 2023-07-04T11:07:57.228+0000 7f6eee4e6700  2 req 12755806709950828565 0.000000000s getting op 0
   -31> 2023-07-04T11:07:57.228+0000 7f6eee4e6700  2 req 12755806709950828565 0.000000000s s3:get_obj verifying requester
   -30> 2023-07-04T11:07:57.229+0000 7f6eee4e6700  2 req 12755806709950828565 0.000999974s s3:get_obj normalizing buckets and tenants
   -29> 2023-07-04T11:07:57.229+0000 7f6eee4e6700  2 req 12755806709950828565 0.000999974s s3:get_obj init permissions
   -28> 2023-07-04T11:07:57.229+0000 7f6eee4e6700  2 req 12755806709950828565 0.000999974s s3:get_obj recalculating target
   -27> 2023-07-04T11:07:57.229+0000 7f6eee4e6700  2 req 12755806709950828565 0.000999974s s3:get_obj reading permissions
   -26> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  0 req 12755806709950828565 0.002999922s s3:get_obj WARNING: couldn't find acl header for object, generating default
   -25> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj init op
   -24> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj verifying op mask
   -23> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj verifying op permissions
   -22> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=6016-5, acct_name=owncloud-prod, subuser=, perm_mask=15, is_admin=0) mask=49
   -21> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for uid=6016-5
   -20> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Found permission: 15
   -19> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for group=1 mask=49
   -18> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Permissions for group not found
   -17> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Searching permissions for group=2 mask=49
   -16> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj Permissions for group not found
   -15> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  5 req 12755806709950828565 0.002999922s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=6016-5, acct_name=owncloud-prod, subuser=, perm_mask=15, is_admin=0), owner=6016-5, perm=1
   -14> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj verifying op params
   -13> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj pre-executing
   -12> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj check rate limiting
   -11> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj executing
   -10> 2023-07-04T11:07:57.236+0000 7f6eee4e6700 -1 *** Caught signal (Aborted) **
 in thread 7f6eee4e6700 thread_name:radosgw

 ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f70c9e8fcf0]
 2: gsignal()
 3: abort()
 4: /lib64/libstdc++.so.6(+0x9009b) [0x7f70c8e7b09b]
 5: /lib64/libstdc++.so.6(+0x9653c) [0x7f70c8e8153c]
 6: /lib64/libstdc++.so.6(+0x95559) [0x7f70c8e80559]
 7: __gxx_personality_v0()
 8: /lib64/libgcc_s.so.1(+0x10b03) [0x7f70c885fb03]
 9: _Unwind_Resume()
 10: /lib64/libradosgw.so.2(+0x538c5b) [0x7f70cc373c5b]
 11: /lib64/libradosgw.so.2(+0x63048a) [0x7f70cc46b48a]
 12: /lib64/libstdc++.so.6(+0xc2b13) [0x7f70c8eadb13]
 13: /lib64/libpthread.so.0(+0x81ca) [0x7f70c9e851ca]
 14: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -9> 2023-07-04T11:07:57.306+0000 7f706c7e2700  5 req 6774614862144446470 0.150996074s s3:put_obj NOTICE: call to do_aws4_auth_completion
    -8> 2023-07-04T11:07:57.306+0000 7f706c7e2700  5 req 6774614862144446470 0.150996074s s3:put_obj NOTICE: call to do_aws4_auth_completion
    -7> 2023-07-04T11:07:57.314+0000 7f709f26f700  5 RGW-SYNC:data:sync:shard[120]: failed to take lease
    -6> 2023-07-04T11:07:57.347+0000 7f6f795fc700  2 req 6774614862144446470 0.191995010s s3:put_obj completing
    -5> 2023-07-04T11:07:57.348+0000 7f6f795fc700  2 req 6774614862144446470 0.192994997s s3:put_obj op status=0
    -4> 2023-07-04T11:07:57.348+0000 7f6f795fc700  2 req 6774614862144446470 0.192994997s s3:put_obj http status=200
    -3> 2023-07-04T11:07:57.348+0000 7f6f795fc700  1 ====== req done req=0x7f6e8a41d710 op status=0 http_status=200 latency=0.192994997s ======
    -2> 2023-07-04T11:07:57.348+0000 7f6f795fc700  1 beast: 0x7f6e8a41d710: [IPv6 address] - 6016-5 [04/Jul/2023:11:07:57.155 +0000] "PUT /owncloud-prod/urn%3Aoid%3A2376416?partNumber=11&uploadId=2~sX-2sT0iBoilw73U4ziIIXNCeOPgniT HTTP/1.1" 200 5242880 - "aws-sdk-php/3.134.8 Guzzle/5.3.1 curl/7.29.0 PHP/7.4.24" - latency=0.192994997s
    -1> 2023-07-04T11:07:57.392+0000 7f70a4a7a700 10 monclient: tick
     0> 2023-07-04T11:07:57.765+0000 7f709f26f700  5 RGW-SYNC:data:sync:shard[119]: failed to take lease
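
To read a dump like the ones above, it helps to find the thread that logged "Caught signal" and then the last request stage that same thread logged before the abort. The following is a minimal sketch, assuming the line layout seen in these excerpts (index, timestamp, thread id, level, message); the function name `crashing_request` and the regular expressions are illustrative and not part of Ceph.

```python
import re

# One dump line: "<index>> <timestamp> <thread id> <level> <message>"
LINE_RE = re.compile(r"^\s*-?\d+>\s+(\S+)\s+([0-9a-f]+)\s+(-?\d+)\s+(.*)$")
# Request message: "req <id> <elapsed> <op> <stage>", e.g. "s3:get_obj executing"
REQ_RE = re.compile(r"^req (\d+) \S+ (\w+:\w+) (.*)$")


def crashing_request(log):
    """Return the last request stage logged by the thread that aborted."""
    entries = [m.groups() for m in map(LINE_RE.match, log.splitlines()) if m]
    crash_thread = next(
        (thread for _, thread, _, msg in entries if "Caught signal" in msg),
        None,
    )
    if crash_thread is None:
        return None
    last = None
    for _, thread, _, msg in entries:
        if thread != crash_thread:
            continue
        if "Caught signal" in msg:
            break  # stop at the abort; later lines belong to other work
        m = REQ_RE.match(msg)
        if m:
            last = {"thread": thread, "req": m.group(1),
                    "op": m.group(2), "stage": m.group(3)}
    return last


# Lines copied from the excerpt in this report.
SAMPLE = """\
   -12> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj check rate limiting
   -11> 2023-07-04T11:07:57.231+0000 7f6eee4e6700  2 req 12755806709950828565 0.002999922s s3:get_obj executing
   -10> 2023-07-04T11:07:57.236+0000 7f6eee4e6700 -1 *** Caught signal (Aborted) **
    -6> 2023-07-04T11:07:57.347+0000 7f6f795fc700  2 req 6774614862144446470 0.191995010s s3:put_obj completing
"""

if __name__ == "__main__":
    print(crashing_request(SAMPLE))
```

For the excerpt in this report, this points at thread 7f6eee4e6700, which was in the "executing" stage of an s3:get_obj request when it aborted.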

We are running the latest version, 17.2.6, on all mds, mgr, mon, osd, and rgw nodes. I tried raising and lowering rgw_thread_pool_size from its default, but that did not help.
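
The libradosgw.so.2 frames in the backtraces above are raw offsets, and the NOTE in the dump says the binary is needed to interpret them. The following is a minimal sketch that extracts those offsets and prints `addr2line` commands for them. It assumes binutils `addr2line` is installed and that debug symbols matching the installed ceph packages are available on the host, neither of which is stated in this report. The helper names `unresolved_frames` and `addr2line_commands` are hypothetical.

```python
import re

# Matches frames such as " 10: /lib64/libradosgw.so.2(+0x538c5b) [0x7f70cc373c5b]".
# Frames already resolved to a symbol, such as " 3: abort()", do not match.
FRAME_RE = re.compile(
    r"^\s*(\d+):\s+(\S+?)\(\+(0x[0-9a-fA-F]+)\)\s+\[0x[0-9a-fA-F]+\]\s*$"
)


def unresolved_frames(backtrace):
    """Return (frame number, library path, offset) for each unresolved frame."""
    frames = []
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(2), m.group(3)))
    return frames


def addr2line_commands(backtrace, only=None):
    """Build one addr2line command per unresolved frame.

    `only` optionally restricts output to libraries whose path contains it.
    """
    return [
        f"addr2line -Cfie {lib} {offset}"
        for _, lib, offset in unresolved_frames(backtrace)
        if only is None or only in lib
    ]


# Frames copied from the backtrace in this report.
SAMPLE = """\
 9: _Unwind_Resume()
 10: /lib64/libradosgw.so.2(+0x538c5b) [0x7f70cc373c5b]
 11: /lib64/libradosgw.so.2(+0x63048a) [0x7f70cc46b48a]
 12: /lib64/libstdc++.so.6(+0xc2b13) [0x7f70c8eadb13]
"""

if __name__ == "__main__":
    for cmd in addr2line_commands(SAMPLE, only="libradosgw"):
        print(cmd)
```

Running the printed commands on the affected host would, under those assumptions, name the radosgw functions at frames 10 and 11, which are the frames just below `_Unwind_Resume()` in the abort path.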


Files

rgw.zip (277 KB) Oleksii Yermak, 08/10/2023 05:37 PM