Project

General

Profile

Bug #58330

RGW service crashes regularly with floating point exception

Added by Jonathan Bayer over 1 year ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Target version:
% Done:

100%

Source:
Tags:
rgw crash backport_processed
Backport:
quincy,pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

The RGW instances in our cluster are crashing regularly with this message. They can be restarted, but will crash after a short period even when not exposed to client requests. It seems to be down to our multi-site replication, but the other sites are fine. Here's the relevant log entries, what can we do to troubleshoot or figure out what the root cause is here .

 -338> 2022-12-09T18:04:32.916+0000 7f6980719700  1 ====== starting new request req=0x7f682c7e85c0 =====
 -337> 2022-12-09T18:04:32.916+0000 7f6980719700  2 req 9777807639303135910 0.000000000s initializing for trans_id = tx0000087b1c03e2ec3a2a6-00639378b0-b992fa-nvan
 -336> 2022-12-09T18:04:32.916+0000 7f6980719700  2 req 9777807639303135910 0.000000000s getting op 0
 -335> 2022-12-09T18:04:32.916+0000 7f6980719700  2 req 9777807639303135910 0.000000000s s3:get_obj verifying requester
 -334> 2022-12-09T18:04:32.916+0000 7f6980719700  2 req 9777807639303135910 0.000000000s s3:get_obj normalizing buckets and tenants
 -333> 2022-12-09T18:04:32.916+0000 7f6980719700  2 req 9777807639303135910 0.000000000s s3:get_obj init permissions
 -332> 2022-12-09T18:04:32.916+0000 7f6980719700  2 req 9777807639303135910 0.000000000s s3:get_obj recalculating target
 -331> 2022-12-09T18:04:32.916+0000 7f6980719700  2 req 9777807639303135910 0.000000000s s3:get_obj reading permissions
 -330> 2022-12-09T18:04:32.927+0000 7f6993f40700  2 req 10732979668358407264 0.140001565s s3:get_obj init op
 -329> 2022-12-09T18:04:32.927+0000 7f6993f40700  2 req 10732979668358407264 0.140001565s s3:get_obj verifying op mask
 -328> 2022-12-09T18:04:32.927+0000 7f6993f40700  2 req 10732979668358407264 0.140001565s s3:get_obj verifying op permissions
 -327> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0) mask=49
 -326> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Searching permissions for uid=synchronization-user
 -325> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Permissions for user not found
 -324> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Searching permissions for group=1 mask=49
 -323> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Permissions for group not found
 -322> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Searching permissions for group=2 mask=49
 -321> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Permissions for group not found
 -320> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0), owner=admin, perm=0
 -319> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0) mask=48
 -318> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Searching permissions for uid=synchronization-user
 -317> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Permissions for user not found
 -316> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Searching permissions for group=1 mask=48
 -315> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Permissions for group not found
 -314> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Searching permissions for group=2 mask=48
 -313> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj Permissions for group not found
 -312> 2022-12-09T18:04:32.927+0000 7f6993f40700  5 req 10732979668358407264 0.140001565s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0), owner=admin, perm=0
 -311> 2022-12-09T18:04:32.927+0000 7f6993f40700  2 overriding permissions due to system operation
 -310> 2022-12-09T18:04:32.927+0000 7f6993f40700  2 req 10732979668358407264 0.140001565s s3:get_obj verifying op params
 -309> 2022-12-09T18:04:32.927+0000 7f6993f40700  2 req 10732979668358407264 0.140001565s s3:get_obj pre-executing
 -308> 2022-12-09T18:04:32.927+0000 7f6993f40700  2 req 10732979668358407264 0.140001565s s3:get_obj check rate limiting
 -307> 2022-12-09T18:04:32.927+0000 7f6993f40700  2 req 10732979668358407264 0.140001565s s3:get_obj executing
 -306> 2022-12-09T18:04:32.927+0000 7f6993f40700  2 req 10732979668358407264 0.140001565s s3:get_obj completing
 -305> 2022-12-09T18:04:32.928+0000 7f6993f40700  2 req 10732979668358407264 0.141001567s s3:get_obj op status=-2016
 -304> 2022-12-09T18:04:32.928+0000 7f6993f40700  2 req 10732979668358407264 0.141001567s s3:get_obj http status=304
 -303> 2022-12-09T18:04:32.928+0000 7f6993f40700  1 ====== req done req=0x7f682c35f5c0 op status=-2016 http_status=304 latency=0.141001567s ======
 -302> 2022-12-09T18:04:32.928+0000 7f6993f40700  1 beast: 0x7f682c35f5c0: 10.200.125.13 - synchronization-user [09/Dec/2022:18:04:32.787 +0000] "GET /mimic.da.spoolers.perf3/0OQNSl04%2F1517.hydozRLPKmLv0l4W.rnd?rgwx-zonegroup=56082fa6-e41f-40aa-9c45-b83b869effd1&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-skip-decrypt HTTP/1.1" 304 0 - - - latency=0.141001567s
 -301> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 req 3767462689007815227 0.077000856s s3:get_obj init op
 -300> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 req 3767462689007815227 0.077000856s s3:get_obj verifying op mask
 -299> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 req 3767462689007815227 0.077000856s s3:get_obj verifying op permissions
 -298> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0) mask=49
 -297> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Searching permissions for uid=synchronization-user
 -296> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Permissions for user not found
 -295> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Searching permissions for group=1 mask=49
 -294> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Permissions for group not found
 -293> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Searching permissions for group=2 mask=49
 -292> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Permissions for group not found
 -291> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0), owner=admin, perm=0
 -290> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0) mask=48
 -289> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Searching permissions for uid=synchronization-user
 -288> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Permissions for user not found
 -287> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Searching permissions for group=1 mask=48
 -286> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Permissions for group not found
 -285> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Searching permissions for group=2 mask=48
 -284> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj Permissions for group not found
 -283> 2022-12-09T18:04:32.970+0000 7f6980f1a700  5 req 3767462689007815227 0.077000856s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0), owner=admin, perm=0
 -282> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 overriding permissions due to system operation
 -281> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 req 3767462689007815227 0.077000856s s3:get_obj verifying op params
 -280> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 req 3767462689007815227 0.077000856s s3:get_obj pre-executing
 -279> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 req 3767462689007815227 0.077000856s s3:get_obj check rate limiting
 -278> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 req 3767462689007815227 0.077000856s s3:get_obj executing
 -277> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 req 3767462689007815227 0.077000856s s3:get_obj completing
 -276> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 req 3767462689007815227 0.077000856s s3:get_obj op status=-2016
 -275> 2022-12-09T18:04:32.970+0000 7f6980f1a700  2 req 3767462689007815227 0.077000856s s3:get_obj http status=304
 -274> 2022-12-09T18:04:32.970+0000 7f6980f1a700  1 ====== req done req=0x7f6b6fde75c0 op status=-2016 http_status=304 latency=0.077000856s ======
 -273> 2022-12-09T18:04:32.971+0000 7f6980f1a700  1 beast: 0x7f6b6fde75c0: 10.200.125.14 - synchronization-user [09/Dec/2022:18:04:32.893 +0000] "GET /mimic.da.spoolers.perf3/1SJ1eyPp%2F7293.c)ZwwC9CAzfuj2vn.rnd?rgwx-zonegroup=56082fa6-e41f-40aa-9c45-b83b869effd1&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-skip-decrypt HTTP/1.1" 304 0 - - - latency=0.077000856s
 -272> 2022-12-09T18:04:33.026+0000 7f6980719700  2 req 9777807639303135910 0.110001221s s3:get_obj init op
 -271> 2022-12-09T18:04:33.026+0000 7f6980719700  2 req 9777807639303135910 0.110001221s s3:get_obj verifying op mask
 -270> 2022-12-09T18:04:33.026+0000 7f6980719700  2 req 9777807639303135910 0.110001221s s3:get_obj verifying op permissions
 -269> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0) mask=49
 -268> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Searching permissions for uid=synchronization-user
 -267> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Permissions for user not found
 -266> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Searching permissions for group=1 mask=49
 -265> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Permissions for group not found
 -264> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Searching permissions for group=2 mask=49
 -263> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Permissions for group not found
 -262> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0), owner=admin, perm=0
 -261> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0) mask=48
 -260> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Searching permissions for uid=synchronization-user
 -259> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Permissions for user not found
 -258> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Searching permissions for group=1 mask=48
 -257> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Permissions for group not found
 -256> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Searching permissions for group=2 mask=48
 -255> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj Permissions for group not found
 -254> 2022-12-09T18:04:33.026+0000 7f6980719700  5 req 9777807639303135910 0.110001221s s3:get_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0), owner=admin, perm=0
 -253> 2022-12-09T18:04:33.026+0000 7f6980719700  2 overriding permissions due to system operation
 -252> 2022-12-09T18:04:33.026+0000 7f6980719700  2 req 9777807639303135910 0.110001221s s3:get_obj verifying op params
 -251> 2022-12-09T18:04:33.026+0000 7f6980719700  2 req 9777807639303135910 0.110001221s s3:get_obj pre-executing
 -250> 2022-12-09T18:04:33.026+0000 7f6980719700  2 req 9777807639303135910 0.110001221s s3:get_obj check rate limiting
 -249> 2022-12-09T18:04:33.026+0000 7f6980719700  2 req 9777807639303135910 0.110001221s s3:get_obj executing
 -248> 2022-12-09T18:04:33.026+0000 7f6980719700  2 req 9777807639303135910 0.110001221s s3:get_obj completing
 -247> 2022-12-09T18:04:33.026+0000 7f6980719700  2 req 9777807639303135910 0.110001221s s3:get_obj op status=-2016
 -246> 2022-12-09T18:04:33.026+0000 7f6980719700  2 req 9777807639303135910 0.110001221s s3:get_obj http status=304
 -245> 2022-12-09T18:04:33.026+0000 7f6980719700  1 ====== req done req=0x7f682c7e85c0 op status=-2016 http_status=304 latency=0.110001221s ======
 -244> 2022-12-09T18:04:33.026+0000 7f6980719700  1 beast: 0x7f682c7e85c0: 10.200.125.16 - synchronization-user [09/Dec/2022:18:04:32.916 +0000] "GET /mimic.da.spoolers.perf3/(Ja5Foko%2F10722.GggLHf2oGIRAcnhV.rnd?rgwx-zonegroup=56082fa6-e41f-40aa-9c45-b83b869effd1&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-skip-decrypt HTTP/1.1" 304 0 - - - latency=0.110001221s
 -243> 2022-12-09T18:04:33.056+0000 7f69aef76700  1 ====== starting new request req=0x7f682c5e45c0 =====
 -242> 2022-12-09T18:04:33.056+0000 7f69aef76700  2 req 9849936671185163943 0.000000000s initializing for trans_id = tx0000088b201371a15d6a7-00639378b1-b992fa-nvan
 -241> 2022-12-09T18:04:33.056+0000 7f69aef76700  2 req 9849936671185163943 0.000000000s getting op 0
 -240> 2022-12-09T18:04:33.056+0000 7f69aef76700  2 req 9849936671185163943 0.000000000s s3:get_obj verifying requester
 -239> 2022-12-09T18:04:33.056+0000 7f69aef76700  2 req 9849936671185163943 0.000000000s s3:get_obj normalizing buckets and tenants
 -238> 2022-12-09T18:04:33.056+0000 7f69aef76700  2 req 9849936671185163943 0.000000000s s3:get_obj init permissions
 -237> 2022-12-09T18:04:33.056+0000 7f69aef76700  2 req 9849936671185163943 0.000000000s s3:get_obj recalculating target
 -236> 2022-12-09T18:04:33.056+0000 7f69aef76700  2 req 9849936671185163943 0.000000000s s3:get_obj reading permissions
 -235> 2022-12-09T18:04:33.083+0000 7f697f717700  1 ====== starting new request req=0x7f682b94b5c0 =====
 -234> 2022-12-09T18:04:33.083+0000 7f697f717700  2 req 8762047531418507817 0.000000000s initializing for trans_id = tx0000079990bce7749da29-00639378b1-b992fa-nvan
 -233> 2022-12-09T18:04:33.083+0000 7f697f717700  2 req 8762047531418507817 0.000000000s getting op 0
 -232> 2022-12-09T18:04:33.083+0000 7f697f717700  2 req 8762047531418507817 0.000000000s s3:list_bucket verifying requester
 -231> 2022-12-09T18:04:33.083+0000 7f697f717700  2 req 8762047531418507817 0.000000000s s3:list_bucket normalizing buckets and tenants
 -230> 2022-12-09T18:04:33.083+0000 7f697f717700  2 req 8762047531418507817 0.000000000s s3:list_bucket init permissions
 -229> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  2 req 8762047531418507817 0.002000023s s3:list_bucket recalculating target
 -228> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  2 req 8762047531418507817 0.002000023s s3:list_bucket reading permissions
 -227> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  2 req 8762047531418507817 0.002000023s s3:list_bucket init op
 -226> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  2 req 8762047531418507817 0.002000023s s3:list_bucket verifying op mask
 -225> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  2 req 8762047531418507817 0.002000023s s3:list_bucket verifying op permissions
 -224> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  5 req 8762047531418507817 0.002000023s s3:list_bucket Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0) mask=49
 -223> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  5 req 8762047531418507817 0.002000023s s3:list_bucket Searching permissions for uid=synchronization-user
 -222> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  5 req 8762047531418507817 0.002000023s s3:list_bucket Permissions for user not found
 -221> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  5 req 8762047531418507817 0.002000023s s3:list_bucket Searching permissions for group=1 mask=49
 -220> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  5 req 8762047531418507817 0.002000023s s3:list_bucket Permissions for group not found
 -219> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  5 req 8762047531418507817 0.002000023s s3:list_bucket Searching permissions for group=2 mask=49
 -218> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  5 req 8762047531418507817 0.002000023s s3:list_bucket Permissions for group not found
 -217> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  5 req 8762047531418507817 0.002000023s s3:list_bucket -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=synchronization-user, acct_name=synchronization-user, subuser=, perm_mask=15, is_admin=0), owner=admin, perm=0
 -216> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  2 overriding permissions due to system operation
 -215> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  2 req 8762047531418507817 0.002000023s s3:list_bucket verifying op params
 -214> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  2 req 8762047531418507817 0.002000023s s3:list_bucket pre-executing
 -213> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  2 req 8762047531418507817 0.002000023s s3:list_bucket check rate limiting
 -212> 2022-12-09T18:04:33.085+0000 7f69dcfd2700  2 req 8762047531418507817 0.002000023s s3:list_bucket executing
 -211> 2022-12-09T18:04:33.090+0000 7f69dcfd2700 -1 *** Caught signal (Floating point exception) **
in thread 7f69dcfd2700 thread_name:radosgw

ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
1: /lib64/libpthread.so.0(+0x12ce0) [0x7f6b6bcbbce0]
2: (RGWRados::calc_ordered_bucket_list_per_shard(unsigned int, unsigned int)+0xd) [0x7f6b6edd2b8d]
3: (RGWRados::cls_bucket_list_ordered(DoutPrefixProvider const*, RGWBucketInfo&, int, cls_rgw_obj_key const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, bool, unsigned short, boost::container::flat_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, rgw_bucket_dir_entry, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, void>&, bool*, bool*, cls_rgw_obj_key*, optional_yield, std::function<bool (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>)+0x1d2e) [0x7f6b6ee147ce]
4: (RGWRados::Bucket::List::list_objects_ordered(DoutPrefixProvider const*, long, std::vector<rgw_bucket_dir_entry, std::allocator<rgw_bucket_dir_entry> >*, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, bool> > >*, bool*, optional_yield)+0x787) [0x7f6b6ee16267]
5: (rgw::sal::RadosBucket::list(DoutPrefixProvider const*, rgw::sal::Bucket::ListParams&, int, rgw::sal::Bucket::ListResults&, optional_yield)+0x3fa) [0x7f6b6ef0a46a]
6: (RGWListBucket::execute(optional_yield)+0x347) [0x7f6b6ed7e697]
7: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Store*, bool)+0xb3f) [0x7f6b6e9d969f]
8: (process_request(rgw::sal::Store*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSink*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, std::shared_ptr<RateLimiter>, int*)+0x2616) [0x7f6b6e9dc606]
9: /lib64/libradosgw.so.2(+0x65c62d) [0x7f6b6e94b62d]
10: make_fcontext()

NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this

Related issues

Copied to rgw - Backport #58478: pacific: RGW service crashes regularly with floating point exception Resolved
Copied to rgw - Backport #58479: quincy: RGW service crashes regularly with floating point exception Resolved

History

#1 Updated by Casey Bodley about 1 year ago

  • Assignee set to J. Eric Ivancich

#2 Updated by J. Eric Ivancich about 1 year ago

  • Status changed from New to Fix Under Review
  • Target version set to v18.0.0
  • Backport set to quincy,pacific
  • Pull request ID set to 49648

I've submitted a fix for the proximate issue -- the floating point exception. We'll still need to figure out why the system believes there are 0 shards.

#3 Updated by J. Eric Ivancich about 1 year ago

To start digging into the underlying issue, let me ask you to execute two commands. Assuming your bucket is named "bkt1" (you'll need to substitute), first issue:

$ radosgw-admin metadata get bucket:bkt1

In the resulting JSON, you'll find the field bucket_id, please note that as you'll need it in the next command.

"bucket_id": "d66f5e12-44a4-4997-8cef-ff67c6de85e7.4147.1",

Now issue:

$ radosgw-admin metadata get bucket.instance:bkt1:d66f5e12-44a4-4997-8cef-ff67c6de85e7.4147.1

Again, you'll substitute the bucket name and bucket_id.

Please provide the output of that. One field of interest is "num_shards".

Thanks!

#4 Updated by Jonathan Bayer about 1 year ago

I believe this is the bucket responsible. It's a very large bucket intended to store > 1 billion objects. We're currently using this for testing for a use-case later in the year.

{
    "key": "bucket.instance:mimic.da.spoolers.perf3:c30362c1-296e-4beb-953b-e7a505ce8c3b.4596016.1",
    "ver": {
        "tag": "_Vo61gFoJwC00IPfFxC1q_JP",
        "ver": 10
    },
    "mtime": "2022-12-19T19:25:20.157277Z",
    "data": {
        "bucket_info": {
            "bucket": {
                "name": "mimic.da.spoolers.perf3",
                "marker": "c30362c1-296e-4beb-953b-e7a505ce8c3b.4596016.1",
                "bucket_id": "c30362c1-296e-4beb-953b-e7a505ce8c3b.4596016.1",
                "tenant": "",
                "explicit_placement": {
                    "data_pool": "",
                    "data_extra_pool": "",
                    "index_pool": "" 
                }
            },
            "creation_time": "2022-11-11T16:49:08.776118Z",
            "owner": "admin",
            "flags": 0,
            "zonegroup": "56082fa6-e41f-40aa-9c45-b83b869effd1",
            "placement_rule": "default-placement",
            "has_instance_obj": "false",
            "quota": {
                "enabled": false,
                "check_on_raw": false,
                "max_size": -1,
                "max_size_kb": 0,
                "max_objects": -1
            },
            "num_shards": 40000,
            "bi_shard_hash_type": 0,
            "requester_pays": "false",
            "has_website": "false",
            "swift_versioning": "false",
            "swift_ver_location": "",
            "index_type": 0,
            "mdsearch_config": [],
            "reshard_status": 0,
            "new_bucket_instance_id": "" 
        },
        "attrs": [
            {
                "key": "user.rgw.acl",
                "val": "AgKBAAAAAwISAAAABQAAAGFkbWluBQAAAGFkbWluBANjAAAAAQEAAAAFAAAAYWRtaW4PAAAAAQAAAAUAAABhZG1pbgUDNgAAAAICBAAAAAAAAAAFAAAAYWRtaW4AAAAAAAAAAAICBAAAAA8AAAAFAAAAYWRtaW4AAAAAAAAAAAAAAAAAAAAA" 
            },
            {
                "key": "user.rgw.lc",
                "val": "AQGZAAAAAQAAABcAAABXYXJwRXhwaXJhdGlvbkJlbmNobWFyawYBdAAAABcAAABXYXJwRXhwaXJhdGlvbkJlbmNobWFyawAAAAAHAAAARW5hYmxlZAMCCgAAAAIAAAAxOAAAAAADAggAAAAAAAAAAAAAAAMCCQAAAAEAAAAxAAAAAAACAQ4AAAAAAAAAAQEEAAAAAAAAAAAAAAAAAAAA" 
            }
        ]
    }
}

#5 Updated by J. Eric Ivancich about 1 year ago

Hey Jonathan,

Was that command issued in the zone (using the same RGW) as where the crashes occur? The reason I ask is that is that the floating point error would seem to be division by zero, which would mean the number of shards is zero. But the output you shared shows the number of shards as 40000.

Thanks,

Eric

#6 Updated by Jonathan Bayer about 1 year ago

Hi Eric,

It was indeed, and when checking the rest of the buckets on the system we don't have any with zero shards.

grep num_shards /home/jbayer/buckstats 
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 11,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 11,
        "num_shards": 1001,
        "num_shards": 11,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 40000,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,
        "num_shards": 1001,

Can you think of any other combination of things that could lead to these crashes?

#7 Updated by J. Eric Ivancich about 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#8 Updated by Backport Bot about 1 year ago

  • Copied to Backport #58478: pacific: RGW service crashes regularly with floating point exception added

#9 Updated by Backport Bot about 1 year ago

  • Copied to Backport #58479: quincy: RGW service crashes regularly with floating point exception added

#10 Updated by Backport Bot about 1 year ago

  • Tags changed from rgw crash to rgw crash backport_processed

#11 Updated by Steven Goodliff 7 months ago

Hi,

I'm also seeing this error on 17.2.5, is there a good way to determine which bucket seems to have zero shards ? if that is the underlying cause.

thanks

#12 Updated by Konstantin Shalygin 4 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100

Also available in: Atom PDF