Bug #64244

Admin Ops socket crashes RGW

Added by Marc Singer 4 months ago. Updated about 1 month ago.

Status: Pending Backport
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source: Community (user)
Tags: backport_processed
Backport: quincy, reef, squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am encountering a problem with the RGW Admin Ops Socket.

I am setting up the socket as follows:

rgw_enable_ops_log = true
rgw_ops_log_socket_path = /tmp/ops/rgw-ops.socket
rgw_ops_log_data_backlog = 16Mi
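
For reference, the same options can also be set at runtime with ceph config set instead of ceph.conf (a sketch; the exact target section depends on the deployment, and the socket path change may require an RGW restart to take effect):

ceph config set client.rgw rgw_enable_ops_log true
ceph config set client.rgw rgw_ops_log_socket_path /tmp/ops/rgw-ops.socket
ceph config set client.rgw rgw_ops_log_data_backlog 16Mi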

It seems the socket fills up over time and is never flushed; at some point the process runs into a file size limit.

I read from the ops log socket with a plain Unix domain socket client.
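
The client is nothing special; here is a minimal sketch of the kind of reader I use (in Python, assuming the socket streams raw bytes that just need to be drained continuously, with the path matching rgw_ops_log_socket_path above):

import socket

SOCK_PATH = "/tmp/ops/rgw-ops.socket"  # rgw_ops_log_socket_path from above

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
    sock.connect(SOCK_PATH)
    while True:
        chunk = sock.recv(65536)  # keep draining so RGW's backlog stays bounded
        if not chunk:             # RGW closed its end of the socket
            break
        print(chunk.decode("utf-8", errors="replace"), end="")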

See the log here:

0> 2024-01-25T13:10:13.908+0000 7f247b00eb00 -1 *** Caught signal (File size limit exceeded) **
 in thread 7f247b00eb00 thread_name:ops_log_file

 ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/ 5 rgw_access
   1/ 5 rgw_dbstore
   1/ 5 rgw_flight
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_t
   0/ 5 seastore_cleaner
   0/ 5 seastore_epm
   0/ 5 seastore_lba
   0/ 5 seastore_fixedkv_tree
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 seastore_backref
   0/ 5 alienstore
   1/ 5 mclock
   0/ 5 cyanstore
   1/ 5 ceph_exporter
   1/ 5 memstore
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f2472a89b00 / safe_timer
  7f2472cadb00 / radosgw
  ...
  log_file /var/lib/ceph/crash/2024-01-25T13:10:13.909546Z_01ee6e6a-e946-4006-9d32-e17ef2f9df74/log
--- end dump of recent events ---
reraise_fatal: default handler for signal 25 didn't terminate the process?
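
For reference, signal 25 on Linux is SIGXFSZ ("File size limit exceeded"), which the kernel raises when a write would grow a file past the process's RLIMIT_FSIZE. A quick, hypothetical way to check which limit the daemon runs under:

import resource

# resource.RLIM_INFINITY means no limit is set
soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
print("RLIMIT_FSIZE soft=%s hard=%s" % (soft, hard))

Equivalently, the "Max file size" row in /proc/<pid>/limits shows the same for the running radosgw process.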

I am also noticing that, at some point before this termination happens, the log level of RGW increases:

Normal

2024-01-25T15:47:58.444+0000 7fe98a5c0b00  1 ====== starting new request req=0x7fe98712c720 =====
2024-01-25T15:47:58.548+0000 7fe98b700b00  1 ====== req done req=0x7fe98712c720 op status=0 http_status=200 latency=0.104001537s ======
2024-01-25T15:47:58.548+0000 7fe98b700b00  1 beast: 0x7fe98712c720: redacted - redacted [25/Jan/2024:15:47:58.444 +0000] "PUT /redacted/redacted/chunks/27/27242/27242514_10_4194304 HTTP/1.1" 200 4194304 - "redacted" - latency=0.104001537s

Close to crash

 -509> 2024-01-25T14:54:31.588+0000 7f5186648b00  1 ====== starting new request req=0x7f517ffca720 =====
  -508> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s initializing for trans_id = tx0000023a42eb7515dcdc0-0065b27627-823feaa-central
  -507> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s getting op 1
  -506> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj verifying requester
  -505> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj normalizing buckets and tenants
  -504> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj init permissions
  -503> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj recalculating target
  -502> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj reading permissions
  -501> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj init op
  -500> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj verifying op mask
  -499> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj verifying op permissions
  -498> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted, subuser=, perm_mask=15, is_admin=0) mask=50
  -497> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Searching permissions for uid=redacted
  -496> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Found permission: 15
  -495> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Searching permissions for group=1 mask=50
  -494> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Permissions for group not found
  -493> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Searching permissions for group=2 mask=50
  -492> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Permissions for group not found
  -491> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted, subuser=, perm_mask=15, is_admin=0), owner=redacted, perm=2
  -490> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj verifying op params
  -489> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj pre-executing
  -488> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj check rate limiting
  -487> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj executing
  -486> 2024-01-25T14:54:31.624+0000 7f5183898b00  5 req 2568229052387020224 0.036000550s s3:put_obj NOTICE: call to do_aws4_auth_completion
  -485> 2024-01-25T14:54:31.624+0000 7f5183898b00  5 req 2568229052387020224 0.036000550s s3:put_obj NOTICE: call to do_aws4_auth_completion
  -484> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req 2568229052387020224 0.092001401s s3:put_obj completing
  -483> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req 2568229052387020224 0.092001401s s3:put_obj op status=0
  -482> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req 2568229052387020224 0.092001401s s3:put_obj http status=200
  -481> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  1 ====== req done req=0x7f517ffca720 op status=0 http_status=200 latency=0.092001401s ====== 

I could not find the culprit, as I am not storing the full RGW logs.

For reference, the related mailing list thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/5V2OVJFVKYHLRK6LGER5RPE74VF2GBSL/


Related issues 3 (2 open, 1 closed)

Copied to rgw - Backport #65425: reef: Admin Ops socket crashes RGW (New, Matt Benjamin)
Copied to rgw - Backport #65426: quincy: Admin Ops socket crashes RGW (New, Matt Benjamin)
Copied to rgw - Backport #65427: squid: Admin Ops socket crashes RGW (Resolved, Casey Bodley)