Bug #64244
Admin Ops socket crashes RGW
Status:
Open
% Done:
0%
Source:
Community (user)
Tags:
backport_processed
Backport:
quincy reef squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
I am encountering a problem with the RGW Admin Ops Socket.
I am setting up the socket as follows:
rgw_enable_ops_log = true
rgw_ops_log_socket_path = /tmp/ops/rgw-ops.socket
rgw_ops_log_data_backlog = 16Mi
The socket seems to fill up over time and never gets flushed; at some point the process exceeds its file size limit (the crash below reports "File size limit exceeded") and dies.
I read the ops log socket with a plain Unix socket client.
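For reference, a minimal sketch of such a client (the path is the one configured above; as I understand it, RGW listens on this socket and pushes ops log entries to whichever client connects):

import socket

# Path configured via rgw_ops_log_socket_path (see config above).
SOCKET_PATH = "/tmp/ops/rgw-ops.socket"

# RGW owns and listens on this Unix stream socket; the consumer just
# connects and reads the ops log entries RGW writes to the connection.
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(SOCKET_PATH)

try:
    while True:
        chunk = sock.recv(65536)
        if not chunk:  # RGW closed the connection
            break
        print(chunk.decode("utf-8", errors="replace"), end="")
finally:
    sock.close()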
See the log here:
     0> 2024-01-25T13:10:13.908+0000 7f247b00eb00 -1 *** Caught signal (File size limit exceeded) **
 in thread 7f247b00eb00 thread_name:ops_log_file

 ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/ 5 rgw_access
   1/ 5 rgw_dbstore
   1/ 5 rgw_flight
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_t
   0/ 5 seastore_cleaner
   0/ 5 seastore_epm
   0/ 5 seastore_lba
   0/ 5 seastore_fixedkv_tree
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 seastore_backref
   0/ 5 alienstore
   1/ 5 mclock
   0/ 5 cyanstore
   1/ 5 ceph_exporter
   1/ 5 memstore
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f2472a89b00 / safe_timer
  7f2472cadb00 / radosgw
  ...
  log_file /var/lib/ceph/crash/2024-01-25T13:10:13.909546Z_01ee6e6a-e946-4006-9d32-e17ef2f9df74/log
--- end dump of recent events ---
reraise_fatal: default handler for signal 25 didn't terminate the process?
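For context: signal 25 on Linux x86-64 is SIGXFSZ, which the kernel delivers when a write would grow a file past the process's RLIMIT_FSIZE resource limit. A quick sketch to confirm the limit a process runs under (this is illustrative only, not taken from my setup):

import resource
import signal

# On Linux x86-64, SIGXFSZ is signal 25: delivered when a write would
# grow a file beyond the process's RLIMIT_FSIZE limit.
assert signal.SIGXFSZ == 25

# File-size limit of the current process; RLIM_INFINITY means unlimited.
soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
print("RLIMIT_FSIZE soft/hard:", soft, hard)

# For a running radosgw, the same information is in /proc/<pid>/limits
# (the "Max file size" row); <pid> is a placeholder for the RGW process ID.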
I am also noticing that, at some point before this termination, the RGW log level appears to increase. (The negative "-N>" prefixes in the second excerpt are characteristic of Ceph's in-memory recent-events buffer, which is dumped on a crash and retains entries at higher debug levels than the live log.)
Normal
2024-01-25T15:47:58.444+0000 7fe98a5c0b00  1 ====== starting new request req=0x7fe98712c720 =====
2024-01-25T15:47:58.548+0000 7fe98b700b00  1 ====== req done req=0x7fe98712c720 op status=0 http_status=200 latency=0.104001537s ======
2024-01-25T15:47:58.548+0000 7fe98b700b00  1 beast: 0x7fe98712c720: redacted - redacted [25/Jan/2024:15:47:58.444 +0000] "PUT /redacted/redacted/chunks/27/27242/27242514_10_4194304 HTTP/1.1" 200 4194304 - "redacted" - latency=0.104001537s
Close to crash
  -509> 2024-01-25T14:54:31.588+0000 7f5186648b00  1 ====== starting new request req=0x7f517ffca720 =====
  -508> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s initializing for trans_id = tx0000023a42eb7515dcdc0-0065b27627-823feaa-central
  -507> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s getting op 1
  -506> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj verifying requester
  -505> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj normalizing buckets and tenants
  -504> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj init permissions
  -503> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj recalculating target
  -502> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj reading permissions
  -501> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj init op
  -500> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj verifying op mask
  -499> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj verifying op permissions
  -498> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Searching permissions for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted, subuser=, perm_mask=15, is_admin=0) mask=50
  -497> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Searching permissions for uid=redacted
  -496> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Found permission: 15
  -495> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Searching permissions for group=1 mask=50
  -494> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Permissions for group not found
  -493> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Searching permissions for group=2 mask=50
  -492> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj Permissions for group not found
  -491> 2024-01-25T14:54:31.588+0000 7f5186648b00  5 req 2568229052387020224 0.000000000s s3:put_obj -- Getting permissions done for identity=rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=redacted, acct_name=redacted, subuser=, perm_mask=15, is_admin=0), owner=redacted, perm=2
  -490> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj verifying op params
  -489> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj pre-executing
  -488> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj check rate limiting
  -487> 2024-01-25T14:54:31.588+0000 7f5186648b00  2 req 2568229052387020224 0.000000000s s3:put_obj executing
  -486> 2024-01-25T14:54:31.624+0000 7f5183898b00  5 req 2568229052387020224 0.036000550s s3:put_obj NOTICE: call to do_aws4_auth_completion
  -485> 2024-01-25T14:54:31.624+0000 7f5183898b00  5 req 2568229052387020224 0.036000550s s3:put_obj NOTICE: call to do_aws4_auth_completion
  -484> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req 2568229052387020224 0.092001401s s3:put_obj completing
  -483> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req 2568229052387020224 0.092001401s s3:put_obj op status=0
  -482> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  2 req 2568229052387020224 0.092001401s s3:put_obj http status=200
  -481> 2024-01-25T14:54:31.680+0000 7f5185bc8b00  1 ====== req done req=0x7f517ffca720 op status=0 http_status=200 latency=0.092001401s ======
I could not find the culprit, as I am not storing the full RGW logs.
For reference, see the related mailing list thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/5V2OVJFVKYHLRK6LGER5RPE74VF2GBSL/