Bug #51574

Segfault when uploading file

Added by Jan Graichen almost 3 years ago. Updated 11 months ago.

Status:
Fix Under Review
Priority:
Normal
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We recently upgraded our cluster to 16.2.4 but got segmentation faults in radosgw when uploading files.

At first, I thought we were hit by https://tracker.ceph.com/issues/50556, since only very few uploads worked and we are using bucket policies, but I was able to reproduce the issue with the following devel versions too. As far as I know, they should already include the backport from #50556.

16.2.4-568-g2e1902f3
16.2.4-670-g468a1be6

I ran radosgw via Docker to reproduce the issue:

docker run --rm -it --net=host --user 64045:64045 -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ --name rgw.compute3 ceph/daemon-base:latest-pacific-devel@sha256:ce85def02b46df732434a553f0f343edd51ddbf67c1e0dc0a5b1ed19f32923ae radosgw -d --id rgw.test --keyring /etc/ceph/ceph.client.rgw.test.keyring --debug 255
2021-07-07T15:20:26.618+0000 7ff5f64e3440  0 ceph version 16.2.4-568-g2e1902f3 (2e1902f3a43860da461e68ebea5ef8dd48418278) pacific (stable), process radosgw, pid 1
2021-07-07T15:20:26.618+0000 7ff5f64e3440  0 framework: civetweb
2021-07-07T15:20:26.618+0000 7ff5f64e3440  0 framework conf key: port, val: 127.0.0.1:6080
2021-07-07T15:20:26.618+0000 7ff5f64e3440  1 radosgw_Main not setting numa affinity
2021-07-07T15:20:26.618+0000 7ff5f64e3440 -1 asok(0x55ba6c6e4000) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.rgw.test.1.94259171456320.asok': (13) Permission denied
2021-07-07T15:20:26.910+0000 7ff5f64e3440  0 framework: beast
2021-07-07T15:20:26.910+0000 7ff5f64e3440  0 framework conf key: ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
2021-07-07T15:20:26.910+0000 7ff5f64e3440  0 framework conf key: ssl_private_key, val: config://rgw/cert/$realm/$zone.key
2021-07-07T15:20:26.910+0000 7ff5f64e3440  0 starting handler: civetweb
2021-07-07T15:20:27.002+0000 7ff5bd0e8700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.30, sleep 5, try again
2021-07-07T15:20:27.018+0000 7ff5f64e3440  1 mgrc service_daemon_register rgw.52645456 metadata {arch=x86_64,ceph_release=pacific,ceph_version=ceph version 16.2.4-568-g2e1902f3 (2e1902f3a43860da461e68ebea5ef8dd48418278) pacific (stable),ceph_version_short=16.2.4-568-g2e1902f3,cpu=AMD EPYC 7302P 16-Core Processor,distro=centos,distro_description=CentOS Linux 8,distro_version=8,frontend_config#0=civetweb port=127.0.0.1:6080,frontend_type#0=civetweb,hostname=core-a,id=test,kernel_description=#52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020,kernel_version=5.4.0-48-generic,mem_swap_kb=16759804,mem_total_kb=131448768,num_handles=1,os=Linux,pid=1,zone_id=5d41157e-dd10-42a1-99c7-542bf1fc6645,zone_name=default,zonegroup_id=99c1add5-41f3-4b7a-b2bd-32a84919c2db,zonegroup_name=default}
2021-07-07T15:20:27.022+0000 7ff5b90e0700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.5, sleep 5, try again
2021-07-07T15:21:02.215+0000 7ff5b48d7700  1 ====== starting new request req=0x7ff5b48ced10 =====
2021-07-07T15:21:02.223+0000 7ff5b48d7700  1 ====== req done req=0x7ff5b48ced10 op status=0 http_status=200 latency=0.008000219s ======
2021-07-07T15:21:02.223+0000 7ff5b48d7700  1 civetweb: 0x55ba6d864000: 127.0.0.1 - - [07/Jul/2021:15:21:02 +0000] "OPTIONS /uploads HTTP/1.0" 200 354 https://example.org/ Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0
2021-07-07T15:21:02.271+0000 7ff5b48d7700  1 ====== starting new request req=0x7ff5b48ced10 =====
2021-07-07T15:21:02.307+0000 7ff5b48d7700  0 req 2 0.036000986s s3:post_obj Signature verification algorithm AWS v4 (AWS4-HMAC-SHA256)
2021-07-07T15:21:02.307+0000 7ff5b48d7700  0 req 2 0.036000986s Signature verification algorithm AWS v4 (AWS4-HMAC-SHA256)
2021-07-07T15:21:02.311+0000 7ff5b48d7700  1 policy condition check $key [uploads/0f45545d-09ac-4040-a744-93aa3ddc4c47/13fec8c30338d94b6767ac4c6f54df14215b1d241e9719deaa5cc74608f43398_1.jpg] uploads/0f45545d-09ac-4040-a744-93aa3ddc4c47/ [uploads/0f45545d-09ac-4040-a744-93aa3ddc4c47/]
2021-07-07T15:21:02.311+0000 7ff5b48d7700  1 policy condition check $Content-Type [image/jpeg]  []
*** Caught signal (Segmentation fault) **
 in thread 7ff5b48d7700 thread_name:civetweb-worker
 ceph version 16.2.4-568-g2e1902f3 (2e1902f3a43860da461e68ebea5ef8dd48418278) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7ff5ea8beb20]
 2: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7ff5f5717123]
 3: (rgw::sal::RGWObject::get_obj() const+0x20) [0x7ff5f5747320]
 4: (RGWPostObj::execute(optional_yield)+0xb0) [0x7ff5f5a76250]
 5: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xb12) [0x7ff5f56f5a82]
 6: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x2851) [0x7ff5f56f98d1]
 7: (RGWCivetWebFrontend::process(mg_connection*)+0x29b) [0x7ff5f562fa8b]
 8: /lib64/libradosgw.so.2(+0x62a8f6) [0x7ff5f57c88f6]
 9: /lib64/libradosgw.so.2(+0x62c567) [0x7ff5f57ca567]
 10: /lib64/libradosgw.so.2(+0x62ca28) [0x7ff5f57caa28]
 11: /lib64/libpthread.so.0(+0x814a) [0x7ff5ea8b414a]
 12: clone()
2021-07-07T15:21:02.315+0000 7ff5b48d7700 -1 *** Caught signal (Segmentation fault) **
 in thread 7ff5b48d7700 thread_name:civetweb-worker

 ceph version 16.2.4-568-g2e1902f3 (2e1902f3a43860da461e68ebea5ef8dd48418278) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7ff5ea8beb20]
 2: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7ff5f5717123]
 3: (rgw::sal::RGWObject::get_obj() const+0x20) [0x7ff5f5747320]
 4: (RGWPostObj::execute(optional_yield)+0xb0) [0x7ff5f5a76250]
 5: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xb12) [0x7ff5f56f5a82]
 6: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x2851) [0x7ff5f56f98d1]
 7: (RGWCivetWebFrontend::process(mg_connection*)+0x29b) [0x7ff5f562fa8b]
 8: /lib64/libradosgw.so.2(+0x62a8f6) [0x7ff5f57c88f6]
 9: /lib64/libradosgw.so.2(+0x62c567) [0x7ff5f57ca567]
 10: /lib64/libradosgw.so.2(+0x62ca28) [0x7ff5f57caa28]
 11: /lib64/libpthread.so.0(+0x814a) [0x7ff5ea8b414a]
 12: clone()

This completely blocks us from upgrading radosgw, as most buckets and uploads in our cloud are affected. We are currently running all components on 16.2.4 (via Debian packages), but only radosgw on v15.2 (via Docker).


Related issues 1 (0 open, 1 closed)

Is duplicate of rgw - Bug #50556: Reproducible crash on multipart upload to bucket with policy (Resolved, Or Friedmann)

Actions #1

Updated by Casey Bodley almost 3 years ago

  • Assignee set to Daniel Gryniewicz
  • Backport set to pacific
Actions #2

Updated by Daniel Gryniewicz almost 3 years ago

I'm not sure how to find out what's in those devel versions, but that is, indeed, the fix.

Actions #3

Updated by Daniel Gryniewicz almost 3 years ago

  • Is duplicate of Bug #50556: Reproducible crash on multipart upload to bucket with policy added
Actions #4

Updated by Jan Graichen almost 3 years ago

Thanks for investigating.

> I'm not sure how to find out what's in those devel versions, but that is, indeed, the fix.

I'd assume that the images match the git SHA mentioned in the version:

  • 16.2.4-568-g2e1902f3 -> 2e1902f3
  • 16.2.4-670-g468a1be6 -> 468a1be6

Anyhow, I was able to reproduce this exact error on the just-released v16.2.5 too. Shall I open a new bug?

> docker run --rm -it --net=host --user 64045:64045 -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ --name rgw.compute3 ceph/ceph:v16.2.5 radosgw -d --id rgw.test --keyring /etc/ceph/ceph.client.rgw.test.keyring --debug 255
[..]
   -18> 2021-07-09T06:46:11.284+0000 7fa08e290700  1 ====== starting new request req=0x7fa08e287d10 =====
   -17> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s initializing for trans_id = tx000000000000000000003-0060e7f0b3-3241caa-default
   -16> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s getting op 4
   -15> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s s3:post_obj verifying requester
   -14> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s s3:post_obj normalizing buckets and tenants
   -13> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s s3:post_obj init permissions
   -12> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s s3:post_obj recalculating target
   -11> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s s3:post_obj reading permissions
   -10> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj init op
    -9> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj verifying op mask
    -8> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj verifying op permissions
    -7> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj verifying op params
    -6> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj pre-executing
    -5> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj executing
    -4> 2021-07-09T06:46:11.288+0000 7fa08e290700  0 req 3 0.004000111s s3:post_obj Signature verification algorithm AWS v4 (AWS4-HMAC-SHA256)
    -3> 2021-07-09T06:46:11.288+0000 7fa08e290700  0 req 3 0.004000111s Signature verification algorithm AWS v4 (AWS4-HMAC-SHA256)
    -2> 2021-07-09T06:46:11.292+0000 7fa08e290700  1 policy condition check $key [uploads/db4a86b1-c580-40b8-92cc-66a7cbe32e90/001.png] uploads/db4a86b1-c580-40b8-92cc-66a7cbe32e90/ [uploads/db4a86b1-c580-40b8-92cc-66a7cbe32e90/]
    -1> 2021-07-09T06:46:11.292+0000 7fa08e290700  1 policy condition check $Content-Type [image/png]  []
     0> 2021-07-09T06:46:11.300+0000 7fa08e290700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fa08e290700 thread_name:civetweb-worker

 ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7fa0c4277b20]
 2: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7fa0cf0cf103]
 3: (rgw::sal::RGWObject::get_obj() const+0x20) [0x7fa0cf0ff300]
 4: (RGWPostObj::execute(optional_yield)+0xb0) [0x7fa0cf42e230]
 5: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xb12) [0x7fa0cf0ada62]
 6: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x2851) [0x7fa0cf0b18b1]
 7: (RGWCivetWebFrontend::process(mg_connection*)+0x29b) [0x7fa0cefe7a6b]
 8: /lib64/libradosgw.so.2(+0x62a8d6) [0x7fa0cf1808d6]
 9: /lib64/libradosgw.so.2(+0x62c547) [0x7fa0cf182547]
 10: /lib64/libradosgw.so.2(+0x62ca08) [0x7fa0cf182a08]
 11: /lib64/libpthread.so.0(+0x814a) [0x7fa0c426d14a]
 12: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  140323209127680 / safe_timer
  140323225913088 / ms_dispatch
  140323234305792 / ceph_timer
  140323251091200 / io_context_pool
  140327556548352 / civetweb-worker
  140327615297280 / rgw_user_st_syn
  140327632082688 / lifecycle_thr_2
  140327665653504 / lifecycle_thr_1
  140327699224320 / lifecycle_thr_0
  140327833507584 / rgw_obj_expirer
  140327841900288 / rgw_gc
  140327858685696 / safe_timer
  140327875471104 / ms_dispatch
  140327900649216 / io_context_pool
  140327909041920 / rgw_dt_lg_renew
  140328194393856 / safe_timer
  140328211179264 / ms_dispatch
  140328219571968 / ceph_timer
  140328236357376 / io_context_pool
  140328295106304 / service
  140328303499008 / msgr-worker-2
  140328311891712 / msgr-worker-1
  140328320284416 / msgr-worker-0
  140328659690560 / radosgw
  max_recent     10000
  max_new        10000
  log_file /var/lib/ceph/crash/2021-07-09T06:46:11.300998Z_c27148e4-df0a-49dc-9839-8ae20e332b14/log
--- end dump of recent events ---
reraise_fatal: default handler for signal 11 didn't terminate the process?

Actions #5

Updated by Daniel Gryniewicz almost 3 years ago

No, we'll track in this one.

Actions #6

Updated by Jan Graichen almost 3 years ago

Thanks! If you need any more information, or if there is a docker/wip build that can be tested, please tell me.

Actions #7

Updated by Jan Graichen over 2 years ago

Is there anything I can help with?

Actions #8

Updated by Daniel Gryniewicz over 2 years ago

Do you have a reproducer? I've tried a few times, and failed to get the crash.

Actions #9

Updated by Casey Bodley over 2 years ago

  • Status changed from New to Need More Info
Actions #10

Updated by Jan Graichen over 2 years ago

I apologize for my delayed response. I'll try the latest images again and check if they still fail on our existing cluster. If that is the case, I will try to reproduce it on a new cluster.

Actions #11

Updated by Jan Graichen over 2 years ago

The issue is still happening on our cluster with 16.2.6.

I am still working on reproducing it on a clean demo system, but the problem appears to be related to an embedded policy in a presigned URL used for upload:

```
{
  "expiration": "2021-11-22T20:12:19Z",
  "conditions": [
    { "bucket": "public-uploads" },
    [ "starts-with", "$key", "uploads/927d8bc1-b8cb-4f61-9730-ae60a9e67cf2/" ],
    { "acl": "private" },
    { "x-amz-meta-x-purpose": "user-upload" },
    { "x-amz-meta-x-state": "accepted" },
    [ "starts-with", "$Content-Type", "" ],
    { "x-amz-credential": "GJZAXZUI3304UX7CGXCY/20211122/default/s3/aws4_request" },
    { "x-amz-algorithm": "AWS4-HMAC-SHA256" },
    { "x-amz-date": "20211122T171219Z" }
  ]
}
```

The following condition might be the issue:

```
[
"starts-with",
"$Content-Type",
""
],
```

This would match the last log line before the fault:

```
-1> 2021-07-09T06:46:11.292+0000 7fa08e290700 1 policy condition check $Content-Type [image/png] []
```

I ran some tests on the existing system without that condition and radosgw didn't crash, but the upload was rejected with `env var missing in policy: Content-Type`. I'll try to provide a script to reproduce this.
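
For illustration, here is a minimal sketch of how a presigned POST with the same kind of empty-string `starts-with` condition on `$Content-Type` can be generated. This is a Python/boto3 sketch, not our actual Ruby reproducer; the endpoint, bucket name, and credentials are placeholders:

```
# Hypothetical sketch only: reproduces the shape of the policy above, not the
# exact one our application generates.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:6080",   # placeholder radosgw endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

post = s3.generate_presigned_post(
    Bucket="public-uploads",
    Key="uploads/${filename}",
    Conditions=[
        ["starts-with", "$key", "uploads/"],
        ["starts-with", "$Content-Type", ""],  # the condition suspected to trigger the segfault
    ],
    ExpiresIn=3600,
)

# post["url"] and post["fields"] are then submitted as a multipart/form-data
# POST together with the file and a Content-Type form field.
print(post["url"])
print(post["fields"])
```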

Actions #12

Updated by Jan Graichen over 2 years ago

I was able to reproduce the bug on a fresh Ceph cluster using the `ceph/daemon` demo mode. A bucket policy is needed to trigger the segfault.

This repository (https://github.com/jgraichen/ceph-rgw-51574) contains a Ruby script that creates a bucket, assigns the policy, and uploads a file. This reliably crashes the radosgw daemon.
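
As a rough outline of the setup step, here is a Python/boto3 sketch for illustration only; the repository's script is Ruby, and the bucket name, endpoint, credentials, and policy below are placeholders rather than the exact values used there:

```
import json
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:6080",   # placeholder radosgw endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

bucket = "public-uploads"
s3.create_bucket(Bucket=bucket)

# Attach a bucket policy; having a policy on the bucket appears to be what
# puts the POST upload onto the crashing code path.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": ["*"]},
        "Action": ["s3:GetObject"],
        "Resource": [f"arn:aws:s3:::{bucket}/*"],
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

# The upload itself then goes through a presigned POST (see the sketch in the
# previous comment); with the policy in place, that request crashes radosgw.
```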

Actions #13

Updated by Jan Graichen over 2 years ago

Is there any more information I can provide?

Actions #14

Updated by Jan Graichen about 2 years ago

I confirmed that the test script in https://github.com/jgraichen/ceph-rgw-51574 also results in a segfault on 16.2.7:

```
[root@989caa0a50fe /]# radosgw --version
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
```

Current devel versions are still affected too.

Is there anything I can help you with? Please tell me; I do hope to receive email notifications from Redmine now.

Actions #15

Updated by Daniel Gryniewicz about 2 years ago

  • Pull request ID set to 45060
Actions #16

Updated by Daniel Gryniewicz about 2 years ago

  • Status changed from Need More Info to Fix Under Review
Actions #18

Updated by Jan Graichen almost 2 years ago

I ran the test script at https://github.com/jgraichen/ceph-rgw-51574 on 17.2.0 (quay.io/ceph/daemon:latest):

[root@911007478a97 /]# ceph --version
ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)

When sending the request from the test script, the radosgw process still crashes.

This confirms that 17.2.0 is affected too.

Is there anything I can do to help fix this bug? It is getting quite old and now spans two major versions. Our cluster is still stuck on v16 Ceph with a v15 radosgw because all newer versions segfault. Is the only option not to use bucket policies?

Actions #19

Updated by Daniel Gryniewicz almost 2 years ago

The fix in question will be in 16.2.8, since it was committed after 16.2.7 was released.

Actions #20

Updated by Jan Graichen almost 2 years ago

Shall I open a new issue for the segfault on 17.2.0?

Actions #21

Updated by Jan Graichen almost 2 years ago

Ceph Quincy 17.2.1 is affected too. The test script reliably crashes the radosgw daemon: https://github.com/jgraichen/ceph-rgw-51574/runs/7385299348?check_suite_focus=true

Shall I open a new ticket for that?

Actions #22

Updated by Jan Graichen over 1 year ago

Hello,

Since Ceph Quincy 17.2.3 still segfaults with the same test script that Pacific failed on before the fix, we still cannot upgrade Ceph or radosgw to Quincy.

Shall I open a new ticket to track the bug for Ceph Quincy 17.2.3?

Best regards

Actions #23

Updated by Jan Graichen over 1 year ago

Here is the stack trace from running the test script with ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable):

   -19> 2022-10-04T17:16:30.575+0000 7f93bf535700  1 ====== starting new request req=0x7f945c8d3650 =====
   -18> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s initializing for trans_id = tx00000770811328923926f-00633c6a6e-1014-default
   -17> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s getting op 4
   -16> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj verifying requester
   -15> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj normalizing buckets and tenants
   -14> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj init permissions
   -13> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj recalculating target
   -12> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj reading permissions
   -11> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj init op
   -10> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj verifying op mask
    -9> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj verifying op permissions
    -8> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj verifying op params
    -7> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj pre-executing
    -6> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj check rate limiting
    -5> 2022-10-04T17:16:30.575+0000 7f93bf535700  2 req 8577124399073956463 0.000000000s s3:post_obj executing
    -4> 2022-10-04T17:16:30.575+0000 7f93bf535700  0 req 8577124399073956463 0.000000000s s3:post_obj Signature verification algorithm AWS v4 (AWS4-HMAC-SHA256)
    -3> 2022-10-04T17:16:30.575+0000 7f93bf535700  0 req 8577124399073956463 0.000000000s Signature verification algorithm AWS v4 (AWS4-HMAC-SHA256)
    -2> 2022-10-04T17:16:30.575+0000 7f93bf535700  1 policy condition check $key [uploads/test.txt] uploads/ [uploads/]
    -1> 2022-10-04T17:16:30.575+0000 7f93bf535700  1 policy condition check $Content-Type []  []
     0> 2022-10-04T17:16:30.579+0000 7f93bf535700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f93bf535700 thread_name:radosgw

 ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f94587a8cf0]
 2: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7f945b4eb413]
 3: (rgw::sal::Object::get_obj() const+0x2b) [0x7f945b6d4beb]
 4: (RGWPostObj::execute(optional_yield)+0xa4) [0x7f945b876c34]
 5: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Store*, bool)+0xb3f) [0x7f945b4c669f]
 6: (process_request(rgw::sal::Store*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSink*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, std::shared_ptr<RateLimiter>, int*)+0x2616) [0x7f945b4c9606]
 7: /lib64/libradosgw.so.2(+0x659f7a) [0x7f945b435f7a]
 8: /lib64/libradosgw.so.2(+0x65b621) [0x7f945b437621]
 9: /lib64/libradosgw.so.2(+0x65b79c) [0x7f945b43779c]
 10: make_fcontext()

You can get the same logs for each tested ceph version here: https://github.com/jgraichen/ceph-rgw-51574/actions/runs/3184091709

Actions #24

Updated by Thomas Mertz 11 months ago

Hello,

I have the same problem on Ceph 17.2.3.

Will the fix be backported to Quincy? I think 17.2.4 and 17.2.5 are also impacted.

Best regards.

Actions #25

Updated by Daniel Gryniewicz 11 months ago

This fix only applies to Pacific. Quincy and later received a different (much more extensive) fix for this particular crash.
