Bug #51574

Segfault when uploading file

Added by Jan Graichen 2 months ago. Updated 17 days ago.

Status:
New
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We recently upgraded our cluster to 16.2.4, but got segmentation faults in radosgw when uploading files.

At first, I thought we were hit by https://tracker.ceph.com/issues/50556, since very few uploads worked and we are using bucket policies, but I was able to reproduce the issue with the following devel versions too. As far as I know, they should already include the backport from #50556.

16.2.4-568-g2e1902f3
16.2.4-670-g468a1be6

I ran radosgw via docker to reproduce the issue:

docker run --rm -it --net=host --user 64045:64045 -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ --name rgw.compute3 ceph/daemon-base:latest-pacific-devel@sha256:ce85def02b46df732434a553f0f343edd51ddbf67c1e0dc0a5b1ed19f32923ae radosgw -d --id rgw.test --keyring /etc/ceph/ceph.client.rgw.test.keyring --debug 255
2021-07-07T15:20:26.618+0000 7ff5f64e3440  0 ceph version 16.2.4-568-g2e1902f3 (2e1902f3a43860da461e68ebea5ef8dd48418278) pacific (stable), process radosgw, pid 1
2021-07-07T15:20:26.618+0000 7ff5f64e3440  0 framework: civetweb
2021-07-07T15:20:26.618+0000 7ff5f64e3440  0 framework conf key: port, val: 127.0.0.1:6080
2021-07-07T15:20:26.618+0000 7ff5f64e3440  1 radosgw_Main not setting numa affinity
2021-07-07T15:20:26.618+0000 7ff5f64e3440 -1 asok(0x55ba6c6e4000) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.rgw.test.1.94259171456320.asok': (13) Permission denied
2021-07-07T15:20:26.910+0000 7ff5f64e3440  0 framework: beast
2021-07-07T15:20:26.910+0000 7ff5f64e3440  0 framework conf key: ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
2021-07-07T15:20:26.910+0000 7ff5f64e3440  0 framework conf key: ssl_private_key, val: config://rgw/cert/$realm/$zone.key
2021-07-07T15:20:26.910+0000 7ff5f64e3440  0 starting handler: civetweb
2021-07-07T15:20:27.002+0000 7ff5bd0e8700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.30, sleep 5, try again
2021-07-07T15:20:27.018+0000 7ff5f64e3440  1 mgrc service_daemon_register rgw.52645456 metadata {arch=x86_64,ceph_release=pacific,ceph_version=ceph version 16.2.4-568-g2e1902f3 (2e1902f3a43860da461e68ebea5ef8dd48418278) pacific (stable),ceph_version_short=16.2.4-568-g2e1902f3,cpu=AMD EPYC 7302P 16-Core Processor,distro=centos,distro_description=CentOS Linux 8,distro_version=8,frontend_config#0=civetweb port=127.0.0.1:6080,frontend_type#0=civetweb,hostname=core-a,id=test,kernel_description=#52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020,kernel_version=5.4.0-48-generic,mem_swap_kb=16759804,mem_total_kb=131448768,num_handles=1,os=Linux,pid=1,zone_id=5d41157e-dd10-42a1-99c7-542bf1fc6645,zone_name=default,zonegroup_id=99c1add5-41f3-4b7a-b2bd-32a84919c2db,zonegroup_name=default}
2021-07-07T15:20:27.022+0000 7ff5b90e0700  0 lifecycle: RGWLC::process() failed to acquire lock on lc.5, sleep 5, try again
2021-07-07T15:21:02.215+0000 7ff5b48d7700  1 ====== starting new request req=0x7ff5b48ced10 =====
2021-07-07T15:21:02.223+0000 7ff5b48d7700  1 ====== req done req=0x7ff5b48ced10 op status=0 http_status=200 latency=0.008000219s ======
2021-07-07T15:21:02.223+0000 7ff5b48d7700  1 civetweb: 0x55ba6d864000: 127.0.0.1 - - [07/Jul/2021:15:21:02 +0000] "OPTIONS /uploads HTTP/1.0" 200 354 https://example.org/ Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0
2021-07-07T15:21:02.271+0000 7ff5b48d7700  1 ====== starting new request req=0x7ff5b48ced10 =====
2021-07-07T15:21:02.307+0000 7ff5b48d7700  0 req 2 0.036000986s s3:post_obj Signature verification algorithm AWS v4 (AWS4-HMAC-SHA256)
2021-07-07T15:21:02.307+0000 7ff5b48d7700  0 req 2 0.036000986s Signature verification algorithm AWS v4 (AWS4-HMAC-SHA256)
2021-07-07T15:21:02.311+0000 7ff5b48d7700  1 policy condition check $key [uploads/0f45545d-09ac-4040-a744-93aa3ddc4c47/13fec8c30338d94b6767ac4c6f54df14215b1d241e9719deaa5cc74608f43398_1.jpg] uploads/0f45545d-09ac-4040-a744-93aa3ddc4c47/ [uploads/0f45545d-09ac-4040-a744-93aa3ddc4c47/]
2021-07-07T15:21:02.311+0000 7ff5b48d7700  1 policy condition check $Content-Type [image/jpeg]  []
*** Caught signal (Segmentation fault) **
 in thread 7ff5b48d7700 thread_name:civetweb-worker
 ceph version 16.2.4-568-g2e1902f3 (2e1902f3a43860da461e68ebea5ef8dd48418278) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7ff5ea8beb20]
 2: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7ff5f5717123]
 3: (rgw::sal::RGWObject::get_obj() const+0x20) [0x7ff5f5747320]
 4: (RGWPostObj::execute(optional_yield)+0xb0) [0x7ff5f5a76250]
 5: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xb12) [0x7ff5f56f5a82]
 6: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x2851) [0x7ff5f56f98d1]
 7: (RGWCivetWebFrontend::process(mg_connection*)+0x29b) [0x7ff5f562fa8b]
 8: /lib64/libradosgw.so.2(+0x62a8f6) [0x7ff5f57c88f6]
 9: /lib64/libradosgw.so.2(+0x62c567) [0x7ff5f57ca567]
 10: /lib64/libradosgw.so.2(+0x62ca28) [0x7ff5f57caa28]
 11: /lib64/libpthread.so.0(+0x814a) [0x7ff5ea8b414a]
 12: clone()
2021-07-07T15:21:02.315+0000 7ff5b48d7700 -1 *** Caught signal (Segmentation fault) **
 in thread 7ff5b48d7700 thread_name:civetweb-worker

 ceph version 16.2.4-568-g2e1902f3 (2e1902f3a43860da461e68ebea5ef8dd48418278) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7ff5ea8beb20]
 2: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7ff5f5717123]
 3: (rgw::sal::RGWObject::get_obj() const+0x20) [0x7ff5f5747320]
 4: (RGWPostObj::execute(optional_yield)+0xb0) [0x7ff5f5a76250]
 5: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xb12) [0x7ff5f56f5a82]
 6: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x2851) [0x7ff5f56f98d1]
 7: (RGWCivetWebFrontend::process(mg_connection*)+0x29b) [0x7ff5f562fa8b]
 8: /lib64/libradosgw.so.2(+0x62a8f6) [0x7ff5f57c88f6]
 9: /lib64/libradosgw.so.2(+0x62c567) [0x7ff5f57ca567]
 10: /lib64/libradosgw.so.2(+0x62ca28) [0x7ff5f57caa28]
 11: /lib64/libpthread.so.0(+0x814a) [0x7ff5ea8b414a]
 12: clone()

This completely blocks us from upgrading radosgw, as most buckets and uploads in our cloud are affected. We are currently running all components on 16.2.4 (via Debian packages), but only radosgw on v15.2 (via docker).
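For context, the crashing request is a browser-style S3 POST upload governed by a POST policy with `starts-with` conditions on `$key` and `$Content-Type` (see the `policy condition check` log lines above). A minimal sketch of how such a policy-signed POST is built, assuming placeholder credentials, region, and bucket names; `build_post_fields` is a hypothetical helper for illustration, not radosgw or boto code:

```python
# Sketch of the browser-style POST upload that triggers the crash.
# ACCESS_KEY, SECRET_KEY, REGION and the bucket/prefix are placeholders.
import base64
import datetime
import hashlib
import hmac
import json

ACCESS_KEY = "AKIAEXAMPLE"      # placeholder, not a real key
SECRET_KEY = "secretexample"    # placeholder, not a real secret
REGION, SERVICE = "default", "s3"

def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def build_post_fields(bucket: str, key_prefix: str) -> dict:
    """Build the multipart/form-data fields for an S3 POST upload whose
    policy uses starts-with conditions, as in the failing requests."""
    now = datetime.datetime.utcnow()
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    credential = f"{ACCESS_KEY}/{date_stamp}/{REGION}/{SERVICE}/aws4_request"
    policy = {
        "expiration": (now + datetime.timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],
            ["starts-with", "$Content-Type", "image/"],
            {"x-amz-algorithm": "AWS4-HMAC-SHA256"},
            {"x-amz-credential": credential},
            {"x-amz-date": amz_date},
        ],
    }
    policy_b64 = base64.b64encode(json.dumps(policy).encode()).decode()
    # SigV4 signing-key derivation, then sign the base64 policy document.
    k = _sign(("AWS4" + SECRET_KEY).encode(), date_stamp)
    k = _sign(k, REGION)
    k = _sign(k, SERVICE)
    k = _sign(k, "aws4_request")
    signature = hmac.new(k, policy_b64.encode(), hashlib.sha256).hexdigest()
    return {
        "key": key_prefix + "test.jpg",
        "Content-Type": "image/jpeg",
        "x-amz-algorithm": "AWS4-HMAC-SHA256",
        "x-amz-credential": credential,
        "x-amz-date": amz_date,
        "policy": policy_b64,
        "x-amz-signature": signature,
    }

fields = build_post_fields("uploads", "uploads/0f45545d-09ac-4040-a744-93aa3ddc4c47/")
# POST these fields (plus a trailing "file" part) as multipart/form-data
# to the bucket endpoint, e.g. http://127.0.0.1:6080/uploads
```

Per the log, the crash happens after the policy conditions are evaluated, in `RGWPostObj::execute`, so any policy-gated POST upload of this shape should exercise the failing path.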


Related issues

Duplicates rgw - Bug #50556: Reproducible crash on multipart upload to bucket with policy Resolved

History

#1 Updated by Casey Bodley 2 months ago

  • Assignee set to Daniel Gryniewicz
  • Backport set to pacific

#2 Updated by Daniel Gryniewicz 2 months ago

I'm not sure how to find out what's in those devel versions, but that is indeed the fix.

#3 Updated by Daniel Gryniewicz 2 months ago

  • Duplicates Bug #50556: Reproducible crash on multipart upload to bucket with policy added

#4 Updated by Jan Graichen 2 months ago

Thanks for investigating.

> I'm not sure how to find out what's in those devel versions, but that is, indeed the fix.

I'd assume that the images match the git sha mentioned in the version:

  • 16.2.4-568-g2e1902f3 -> 2e1902f3
  • 16.2.4-670-g468a1be6 -> 468a1be6
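The versions above follow `git describe` output (`<tag>-<n>-g<sha>`), so the build's commit can be extracted and checked against the fix in a local ceph checkout. A sketch, where `<backport-sha>` is a placeholder for the actual backport commit (not stated in this ticket):

```shell
# Extract the git sha embedded in a "git describe"-style version string.
version="16.2.4-568-g2e1902f3"
sha="${version##*-g}"   # strip everything up to and including the last "-g"
echo "$sha"             # -> 2e1902f3

# In a ceph checkout, one could then verify the fix is an ancestor of
# that build ( <backport-sha> is a placeholder ):
# git merge-base --is-ancestor <backport-sha> "$sha" && echo "fix included"
```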

Anyhow, I was able to reproduce this exact error on the just-released v16.2.5 too. Shall I open a new bug?

> docker run --rm -it --net=host --user 64045:64045 -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ --name rgw.compute3 ceph/ceph:v16.2.5 radosgw -d --id rgw.test --keyring /etc/ceph/ceph.client.rgw.test.keyring --debug 255
[..]
   -18> 2021-07-09T06:46:11.284+0000 7fa08e290700  1 ====== starting new request req=0x7fa08e287d10 =====
   -17> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s initializing for trans_id = tx000000000000000000003-0060e7f0b3-3241caa-default
   -16> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s getting op 4
   -15> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s s3:post_obj verifying requester
   -14> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s s3:post_obj normalizing buckets and tenants
   -13> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s s3:post_obj init permissions
   -12> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s s3:post_obj recalculating target
   -11> 2021-07-09T06:46:11.284+0000 7fa08e290700  2 req 3 0.000000000s s3:post_obj reading permissions
   -10> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj init op
    -9> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj verifying op mask
    -8> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj verifying op permissions
    -7> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj verifying op params
    -6> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj pre-executing
    -5> 2021-07-09T06:46:11.288+0000 7fa08e290700  2 req 3 0.004000111s s3:post_obj executing
    -4> 2021-07-09T06:46:11.288+0000 7fa08e290700  0 req 3 0.004000111s s3:post_obj Signature verification algorithm AWS v4 (AWS4-HMAC-SHA256)
    -3> 2021-07-09T06:46:11.288+0000 7fa08e290700  0 req 3 0.004000111s Signature verification algorithm AWS v4 (AWS4-HMAC-SHA256)
    -2> 2021-07-09T06:46:11.292+0000 7fa08e290700  1 policy condition check $key [uploads/db4a86b1-c580-40b8-92cc-66a7cbe32e90/001.png] uploads/db4a86b1-c580-40b8-92cc-66a7cbe32e90/ [uploads/db4a86b1-c580-40b8-92cc-66a7cbe32e90/]
    -1> 2021-07-09T06:46:11.292+0000 7fa08e290700  1 policy condition check $Content-Type [image/png]  []
     0> 2021-07-09T06:46:11.300+0000 7fa08e290700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fa08e290700 thread_name:civetweb-worker

 ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7fa0c4277b20]
 2: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7fa0cf0cf103]
 3: (rgw::sal::RGWObject::get_obj() const+0x20) [0x7fa0cf0ff300]
 4: (RGWPostObj::execute(optional_yield)+0xb0) [0x7fa0cf42e230]
 5: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xb12) [0x7fa0cf0ada62]
 6: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x2851) [0x7fa0cf0b18b1]
 7: (RGWCivetWebFrontend::process(mg_connection*)+0x29b) [0x7fa0cefe7a6b]
 8: /lib64/libradosgw.so.2(+0x62a8d6) [0x7fa0cf1808d6]
 9: /lib64/libradosgw.so.2(+0x62c547) [0x7fa0cf182547]
 10: /lib64/libradosgw.so.2(+0x62ca08) [0x7fa0cf182a08]
 11: /lib64/libpthread.so.0(+0x814a) [0x7fa0c426d14a]
 12: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  140323209127680 / safe_timer
  140323225913088 / ms_dispatch
  140323234305792 / ceph_timer
  140323251091200 / io_context_pool
  140327556548352 / civetweb-worker
  140327615297280 / rgw_user_st_syn
  140327632082688 / lifecycle_thr_2
  140327665653504 / lifecycle_thr_1
  140327699224320 / lifecycle_thr_0
  140327833507584 / rgw_obj_expirer
  140327841900288 / rgw_gc
  140327858685696 / safe_timer
  140327875471104 / ms_dispatch
  140327900649216 / io_context_pool
  140327909041920 / rgw_dt_lg_renew
  140328194393856 / safe_timer
  140328211179264 / ms_dispatch
  140328219571968 / ceph_timer
  140328236357376 / io_context_pool
  140328295106304 / service
  140328303499008 / msgr-worker-2
  140328311891712 / msgr-worker-1
  140328320284416 / msgr-worker-0
  140328659690560 / radosgw
  max_recent     10000
  max_new        10000
  log_file /var/lib/ceph/crash/2021-07-09T06:46:11.300998Z_c27148e4-df0a-49dc-9839-8ae20e332b14/log
--- end dump of recent events ---
reraise_fatal: default handler for signal 11 didn't terminate the process?

#5 Updated by Daniel Gryniewicz 2 months ago

No, we'll track in this one.

#6 Updated by Jan Graichen 2 months ago

Thanks! If you need any more information or there is a docker/wip build that can be tested, please do tell.

#7 Updated by Jan Graichen 18 days ago

Is there anything I can help with?

#8 Updated by Daniel Gryniewicz 17 days ago

Do you have a reproducer? I've tried a few times, and failed to get the crash.
