Bug #24364

rgw: s3cmd sync fails

Added by David Galloway about 1 year ago. Updated 11 months ago.

Status:
Resolved
Priority:
High
Target version:
-
Start date:
05/31/2018
Due date:
% Done:
0%

Source:
Tags:
Backport:
mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I'm trying to sync download.ceph.com packages to the Sepia Long Running Cluster, and it's failing after upgrading to mimic (13.1.0-162-g233ee97-1xenial).

I've attached the rgw log. Let me know if you need ssh access to the rgw nodes.

s3cmd output:

WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 3 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 6 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 9 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 12 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 15 sec...
ERROR: S3 Temporary Error: Request failed for: /.  Please try again later.

RGW output:

    -1> 2018-05-31 11:03:22.118 7fe53e23b700  2 req 16:0.087059:s3:GET /:list_bucket:completing
     0> 2018-05-31 11:03:22.142 7fe53e23b700 -1 *** Caught signal (Aborted) **
 in thread 7fe53e23b700 thread_name:civetweb-worker

 ceph version 13.1.0-162-g233ee97 (233ee9764e2d1702dde72165d9660fec89ab56fc) mimic (rc)
 1: (()+0x291290) [0x5615a7344290]
 2: (()+0x11390) [0x7fe571475390]
 3: (gsignal()+0x38) [0x7fe566e16428]
 4: (abort()+0x16a) [0x7fe566e1802a]
 5: (()+0x777ea) [0x7fe566e587ea]
 6: (__fortify_fail()+0x5c) [0x7fe566efa15c]
 7: (()+0x117160) [0x7fe566ef8160]
 8: (()+0x1190a7) [0x7fe566efa0a7]
 9: (()+0x27564a) [0x5615a732864a]
 10: (mg_write()+0xc1) [0x5615a732f5a1]
 11: (RGWCivetWeb::write_data(char const*, unsigned long)+0x40) [0x5615a733ab20]
 12: (rgw::io::BufferingFilter<rgw::io::ChunkingFilter<rgw::io::ConLenControllingFilter<RGWCivetWeb*> > >::complete_request()+0x133) [0x5615a72c8a93]
 13: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, int*)+0xfc3) [0x5615a7451643]
 14: (RGWCivetWebFrontend::process(mg_connection*)+0x381) [0x5615a72c6271]
 15: (()+0x28251e) [0x5615a733551e]
 16: (()+0x284208) [0x5615a7337208]
 17: (()+0x2846a8) [0x5615a73376a8]
 18: (()+0x76ba) [0x7fe57146b6ba]
 19: (clone()+0x6d) [0x7fe566ee841d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/radosgw.dl-rgw-pub01.log
--- end dump of recent events ---

dl-mirror-dump (656 KB) David Galloway, 05/31/2018 03:07 PM

debug.log.xz (71.9 KB) David Galloway, 07/26/2018 07:41 PM


Related issues

Copied to rgw - Backport #35954: mimic: rgw: s3cmd sync fails Resolved

History

#1 Updated by David Galloway about 1 year ago

  • Description updated (diff)

#2 Updated by David Galloway about 1 year ago

  • Description updated (diff)

#3 Updated by Abhishek Lekshmanan about 1 year ago

Can you share the ceph.conf and s3cmd.cfg?

#4 Updated by David Galloway about 1 year ago

# cat .s3cfg
[default]
access_key = XXXXXXXXXXXX
access_token = 
add_encoding_exts = 
add_headers = 
bucket_location = US
ca_certs_file = 
cache_file = 
check_ssl_certificate = True
check_ssl_hostname = True
#cloudfront_host = cloudfront.amazonaws.com
default_mime_type = binary/octet-stream
delay_updates = False
delete_after = False
delete_after_fetch = False
delete_removed = False
dry_run = False
enable_multipart = True
encoding = UTF-8
encrypt = False
expiry_date = 
expiry_days = 
expiry_prefix = 
follow_symlinks = False
force = False
get_continue = False
gpg_command = /usr/bin/gpg
gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_passphrase = 
guess_mime_type = True
host_base = rdu.download.ceph.com
host_bucket = %(bucket)s.rdu.download.ceph.com
human_readable_sizes = False
invalidate_default_index_on_cf = False
invalidate_default_index_root_on_cf = True
invalidate_on_cf = False
kms_key = 
limitrate = 0
list_md5 = False
log_target_prefix = 
long_listing = False
max_delete = -1
mime_type = 
multipart_chunk_size_mb = 15
multipart_max_chunks = 10000
preserve_attrs = True
progress_meter = True
proxy_host = 
proxy_port = 0
put_continue = False
recursive = False
recv_chunk = 65536
reduced_redundancy = False
requester_pays = False
restore_days = 1
secret_key = XXXXXXXXXX
send_chunk = 65536
server_side_encryption = False
signature_v2 = False
simpledb_host = sdb.amazonaws.com
skip_existing = False
socket_timeout = 300
stats = False
stop_on_error = False
storage_class = 
urlencoding_mode = normal
use_https = True
#use_https = False
use_mime_magic = True
verbosity = WARNING
#website_endpoint = http://%(bucket).rdu-website.download.ceph.com/
#website_endpoint = http://rdu-website.download.ceph.com/
website_error = 
website_index = index.html
# cat /etc/ceph/ceph.conf 

<snip>

[client.radosgw.dl-rgw-pub01]
  log file = /var/log/ceph/$id.log
  rgw dns name = rdu.download.ceph.com
  rgw dns s3website name = rdu-website.download.ceph.com
  rgw thread pool size = 1024
  rgw enable ops log = false
  ms dispatch throttle bytes = 10737418240
  rgw cache lru size = 10000
  admin socket = /var/run/ceph/ceph-$name.asok
  objecter inflight op bytes = 10737418240
  objecter inflight ops = 4096
  rgw frontends = civetweb port=[::]:7480 access_log_file=/var/log/civetweb/access.log error_log_file=/var/log/civetweb/error.log request_timeout_ms=300000
  admin socket = /var/run/ceph/ceph-$name.asok
  rgw cache enabled = 1
  rgw resolve cname = true
  rgw enable usage log = true
  rgw expose bucket = true
  rgw zonegroup = default
  rgw_enable_apis = s3, swift, swift_auth, admin # no Website config
  rgw zone = default
  rgw enable static website = true
  rgw_max_chunk_size = 4194304

#5 Updated by Yehuda Sadeh about 1 year ago

  • Status changed from New to Triaged
  • Assignee set to Abhishek Lekshmanan
  • Priority changed from Normal to High

#6 Updated by Yehuda Sadeh about 1 year ago

  • Subject changed from mimic: s3cmd sync fails to rgw: s3cmd sync fails

#7 Updated by Abhishek Lekshmanan about 1 year ago

Is this still reproducible in the setup? I tried reproducing this locally, but I wasn't successful.

#8 Updated by David Galloway about 1 year ago

I can't reproduce using a different folder from my workstation, but I do still get the same error when trying to sync from download.ceph.com to the RGW. Are there more logs or debug output I could gather for you?

#9 Updated by Abhishek Lekshmanan about 1 year ago

Is this still reproducible with the latest mimic (13.2.0)?

#10 Updated by David Galloway about 1 year ago

I updated the RGWs to 13.2.0 and am still seeing the same issue.

#11 Updated by Abhishek Lekshmanan about 1 year ago

Can we have this reproduced with debug rgw = 20 and debug ms = 1?
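
For reference, these levels can be set in the [client.radosgw.dl-rgw-pub01] section of ceph.conf and the daemon restarted, or injected at runtime through the admin socket configured in comment #4. A sketch, assuming the asok path that config produces:

# persistent: add under [client.radosgw.dl-rgw-pub01], then restart radosgw
debug rgw = 20
debug ms = 1

# or at runtime via the admin socket
ceph daemon /var/run/ceph/ceph-client.radosgw.dl-rgw-pub01.asok config set debug_rgw 20
ceph daemon /var/run/ceph/ceph-client.radosgw.dl-rgw-pub01.asok config set debug_ms 1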

#12 Updated by David Galloway about 1 year ago

Attached

#13 Updated by Abhishek Lekshmanan 12 months ago

I was able to log in to the node and examine the backtrace. The crash arose from FD_SET being passed a fileno > 1024, which was the configured nofile ulimit on the system; increasing the ulimit should hopefully solve the issue.
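
For context: glibc's fortified FD_SET calls __fortify_fail() and aborts when handed a descriptor at or above FD_SETSIZE (1024), which matches frames 6-8 of the backtrace above. A minimal sketch of the failure mode, assuming a fortified glibc build; this is illustrative C, not ceph code:

/* Build: gcc -O2 -D_FORTIFY_SOURCE=2 fdset_abort.c */
#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    fd_set rfds;
    FD_ZERO(&rfds);

    /* volatile keeps fd from being a compile-time constant, so the
     * check happens at runtime: any fd >= FD_SETSIZE trips
     * __fdelt_chk(), which calls __fortify_fail() and raises SIGABRT. */
    volatile int fd = FD_SETSIZE;   /* 1024 on Linux/glibc */
    printf("FD_SET on fd %d (FD_SETSIZE = %d)\n", fd, FD_SETSIZE);
    FD_SET(fd, &rfds);              /* *** Caught signal (Aborted) *** */
    return 0;
}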

#14 Updated by Abhishek Lekshmanan 12 months ago

  • Status changed from Triaged to Need Review


#16 Updated by David Galloway 12 months ago

No dice, even with the following set:

root@dl-rgw-pub01:~# ps -u ceph
    PID TTY          TIME CMD
2787067 ?        00:02:01 radosgw

root@dl-rgw-pub01:~# grep 'open files' /proc/2787067/limits
Max open files            1048576              1048576              files     

#17 Updated by Kevin McCarthy 11 months ago

I am seeing the exact same bug:

in thread 7f5644c45700 thread_name:civetweb-worker
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
1: (()+0x281670) [0x55fd3e88f670]
2: (()+0xf6d0) [0x7f577208c6d0]
3: (gsignal()+0x37) [0x7f57664f0277]
4: (abort()+0x148) [0x7f57664f1968]
5: (()+0x78d37) [0x7f5766532d37]
6: (__fortify_fail()+0x37) [0x7f57665d26e7]
7: (()+0x116862) [0x7f57665d0862]
8: (()+0x118647) [0x7f57665d2647]
9: (()+0x2655da) [0x55fd3e8735da]
10: (mg_write()+0xc1) [0x55fd3e87a521]
11: (RGWCivetWeb::write_data(char const*, unsigned long)+0x40) [0x55fd3e885ae0]
12: (rgw::io::BufferingFilter<rgw::io::ChunkingFilter<rgw::io::ConLenControllingFilter<RGWCivetWeb*> > >::send_body(char const*, unsigned long)+0xf8) [0x55fd3e812f78]
13: (rgw::io::AccountingFilter<rgw::io::RestfulClient*>::send_body(char const*, unsigned long)+0x15) [0x55fd3e80fe05]
14: (dump_body(req_state*, char const*, unsigned long)+0x43) [0x55fd3ea22bd3]
15: (RGWGetObj_ObjStore_S3::send_response_data(ceph::buffer::list&, long, long)+0x280) [0x55fd3ea7a2a0]
16: (RGWGetObj::get_data_cb(ceph::buffer::list&, long, long)+0x67) [0x55fd3e94ffb7]
17: (RGWRados::get_obj_iterate_cb(RGWObjectCtx*, RGWObjState*, RGWBucketInfo const&, rgw_obj const&, rgw_raw_obj const&, long, long, long, bool, void*)+0xea) [0x55fd3e9d4d9a]
18: (()+0x3c7596) [0x55fd3e9d5596]
19: (RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(RGWBucketInfo const&, rgw_obj const&, rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)+0x3f9) [0x55fd3e9d7f69]
20: (RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)+0x2a1) [0x55fd3e9d83c1]
21: (RGWGetObj::execute()+0xc5c) [0x55fd3e976ebc]
22: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)+0x1cd) [0x55fd3e9903ad]
23: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, int*)+0xe4c) [0x55fd3e9912fc]
24: (RGWCivetWebFrontend::process(mg_connection*)+0x37c) [0x55fd3e810f9c]
25: (()+0x27249e) [0x55fd3e88049e]
26: (()+0x274188) [0x55fd3e882188]
27: (()+0x274628) [0x55fd3e882628]
28: (()+0x7e25) [0x7f5772084e25]
29: (clone()+0x6d) [0x7f57665b8bad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Any resolution?

#18 Updated by Kevin McCarthy 11 months ago

Browsing buckets with large object/file counts seems to trigger this. I increased NOFILE to 10x and it did not help.

#19 Updated by Abhishek Lekshmanan 11 months ago

I was able to trigger this and have a fix staged at https://github.com/ceph/civetweb/pull/29
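
The select()/FD_SET path is what imposes the 1024-descriptor ceiling; the usual remedy is poll(), which takes an array of struct pollfd and has no such cap. An illustrative sketch of that pattern in C, not necessarily what the linked PR does:

#include <poll.h>

/* Wait until fd is writable or timeout_ms elapses; safe for any
 * descriptor number, including fd > 1024 on a busy gateway.
 * Returns >0 if ready, 0 on timeout, <0 on error. */
static int wait_writable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
    return poll(&pfd, 1, timeout_ms);
}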

#20 Updated by Abhishek Lekshmanan 11 months ago

Kevin, for now, if you're blocked, changing the rgw_frontends string to use beast instead of civetweb will get you moving until this patch is backported to mimic.
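
Applied to the config from comment #4, that means swapping the civetweb frontends line for something like the following; the exact beast options supported vary by release, so treat this as a sketch:

[client.radosgw.dl-rgw-pub01]
  # replaces: rgw frontends = civetweb port=[::]:7480 ...
  rgw frontends = beast port=7480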

#21 Updated by Abhishek Lekshmanan 11 months ago

#22 Updated by Kevin McCarthy 11 months ago

Thank you very much for the quick fix. Using beast seems to work for now, and I will change back to civetweb once this patch is GA.

#23 Updated by Nathan Cutler 11 months ago

  • Status changed from Need Review to Resolved
  • Backport set to mimic
