Bug #24364

rgw: s3cmd sync fails

Added by David Galloway over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
High
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I'm trying to sync download.ceph.com packages to the Sepia Long Running Cluster, and it's failing after upgrading to mimic (13.1.0-162-g233ee97-1xenial).

I've attached the rgw log. Let me know if you need ssh access to the rgw nodes.

s3cmd output:

WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 3 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 6 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 9 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 12 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 15 sec...
ERROR: S3 Temporary Error: Request failed for: /.  Please try again later.

RGW output:

    -1> 2018-05-31 11:03:22.118 7fe53e23b700  2 req 16:0.087059:s3:GET /:list_bucket:completing
     0> 2018-05-31 11:03:22.142 7fe53e23b700 -1 *** Caught signal (Aborted) **
 in thread 7fe53e23b700 thread_name:civetweb-worker

 ceph version 13.1.0-162-g233ee97 (233ee9764e2d1702dde72165d9660fec89ab56fc) mimic (rc)
 1: (()+0x291290) [0x5615a7344290]
 2: (()+0x11390) [0x7fe571475390]
 3: (gsignal()+0x38) [0x7fe566e16428]
 4: (abort()+0x16a) [0x7fe566e1802a]
 5: (()+0x777ea) [0x7fe566e587ea]
 6: (__fortify_fail()+0x5c) [0x7fe566efa15c]
 7: (()+0x117160) [0x7fe566ef8160]
 8: (()+0x1190a7) [0x7fe566efa0a7]
 9: (()+0x27564a) [0x5615a732864a]
 10: (mg_write()+0xc1) [0x5615a732f5a1]
 11: (RGWCivetWeb::write_data(char const*, unsigned long)+0x40) [0x5615a733ab20]
 12: (rgw::io::BufferingFilter<rgw::io::ChunkingFilter<rgw::io::ConLenControllingFilter<RGWCivetWeb*> > >::complete_request()+0x133) [0x5615a72c8a93]
 13: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, int*)+0xfc3) [0x5615a7451643]
 14: (RGWCivetWebFrontend::process(mg_connection*)+0x381) [0x5615a72c6271]
 15: (()+0x28251e) [0x5615a733551e]
 16: (()+0x284208) [0x5615a7337208]
 17: (()+0x2846a8) [0x5615a73376a8]
 18: (()+0x76ba) [0x7fe57146b6ba]
 19: (clone()+0x6d) [0x7fe566ee841d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/radosgw.dl-rgw-pub01.log
--- end dump of recent events ---

dl-mirror-dump (656 KB) David Galloway, 05/31/2018 03:07 PM

debug.log.xz (71.9 KB) David Galloway, 07/26/2018 07:41 PM


Related issues

Copied to rgw - Backport #35954: mimic: rgw: s3cmd sync fails Resolved

History

#1 Updated by David Galloway over 1 year ago

  • Description updated (diff)

#2 Updated by David Galloway over 1 year ago

  • Description updated (diff)

#3 Updated by Abhishek Lekshmanan over 1 year ago

Can you share the ceph.conf and s3cmd.cfg?

#4 Updated by David Galloway over 1 year ago

# cat .s3cfg
[default]
access_key = XXXXXXXXXXXX
access_token = 
add_encoding_exts = 
add_headers = 
bucket_location = US
ca_certs_file = 
cache_file = 
check_ssl_certificate = True
check_ssl_hostname = True
#cloudfront_host = cloudfront.amazonaws.com
default_mime_type = binary/octet-stream
delay_updates = False
delete_after = False
delete_after_fetch = False
delete_removed = False
dry_run = False
enable_multipart = True
encoding = UTF-8
encrypt = False
expiry_date = 
expiry_days = 
expiry_prefix = 
follow_symlinks = False
force = False
get_continue = False
gpg_command = /usr/bin/gpg
gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_passphrase = 
guess_mime_type = True
host_base = rdu.download.ceph.com
host_bucket = %(bucket)s.rdu.download.ceph.com
human_readable_sizes = False
invalidate_default_index_on_cf = False
invalidate_default_index_root_on_cf = True
invalidate_on_cf = False
kms_key = 
limitrate = 0
list_md5 = False
log_target_prefix = 
long_listing = False
max_delete = -1
mime_type = 
multipart_chunk_size_mb = 15
multipart_max_chunks = 10000
preserve_attrs = True
progress_meter = True
proxy_host = 
proxy_port = 0
put_continue = False
recursive = False
recv_chunk = 65536
reduced_redundancy = False
requester_pays = False
restore_days = 1
secret_key = XXXXXXXXXX
send_chunk = 65536
server_side_encryption = False
signature_v2 = False
simpledb_host = sdb.amazonaws.com
skip_existing = False
socket_timeout = 300
stats = False
stop_on_error = False
storage_class = 
urlencoding_mode = normal
use_https = True
#use_https = False
use_mime_magic = True
verbosity = WARNING
#website_endpoint = http://%(bucket).rdu-website.download.ceph.com/
#website_endpoint = http://rdu-website.download.ceph.com/
website_error = 
website_index = index.html
# cat /etc/ceph/ceph.conf 

<snip>

[client.radosgw.dl-rgw-pub01]
  log file = /var/log/ceph/$id.log
  rgw dns name = rdu.download.ceph.com
  rgw dns s3website name = rdu-website.download.ceph.com
  rgw thread pool size = 1024
  rgw enable ops log = false
  ms dispatch throttle bytes = 10737418240
  rgw cache lru size = 10000
  admin socket = /var/run/ceph/ceph-$name.asok
  objecter inflight op bytes = 10737418240
  objecter inflight ops = 4096
  rgw frontends = civetweb port=[::]:7480 access_log_file=/var/log/civetweb/access.log error_log_file=/var/log/civetweb/error.log request_timeout_ms=300000
  admin socket = /var/run/ceph/ceph-$name.asok
  rgw cache enabled = 1
  rgw resolve cname = true
  rgw enable usage log = true
  rgw expose bucket = true
  rgw zonegroup = default
  rgw_enable_apis = s3, swift, swift_auth, admin # no Website config
  rgw zone = default
  rgw enable static website = true
  rgw_max_chunk_size = 4194304

#5 Updated by Yehuda Sadeh over 1 year ago

  • Status changed from New to Triaged
  • Assignee set to Abhishek Lekshmanan
  • Priority changed from Normal to High

#6 Updated by Yehuda Sadeh over 1 year ago

  • Subject changed from mimic: s3cmd sync fails to rgw: s3cmd sync fails

#7 Updated by Abhishek Lekshmanan over 1 year ago

Is this still reproducible in the setup? I tried reproducing this locally, but wasn't successful.

#8 Updated by David Galloway over 1 year ago

I can't reproduce using a different folder from my workstation but I do still get the same error when trying to sync from download.ceph.com to the RGW. Are there more logs or debug output I could gather for you?

#9 Updated by Abhishek Lekshmanan over 1 year ago

Is this reproducible still with latest mimic (13.2.0)?

#10 Updated by David Galloway over 1 year ago

I updated the RGWs to 13.2.0 and am still seeing the same issue.

#11 Updated by Abhishek Lekshmanan over 1 year ago

Can we have this reproduced with debug rgw = 20 and debug ms = 1?

#12 Updated by David Galloway over 1 year ago

Attached

#13 Updated by Abhishek Lekshmanan over 1 year ago

I was able to log in to the node and see the backtrace. The abort arose from FD_SET being passed a fileno > 1024, which was the configured nofile ulimit on the system; increasing the ulimit should hopefully solve the issue.

#14 Updated by Abhishek Lekshmanan over 1 year ago

  • Status changed from Triaged to Fix Under Review

#15 Updated by Abhishek Lekshmanan over 1 year ago

I was able to log in to the node and see the backtrace. The abort arose from FD_SET being passed a fileno > 1024, which was the configured nofile ulimit on the system; increasing the ulimit should hopefully solve the issue.

#16 Updated by David Galloway over 1 year ago

No dice even with the following set

root@dl-rgw-pub01:~# ps -u ceph
    PID TTY          TIME CMD
2787067 ?        00:02:01 radosgw

root@dl-rgw-pub01:~# grep 'open files' /proc/2787067/limits
Max open files            1048576              1048576              files     
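For reference, the usual way to pin a higher nofile limit for a systemd-managed radosgw is a drop-in override (unit and file names assumed). As the limits shown above suggest, though, this alone cannot fix this crash: FD_SETSIZE is a compile-time constant, so raising the descriptor limit only makes it more likely a descriptor exceeds 1024.

```ini
# /etc/systemd/system/ceph-radosgw@.service.d/limits.conf (path assumed)
[Service]
LimitNOFILE=1048576
```

Applied with `systemctl daemon-reload` followed by a restart of the radosgw unit.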

#17 Updated by Kevin McCarthy about 1 year ago

I am seeing the exact same bug:

in thread 7f5644c45700 thread_name:civetweb-worker
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
1: (()+0x281670) [0x55fd3e88f670]
2: (()+0xf6d0) [0x7f577208c6d0]
3: (gsignal()+0x37) [0x7f57664f0277]
4: (abort()+0x148) [0x7f57664f1968]
5: (()+0x78d37) [0x7f5766532d37]
6: (__fortify_fail()+0x37) [0x7f57665d26e7]
7: (()+0x116862) [0x7f57665d0862]
8: (()+0x118647) [0x7f57665d2647]
9: (()+0x2655da) [0x55fd3e8735da]
10: (mg_write()+0xc1) [0x55fd3e87a521]
11: (RGWCivetWeb::write_data(char const*, unsigned long)+0x40) [0x55fd3e885ae0]
12: (rgw::io::BufferingFilter<rgw::io::ChunkingFilter<rgw::io::ConLenControllingFilter<RGWCivetWeb*> > >::send_body(char const*, unsigned long)+0xf8) [0x55fd3e812f78]
13: (rgw::io::AccountingFilter<rgw::io::RestfulClient*>::send_body(char const*, unsigned long)+0x15) [0x55fd3e80fe05]
14: (dump_body(req_state*, char const*, unsigned long)+0x43) [0x55fd3ea22bd3]
15: (RGWGetObj_ObjStore_S3::send_response_data(ceph::buffer::list&, long, long)+0x280) [0x55fd3ea7a2a0]
16: (RGWGetObj::get_data_cb(ceph::buffer::list&, long, long)+0x67) [0x55fd3e94ffb7]
17: (RGWRados::get_obj_iterate_cb(RGWObjectCtx*, RGWObjState*, RGWBucketInfo const&, rgw_obj const&, rgw_raw_obj const&, long, long, long, bool, void*)+0xea) [0x55fd3e9d4d9a]
18: (()+0x3c7596) [0x55fd3e9d5596]
19: (RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(RGWBucketInfo const&, rgw_obj const&, rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)+0x3f9) [0x55fd3e9d7f69]
20: (RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)+0x2a1) [0x55fd3e9d83c1]
21: (RGWGetObj::execute()+0xc5c) [0x55fd3e976ebc]
22: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)+0x1cd) [0x55fd3e9903ad]
23: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, int*)+0xe4c) [0x55fd3e9912fc]
24: (RGWCivetWebFrontend::process(mg_connection*)+0x37c) [0x55fd3e810f9c]
25: (()+0x27249e) [0x55fd3e88049e]
26: (()+0x274188) [0x55fd3e882188]
27: (()+0x274628) [0x55fd3e882628]
28: (()+0x7e25) [0x7f5772084e25]
29: (clone()+0x6d) [0x7f57665b8bad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Any resolution?

#18 Updated by Kevin McCarthy about 1 year ago

Browsing buckets with large object/file counts seems to trigger this. I raised NOFILE to 10x and it did not help.

#19 Updated by Abhishek Lekshmanan about 1 year ago

Was able to trigger this; a fix is staged at https://github.com/ceph/civetweb/pull/29

#20 Updated by Abhishek Lekshmanan about 1 year ago

Kevin, for now, if you're blocked, changing the rgw_frontends string to use beast instead of civetweb will get you moving until this patch is backported to mimic.
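A minimal sketch of that change in ceph.conf, keeping the port from the config posted above (beast's extra options differ from civetweb's, so the access/error log settings are dropped here):

```ini
[client.radosgw.dl-rgw-pub01]
  # was: rgw frontends = civetweb port=[::]:7480 access_log_file=... error_log_file=... request_timeout_ms=300000
  rgw frontends = beast port=7480
```

Beast uses non-blocking sockets rather than `select()`, so it is not subject to the FD_SETSIZE limit behind this crash.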

#21 Updated by Abhishek Lekshmanan about 1 year ago

#22 Updated by Kevin McCarthy about 1 year ago

Thank you very much for the quick fix. Using beast seems to work for now, and I will change back to civetweb once this patch is GA.

#23 Updated by Nathan Cutler about 1 year ago

  • Status changed from Fix Under Review to Resolved
  • Backport set to mimic
