Bug #24364
rgw: s3cmd sync fails
Status: Closed
Description
I'm trying to sync download.ceph.com packages to the Sepia Long Running Cluster and it's failing after upgrading to mimic (13.1.0-162-g233ee97-1xenial).
I've attached the rgw log. Let me know if you need ssh access to the rgw nodes.
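For reference, the failing operation is an s3cmd sync of the mirrored tree into the RGW; the local path and bucket name below are placeholders, not taken from this ticket:

# hypothetical invocation; the actual path and bucket are not recorded here
s3cmd -c ~/.s3cfg sync /data/download.ceph.com/ s3://download/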
s3cmd output:
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 3 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 6 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 9 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 12 sec...
WARNING: Retrying failed request: / (IncompleteRead(85432 bytes read, 260881 more expected))
WARNING: Waiting 15 sec...
ERROR: S3 Temporary Error: Request failed for: /. Please try again later.
RGW output:
    -1> 2018-05-31 11:03:22.118 7fe53e23b700  2 req 16:0.087059:s3:GET /:list_bucket:completing
     0> 2018-05-31 11:03:22.142 7fe53e23b700 -1 *** Caught signal (Aborted) **
 in thread 7fe53e23b700 thread_name:civetweb-worker

 ceph version 13.1.0-162-g233ee97 (233ee9764e2d1702dde72165d9660fec89ab56fc) mimic (rc)
 1: (()+0x291290) [0x5615a7344290]
 2: (()+0x11390) [0x7fe571475390]
 3: (gsignal()+0x38) [0x7fe566e16428]
 4: (abort()+0x16a) [0x7fe566e1802a]
 5: (()+0x777ea) [0x7fe566e587ea]
 6: (__fortify_fail()+0x5c) [0x7fe566efa15c]
 7: (()+0x117160) [0x7fe566ef8160]
 8: (()+0x1190a7) [0x7fe566efa0a7]
 9: (()+0x27564a) [0x5615a732864a]
 10: (mg_write()+0xc1) [0x5615a732f5a1]
 11: (RGWCivetWeb::write_data(char const*, unsigned long)+0x40) [0x5615a733ab20]
 12: (rgw::io::BufferingFilter<rgw::io::ChunkingFilter<rgw::io::ConLenControllingFilter<RGWCivetWeb*> > >::complete_request()+0x133) [0x5615a72c8a93]
 13: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, int*)+0xfc3) [0x5615a7451643]
 14: (RGWCivetWebFrontend::process(mg_connection*)+0x381) [0x5615a72c6271]
 15: (()+0x28251e) [0x5615a733551e]
 16: (()+0x284208) [0x5615a7337208]
 17: (()+0x2846a8) [0x5615a73376a8]
 18: (()+0x76ba) [0x7fe57146b6ba]
 19: (clone()+0x6d) [0x7fe566ee841d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/radosgw.dl-rgw-pub01.log
--- end dump of recent events ---
Updated by Abhishek Lekshmanan almost 6 years ago
Can you share the ceph.conf and s3cmd.cfg?
Updated by David Galloway almost 6 years ago
# cat .s3cfg
[default]
access_key = XXXXXXXXXXXX
access_token =
add_encoding_exts =
add_headers =
bucket_location = US
ca_certs_file =
cache_file =
check_ssl_certificate = True
check_ssl_hostname = True
#cloudfront_host = cloudfront.amazonaws.com
default_mime_type = binary/octet-stream
delay_updates = False
delete_after = False
delete_after_fetch = False
delete_removed = False
dry_run = False
enable_multipart = True
encoding = UTF-8
encrypt = False
expiry_date =
expiry_days =
expiry_prefix =
follow_symlinks = False
force = False
get_continue = False
gpg_command = /usr/bin/gpg
gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_passphrase =
guess_mime_type = True
host_base = rdu.download.ceph.com
host_bucket = %(bucket)s.rdu.download.ceph.com
human_readable_sizes = False
invalidate_default_index_on_cf = False
invalidate_default_index_root_on_cf = True
invalidate_on_cf = False
kms_key =
limitrate = 0
list_md5 = False
log_target_prefix =
long_listing = False
max_delete = -1
mime_type =
multipart_chunk_size_mb = 15
multipart_max_chunks = 10000
preserve_attrs = True
progress_meter = True
proxy_host =
proxy_port = 0
put_continue = False
recursive = False
recv_chunk = 65536
reduced_redundancy = False
requester_pays = False
restore_days = 1
secret_key = XXXXXXXXXX
send_chunk = 65536
server_side_encryption = False
signature_v2 = False
simpledb_host = sdb.amazonaws.com
skip_existing = False
socket_timeout = 300
stats = False
stop_on_error = False
storage_class =
urlencoding_mode = normal
use_https = True
#use_https = False
use_mime_magic = True
verbosity = WARNING
#website_endpoint = http://%(bucket).rdu-website.download.ceph.com/
#website_endpoint = http://rdu-website.download.ceph.com/
website_error =
website_index = index.html
# cat /etc/ceph/ceph.conf
<snip>
[client.radosgw.dl-rgw-pub01]
log file = /var/log/ceph/$id.log
rgw dns name = rdu.download.ceph.com
rgw dns s3website name = rdu-website.download.ceph.com
rgw thread pool size = 1024
rgw enable ops log = false
ms dispatch throttle bytes = 10737418240
rgw cache lru size = 10000
admin socket = /var/run/ceph/ceph-$name.asok
objecter inflight op bytes = 10737418240
objecter inflight ops = 4096
rgw frontends = civetweb port=[::]:7480 access_log_file=/var/log/civetweb/access.log error_log_file=/var/log/civetweb/error.log request_timeout_ms=300000
admin socket = /var/run/ceph/ceph-$name.asok
rgw cache enabled = 1
rgw resolve cname = true
rgw enable usage log = true
rgw expose bucket = true
rgw zonegroup = default
rgw_enable_apis = s3, swift, swift_auth, admin
# no Website config
rgw zone = default
rgw enable static website = true
rgw_max_chunk_size = 4194304
Updated by Yehuda Sadeh almost 6 years ago
- Status changed from New to Triaged
- Assignee set to Abhishek Lekshmanan
- Priority changed from Normal to High
Updated by Yehuda Sadeh almost 6 years ago
- Subject changed from mimic: s3cmd sync fails to rgw: s3cmd sync fails
Updated by Abhishek Lekshmanan almost 6 years ago
Is this still reproducible in the setup? I tried reproducing this locally, but I was not successful with that.
Updated by David Galloway almost 6 years ago
I can't reproduce it using a different folder from my workstation, but I do still get the same error when trying to sync from download.ceph.com to the RGW. Are there more logs or debug output I could gather for you?
Updated by Abhishek Lekshmanan almost 6 years ago
Is this reproducible still with latest mimic (13.2.0)?
Updated by David Galloway almost 6 years ago
I updated the RGWs to 13.2.0 and am still seeing the same issue.
Updated by Abhishek Lekshmanan over 5 years ago
Can we have this reproduced at debug rgw=20 and debug ms=1?
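As a sketch of one way to do that on the running daemon, using the admin socket path derived from the ceph.conf pasted above (the socket name is an assumption based on that config):

# raise the debug levels at runtime via the admin socket
ceph daemon /var/run/ceph/ceph-client.radosgw.dl-rgw-pub01.asok config set debug_rgw 20
ceph daemon /var/run/ceph/ceph-client.radosgw.dl-rgw-pub01.asok config set debug_ms 1

Alternatively, the same levels can be set persistently as "debug rgw = 20" and "debug ms = 1" under the [client.radosgw.dl-rgw-pub01] section and the gateway restarted.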
Updated by Abhishek Lekshmanan over 5 years ago
I was able to log in to the node and look at the backtrace. The abort itself arose from FD_SET, where a fileno > 1024 was passed; 1024 was the configured nofile ulimit on the system. Increasing the ulimit should hopefully solve the issue.
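To illustrate the failure mode, here is a minimal C sketch (not Ceph or civetweb code) of what the backtrace points at: glibc's fortified FD_SET range-checks the descriptor against the compile-time constant FD_SETSIZE (1024 on glibc) and aborts through __fortify_fail when the check fails.

/* Minimal sketch, not Ceph code: reproduce the fortified FD_SET abort.
 * Build with: gcc -O2 -D_FORTIFY_SOURCE=2 fdset_abort.c
 * Run with a nofile ulimit high enough for the dup2() below to succeed. */
#include <sys/select.h>
#include <unistd.h>

int main(void) {
    fd_set rfds;
    FD_ZERO(&rfds);

    /* Duplicate stdin onto a descriptor number above FD_SETSIZE (1024). */
    int high_fd = dup2(STDIN_FILENO, FD_SETSIZE + 10);
    if (high_fd < 0)
        return 1; /* dup2 fails if the nofile limit is at or below the target fd */

    /* With _FORTIFY_SOURCE enabled, glibc's FD_SET rejects descriptors at or
     * above FD_SETSIZE and aborts via __fortify_fail -- the same frame seen
     * in the civetweb-worker backtraces above. */
    FD_SET(high_fd, &rfds);
    return 0;
}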
Updated by Abhishek Lekshmanan over 5 years ago
- Status changed from Triaged to Fix Under Review
Updated by David Galloway over 5 years ago
No dice, even with the following set:
root@dl-rgw-pub01:~# ps -u ceph
    PID TTY          TIME CMD
2787067 ?        00:02:01 radosgw
root@dl-rgw-pub01:~# grep 'open files' /proc/2787067/limits
Max open files            1048576              1048576              files
Updated by Kevin McCarthy over 5 years ago
I am seeing the exact same bug:
in thread 7f5644c45700 thread_name:civetweb-worker
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
1: (()+0x281670) [0x55fd3e88f670]
2: (()+0xf6d0) [0x7f577208c6d0]
3: (gsignal()+0x37) [0x7f57664f0277]
4: (abort()+0x148) [0x7f57664f1968]
5: (()+0x78d37) [0x7f5766532d37]
6: (__fortify_fail()+0x37) [0x7f57665d26e7]
7: (()+0x116862) [0x7f57665d0862]
8: (()+0x118647) [0x7f57665d2647]
9: (()+0x2655da) [0x55fd3e8735da]
10: (mg_write()+0xc1) [0x55fd3e87a521]
11: (RGWCivetWeb::write_data(char const*, unsigned long)+0x40) [0x55fd3e885ae0]
12: (rgw::io::BufferingFilter<rgw::io::ChunkingFilter<rgw::io::ConLenControllingFilter<RGWCivetWeb*> > >::send_body(char const*, unsigned long)+0xf8) [0x55fd3e812f78]
13: (rgw::io::AccountingFilter<rgw::io::RestfulClient*>::send_body(char const*, unsigned long)+0x15) [0x55fd3e80fe05]
14: (dump_body(req_state*, char const*, unsigned long)+0x43) [0x55fd3ea22bd3]
15: (RGWGetObj_ObjStore_S3::send_response_data(ceph::buffer::list&, long, long)+0x280) [0x55fd3ea7a2a0]
16: (RGWGetObj::get_data_cb(ceph::buffer::list&, long, long)+0x67) [0x55fd3e94ffb7]
17: (RGWRados::get_obj_iterate_cb(RGWObjectCtx*, RGWObjState*, RGWBucketInfo const&, rgw_obj const&, rgw_raw_obj const&, long, long, long, bool, void*)+0xea) [0x55fd3e9d4d9a]
18: (()+0x3c7596) [0x55fd3e9d5596]
19: (RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(RGWBucketInfo const&, rgw_obj const&, rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)+0x3f9) [0x55fd3e9d7f69]
20: (RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)+0x2a1) [0x55fd3e9d83c1]
21: (RGWGetObj::execute()+0xc5c) [0x55fd3e976ebc]
22: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)+0x1cd) [0x55fd3e9903ad]
23: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, int*)+0xe4c) [0x55fd3e9912fc]
24: (RGWCivetWebFrontend::process(mg_connection*)+0x37c) [0x55fd3e810f9c]
25: (()+0x27249e) [0x55fd3e88049e]
26: (()+0x274188) [0x55fd3e882188]
27: (()+0x274628) [0x55fd3e882628]
28: (()+0x7e25) [0x7f5772084e25]
29: (clone()+0x6d) [0x7f57665b8bad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Any resolution?
Updated by Kevin McCarthy over 5 years ago
Browsing buckets with large object/file counts seems to trigger this. I increased the NOFILE limit to 10x and it did not help.
Updated by Abhishek Lekshmanan over 5 years ago
I was able to trigger this and have a fix staged at https://github.com/ceph/civetweb/pull/29
Updated by Abhishek Lekshmanan over 5 years ago
Kevin, for now, if you're blocked, changing the rgw_frontends string to use beast instead of civetweb will get you moving until this patch is backported to mimic.
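As a sketch only, the frontend line in ceph.conf would change along these lines (the port here is an assumption, reusing the 7480 from the config above; beast also accepts an endpoint option for binding a specific address):

[client.radosgw.dl-rgw-pub01]
rgw frontends = beast port=7480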
Updated by Abhishek Lekshmanan over 5 years ago
- Copied to Backport #35954: mimic: rgw: s3cmd sync fails added
Updated by Kevin McCarthy over 5 years ago
Thank you very much for the quick fix. Using beast seems to work for now, and I will change back to civetweb once this patch is GA.
Updated by Nathan Cutler over 5 years ago
- Status changed from Fix Under Review to Resolved
- Backport set to mimic