Backport #23237

Corrupted downloads from civetweb when using multipart with slow connections

Added by Mustafa Muhammad almost 2 years ago. Updated 6 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Release: luminous
Crash signature:

Description

[backport pr https://github.com/ceph/ceph/pull/27982]

We have a problem with civetweb downloads. Our clients have slow connections and download large files using multi-segment downloads (download managers). With fastcgi everything works fine; with civetweb, many downloads are corrupted.
We were able to reproduce the problem using these steps:

The RGW uses "rgw_frontends = fastcgi, civetweb port=7480" and nginx listens on port 8080 with "fastcgi_pass unix:/run/ceph/ceph.radosgw.gateway.fastcgi.sock"

Inside a test dir, run these commands

1) Download using 8 segments, with 350 KB/s speed limit, using fastcgi with nginx, no problems (all MD5s match)

for i in {1..10}; do aria2c --max-overall-download-limit=350K -s 8 -x 8 -o test$i "http://192.168.217.234:8080/mustafa-test/CentOS-7-x86_64-NetInstall-1708.iso?AWSAccessKeyId=ZTCEWB8UCH1HR47ZQJYT&Expires=1521316274&Signature=XL4E4s8%2F%2FUiL6mFPnby%2BAyokvIk%3D" && md5sum test$i >> testmd5 & done

cat testmd5
8fc7bb301bccbf84030ff4faf3a4bbb5 test8
8fc7bb301bccbf84030ff4faf3a4bbb5 test9
8fc7bb301bccbf84030ff4faf3a4bbb5 test2
8fc7bb301bccbf84030ff4faf3a4bbb5 test1
8fc7bb301bccbf84030ff4faf3a4bbb5 test4
8fc7bb301bccbf84030ff4faf3a4bbb5 test10
8fc7bb301bccbf84030ff4faf3a4bbb5 test5
8fc7bb301bccbf84030ff4faf3a4bbb5 test6
8fc7bb301bccbf84030ff4faf3a4bbb5 test7
8fc7bb301bccbf84030ff4faf3a4bbb5 test3

2) Download using 1 segment, with 350 KB/s speed limit, using civetweb, no problems (all MD5s match)

rm -f test*; for i in {1..10}; do aria2c --max-overall-download-limit=350K -o test$i "http://192.168.217.234:7480/mustafa-test/CentOS-7-x86_64-NetInstall-1708.iso?AWSAccessKeyId=ZTCEWB8UCH1HR47ZQJYT&Expires=1521316274&Signature=XL4E4s8%2F%2FUiL6mFPnby%2BAyokvIk%3D" && md5sum test$i >> testmd5 & done

cat testmd5
8fc7bb301bccbf84030ff4faf3a4bbb5 test6
8fc7bb301bccbf84030ff4faf3a4bbb5 test7
8fc7bb301bccbf84030ff4faf3a4bbb5 test3
8fc7bb301bccbf84030ff4faf3a4bbb5 test2
8fc7bb301bccbf84030ff4faf3a4bbb5 test9
8fc7bb301bccbf84030ff4faf3a4bbb5 test10
8fc7bb301bccbf84030ff4faf3a4bbb5 test4
8fc7bb301bccbf84030ff4faf3a4bbb5 test8
8fc7bb301bccbf84030ff4faf3a4bbb5 test1
8fc7bb301bccbf84030ff4faf3a4bbb5 test5

3) Download using 8 segments, with 350 KB/s speed limit, using civetweb, corrupted files (different MD5 for each file)

rm -f test*; for i in {1..10}; do aria2c --max-overall-download-limit=350K -s 8 -x 8 -o test$i "http://192.168.217.234:7480/mustafa-test/CentOS-7-x86_64-NetInstall-1708.iso?AWSAccessKeyId=ZTCEWB8UCH1HR47ZQJYT&Expires=1521316274&Signature=XL4E4s8%2F%2FUiL6mFPnby%2BAyokvIk%3D" && md5sum test$i >> testmd5 & done

cat testmd5
50b2bee56e19460e165e24eb7bd490d0 test1
934b1c7e5bdeee64632d3fac89b17226 test8
13655ac02a174fb11eff02a21ee92c4d test9
5053d361eeb569057ea8df191460bf96 test7
aeb85750cec63c16214bc5a732cd3411 test2
120a4e4f77f1994f6939aa722c678749 test5
362f146feb8461deca297f6a6dd5eec0 test6
de77a8d528fece0c48b7a330c2c3dbf5 test4
475bff03d5fd6ddd0acfce3a57f048dd test3
ab87c95fc5e5561786b23ef58ac65efd test10

The corruption starts to appear when a download takes about 10~15 minutes and becomes more frequent as the download time increases. Once downloads take 20~25 minutes (the last case), all of them are corrupt. The resulting files have the correct size but corrupt data.
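
Because the corrupted files keep the correct size, a byte-level comparison against a known-good copy can show how much of a file is damaged and where the damage begins. The following is only a minimal sketch, assuming a good reference copy is available locally (for example, one of the files from the fastcgi run). The function name `diff_summary` is hypothetical and not part of any tool mentioned here.

```shell
#!/usr/bin/env bash
# diff_summary GOOD SUSPECT  (hypothetical helper name)
# Prints "<number of differing bytes> <offset of first difference>".
# Offsets are 1-based, as reported by cmp. Prints "0 -" when the files match.
diff_summary() {
    local good=$1 suspect=$2
    # cmp -l prints one line per differing byte:
    #   <offset> <octal byte in GOOD> <octal byte in SUSPECT>
    cmp -l "$good" "$suspect" 2>/dev/null | awk '
        NR == 1 { first = $1 }
        END { if (NR == 0) print "0 -"; else print NR, first }'
}
```

For example, `diff_summary good.iso test8` (with `good.iso` standing in for the reference copy) reports the damage in `test8`. The first offset can then be compared with the byte ranges the client requested.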

We tested using the publicly available file CentOS-7-x86_64-NetInstall-1708.iso so you can reproduce exactly as we did.
Also attached is the RGW log

rgw.logaa (774 KB) Mustafa Muhammad, 03/06/2018 07:26 AM

rgw.logab (616 KB) Mustafa Muhammad, 03/06/2018 07:26 AM

History

#1 Updated by Orit Wasserman almost 2 years ago

  • Assignee set to Orit Wasserman

#2 Updated by Yehuda Sadeh almost 2 years ago

  • Priority changed from Normal to High

#3 Updated by Orit Wasserman almost 2 years ago

  • Status changed from New to Triaged

#4 Updated by Orit Wasserman over 1 year ago

  • Assignee changed from Orit Wasserman to Mark Kogan

#5 Updated by Konstantin Shalygin about 1 year ago

Can't reproduce with 12.2.10 and CentOS-7-x86_64-NetInstall-1810.iso

d74ea11d73e7183fbbd8dcdc4f1a74a5  test1
d74ea11d73e7183fbbd8dcdc4f1a74a5  test2
d74ea11d73e7183fbbd8dcdc4f1a74a5  test10
d74ea11d73e7183fbbd8dcdc4f1a74a5  test3
d74ea11d73e7183fbbd8dcdc4f1a74a5  test4
d74ea11d73e7183fbbd8dcdc4f1a74a5  test7
d74ea11d73e7183fbbd8dcdc4f1a74a5  test6
d74ea11d73e7183fbbd8dcdc4f1a74a5  test5
d74ea11d73e7183fbbd8dcdc4f1a74a5  test8
d74ea11d73e7183fbbd8dcdc4f1a74a5  test9

(without any nginx'es) rgw's config: "rgw frontends = civetweb port=0.0.0.0:80r+443s enable_keep_alive=yes ssl_protocol_version=4 ssl_certificate=<snip_snap> ssl_cipher_list=ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"

#6 Updated by Mustafa Muhammad about 1 year ago

I just reproduced this with version 12.2.10. Make sure to use the download limit and segmentation I used, so that it is an 8-segment download that takes more than 25 minutes. I also put the RGW under some load (I used goreplay to replay traffic from one of our servers to it).

This time I used:
[root@rgw04 test]# rm -f test*; for i in {1..10}; do aria2c --max-overall-download-limit=350K -s 8 -x 8 -o test$i "http://10.1.25.32:7480/mustafa-test/CentOS-7-x86_64-NetInstall-1810.iso?AWSAccessKeyId=3VZ9NVYGOW358FIF7T05&Expires=1555735033&Signature=r52yDaauHiUsS6fHur6m76FQCfY%3D" && md5sum test$i >> testmd5 & done

[root@rgw04 test]# cat testmd5
d74ea11d73e7183fbbd8dcdc4f1a74a5 test3
d74ea11d73e7183fbbd8dcdc4f1a74a5 test10
d74ea11d73e7183fbbd8dcdc4f1a74a5 test7
d74ea11d73e7183fbbd8dcdc4f1a74a5 test2
d74ea11d73e7183fbbd8dcdc4f1a74a5 test1
d74ea11d73e7183fbbd8dcdc4f1a74a5 test4
2f26170ad562693b0d67ce78fb4a2488 test8
d74ea11d73e7183fbbd8dcdc4f1a74a5 test5
d74ea11d73e7183fbbd8dcdc4f1a74a5 test6
92bd72d17535d5fa4a91cc418ce34840 test9

You can see some are corrupted.
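
With many runs, spotting the odd checksum by eye gets error-prone. As a rough sketch, assuming the `testmd5` format shown above (checksum, whitespace, file name) and that the correct checksum is already known, the mismatching entries can be listed like this. The function name `list_bad_md5` is hypothetical.

```shell
#!/usr/bin/env bash
# list_bad_md5 EXPECTED_MD5 FILE  (hypothetical helper name)
# Prints the file names whose checksum in FILE differs from EXPECTED_MD5.
list_bad_md5() {
    local expected=$1 file=$2
    # Column 1 is the checksum, column 2 the file name.
    awk -v want="$expected" '$1 != want { print $2 }' "$file"
}
```

Run against the listing above as `list_bad_md5 d74ea11d73e7183fbbd8dcdc4f1a74a5 testmd5`, this should print `test8` and `test9`.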

#7 Updated by Casey Bodley 12 months ago

  • Assignee changed from Mark Kogan to Casey Bodley

#8 Updated by Casey Bodley 11 months ago

  • Status changed from Triaged to In Progress

planning to update the civetweb to match what's in mimic, which fixes the keepalive handling. the civetweb branch also needs cve fixes from https://github.com/ceph/ceph/pull/26590

#9 Updated by Casey Bodley 10 months ago

  • Status changed from In Progress to Closed

i see this as a client bug more than anything - aria2c does not retry or reconnect on errors, and reports 'success' even if there were errors. this depends on the server keeping connections alive forever, which civetweb did not do in the version we used for luminous. we've decided that backporting the civetweb changes isn't worth the risk. the mimic and nautilus releases do have the keepalive fixes, and the beast frontend is another option.
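
For clients that cannot be changed, one possible workaround is to wrap the download in a verify-and-retry loop. This is only a sketch of the idea, not something proposed in this ticket. It assumes the expected MD5 is known out of band, and the function name `fetch_verified` is hypothetical.

```shell
#!/usr/bin/env bash
# fetch_verified EXPECTED_MD5 OUTFILE MAX_TRIES CMD [ARGS...]  (hypothetical helper)
# Runs CMD up to MAX_TRIES times and returns 0 as soon as OUTFILE has the
# expected MD5. Returns 1 if no attempt produced a matching file.
fetch_verified() {
    local expected=$1 out=$2 tries=$3
    shift 3
    local attempt actual
    for (( attempt = 1; attempt <= tries; attempt++ )); do
        rm -f -- "$out"              # start each attempt from scratch
        if "$@"; then
            # A zero exit status alone is not trusted; check the data too.
            actual=$(md5sum -- "$out" 2>/dev/null | awk '{ print $1 }')
            [ "$actual" = "$expected" ] && return 0
        fi
    done
    return 1
}
```

Usage with the command from the description would look like `fetch_verified 8fc7bb301bccbf84030ff4faf3a4bbb5 test1 3 aria2c --max-overall-download-limit=350K -s 8 -x 8 -o test1 "$URL"`, where `$URL` stands for the signed object URL.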

#10 Updated by Mustafa Muhammad 10 months ago

aria2 is not the only client facing this problem. Our users use Internet Download Manager and several other Windows download managers, and they face the problem too. This is an LTS release and the issue causes corrupted data for users, so I really think this should be fixed or the fixes backported.

#11 Updated by Mustafa Muhammad 10 months ago

In this RHCS bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1670321
You can see "CivetWeb connections timed out despite the `enable_keep_alive` option enabled. Consequently, S3 clients that did not reconnect or retry were not reliable. With this update, CivetWeb has been updated, and the `enable_keep_alive` option works as expected. As a result, CivetWeb connections no longer time out in this case."

It was clearly a problem with the enable_keep_alive option of civetweb, and the problem was fixed, so please let this fix go to the community edition too.

#12 Updated by Casey Bodley 10 months ago

  • Status changed from Closed to In Progress
  • Target version deleted (v12.2.2)

#13 Updated by Casey Bodley 10 months ago

  • Tracker changed from Bug to Backport
  • Release set to luminous

thanks for the feedback. it sounds like we are planning at least one more luminous release, so i've staged a backport for testing at https://github.com/ceph/ceph/pull/27982

#14 Updated by Casey Bodley 10 months ago

  • Description updated (diff)

#16 Updated by Nathan Cutler 6 months ago

  • Status changed from In Progress to Resolved
  • Target version set to v12.2.13

This update was made using the script "backport-resolve-issue".
backport PR https://github.com/ceph/ceph/pull/27982
merge commit da246edbc62460c349adb76877b9ca1f6611b9b6 (v12.2.12-72-gda246edbc6)
