Bug #8676
md5sum check failed during readwrite.py
Status: Closed
Description
ubuntu@teuthology:/a/teuthology-2014-06-26_07:41:24-rgw-next-testing-basic-plana/327692
2014-06-26T10:29:04.947 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,awoudpindbhywuzluauyr>
2014-06-26T10:29:04.991 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,awxddekzhsdkqwen>
2014-06-26T10:29:05.158 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,bkswcduovun>
2014-06-26T10:29:05.332 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,iftymguhvilngr>
2014-06-26T10:29:05.472 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,oqfipdargbkbqhmlgonfl>
2014-06-26T10:29:05.572 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,qlzaf>
2014-06-26T10:29:05.733 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,rtfltmhvglg>
2014-06-26T10:29:05.842 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,uajlowdmsshitnrmwkde>
2014-06-26T10:29:06.229 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,ykwzii>
2014-06-26T10:29:06.368 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,yriilkltiksmdmyplulfdlnnl>
2014-06-26T10:29:06.660 INFO:teuthology.orchestra.run.plana49.stderr:Traceback (most recent call last):
2014-06-26T10:29:06.660 INFO:teuthology.orchestra.run.plana49.stderr:  File "/home/ubuntu/cephtest/s3-tests/virtualenv/bin/s3tests-test-readwrite", line 9, in <module>
2014-06-26T10:29:06.660 INFO:teuthology.orchestra.run.plana49.stderr:    load_entry_point('s3tests==0.0.1', 'console_scripts', 's3tests-test-readwrite')()
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:  File "/home/ubuntu/cephtest/s3-tests/s3tests/readwrite.py", line 255, in main
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:    trace=temp_dict['error']['traceback'])
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:Exception: exception:
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:    md5sum check failed
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:    None
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:
2014-06-26T10:29:06.662 INFO:teuthology.orchestra.run.plana49.stderr:Exception KeyError: KeyError(35399792,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored
Updated by Sage Weil almost 10 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-07-19_13:59:16-rgw-wip-8701-testing-basic-plana/370152
Updated by Sage Weil almost 10 years ago
ubuntu@teuthology:/a/sage-2014-07-25_15:29:13-rgw-wip-msgr-testing-basic-plana/377978
Updated by Yehuda Sadeh almost 10 years ago
I don't see anything wrong in the logs other than this:
2014-06-26 10:29:04.641601 7ffe417fa700 0 RGWObjManifest::operator++(): result: ofs=1574768 stripe_ofs=1574768 part_ofs=0 rule->part_size=0
2014-06-26 10:29:04.646731 7ffe55fc3700 0 ERROR: s->cio->send_status() returned err=-1
2014-06-26 10:29:04.646751 7ffe55fc3700 0 ERROR: s->cio->print() returned err=-1
2014-06-26 10:29:04.646754 7ffe55fc3700 0 ERROR: s->cio->print() returned err=-1
2014-06-26 10:29:04.646756 7ffe55fc3700 0 ERROR: s->cio->print() returned err=-1
2014-06-26 10:29:04.646757 7ffe55fc3700 0 ERROR: s->cio->complete_header() returned err=-1
2014-06-26 10:29:04.646880 7ffe55fc3700 1 ====== req done req=0x7ffe74057260 http_status=403 ======
but it's hard to tell whether that is the cause or a consequence. It usually means that the client hung up, but the hangup itself might be due to the client having encountered an error earlier.
Updated by Sage Weil almost 10 years ago
ubuntu@teuthology:/a/sage-2014-08-04_11:34:19-rgw-next-testing-basic-vps/397522
Updated by Yehuda Sadeh over 9 years ago
Something I noticed now:
10.214.132.29 l - [26/Jun/2014:10:29:03 -0700] "GET /rwtest/qlzaf HTTP/1.1" 200 1949710 "{Referer}i" "Boto/2.29.1 Python/2.7.3 Linux/3.16.0-rc2-ceph-00016-g2172e93"
10.214.132.29 l - [26/Jun/2014:10:29:03 -0700] "GET /rwtest/qlzaf HTTP/1.1" 200 1949710 "{Referer}i" "Boto/2.29.1 Python/2.7.3 Linux/3.16.0-rc2-ceph-00016-g2172e93"
10.214.132.29 l - [26/Jun/2014:10:29:04 -0700] "GET /rwtest/qlzaf HTTP/1.1" 200 1949710 "{Referer}i" "Boto/2.29.1 Python/2.7.3 Linux/3.16.0-rc2-ceph-00016-g2172e93"
10.214.132.29 l - [26/Jun/2014:10:29:03 -0700] "PUT /rwtest/qlzaf HTTP/1.1" 200 - "{Referer}i" "Boto/2.29.1 Python/2.7.3 Linux/3.16.0-rc2-ceph-00016-g2172e93"
10.214.132.29 l - [26/Jun/2014:10:29:04 -0700] "GET /rwtest/qlzaf HTTP/1.1" 103 24576 "{Referer}i" "Boto/2.29.1 Python/2.7.3 Linux/3.16.0-rc2-ceph-00016-g2172e93"
The client sends get, get, put, get, get. However, the put ends up taking longer to complete, and the get that follows the put's completion fails with 103, which means that the client hung up. This might imply that the client's md5 check is racy: there's a window in which we can't be sure whether we'd get the md5 value of the current object or of the previous one.
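The suspected race can be sketched as follows. This is a hypothetical in-memory store, not actual s3tests or RGW code; the key name `rwtest/qlzaf` is taken from the log above, everything else is illustrative.

```python
import hashlib

# Hypothetical in-memory "store" standing in for the bucket.
store = {}

def put(key, data):
    store[key] = data

def get(key):
    return store[key]

old = b"previous object contents"
new = b"current object contents"

put("rwtest/qlzaf", old)                 # the earlier write has landed
expected = hashlib.md5(new).hexdigest()  # reader expects the newer version

# The concurrent put of `new` has not landed yet, so the get still
# observes `old`, and the md5 comparison fails even though no data
# was corrupted.
got = hashlib.md5(get("rwtest/qlzaf")).hexdigest()
print(got == expected)  # prints False
```

The point of the sketch is that comparing a get's digest against a single expected value is only well-defined when no write is in flight.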
Updated by Yehuda Sadeh over 9 years ago
Well, it seems that the md5 digest is embedded in the object's data, so even if writes are racing (as they are), it shouldn't matter.
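That self-verifying scheme could look like the following sketch (a hypothetical payload layout, not the actual s3tests format): the md5 of the body is stored inside the object itself, so whichever version a racing read returns can be checked against its own embedded digest.

```python
import hashlib
import os

def make_payload(size=64):
    # Hypothetical layout: 32-hex-char md5 header followed by the body.
    body = os.urandom(size)
    return hashlib.md5(body).hexdigest().encode() + body

def verify(payload):
    # Recompute the body's md5 and compare against the embedded header.
    digest, body = payload[:32], payload[32:]
    return hashlib.md5(body).hexdigest().encode() == digest

a = make_payload()  # "previous" object
b = make_payload()  # "current" object
print(verify(a) and verify(b))  # prints True: either version passes on its own
```

Under this scheme a racing read of either version still verifies, which is why the race alone should not explain the failure.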
Updated by Yehuda Sadeh over 9 years ago
After some more research: the 103 is part of the normal shutdown sequence. It might still be a client teardown handling issue, but the logs don't help. The run continues as normal even if something goes wrong in the middle, and we'd only find out about it at the end of the run. Pushed a change that adds some logging during the run in case of an md5 error; we'll need to wait for the next occurrence.
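The added diagnostics could look like this sketch (a hypothetical helper, not the actual change): report the key and both digests at the moment of the mismatch, so the failure can be correlated with the server-side logs instead of surfacing only as "md5sum check failed" at the end of the run.

```python
import hashlib
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("readwrite")

def check_md5(key, expected_hex, data):
    # Log the mismatch in place instead of deferring it to teardown.
    actual = hashlib.md5(data).hexdigest()
    if actual != expected_hex:
        log.error("md5 mismatch on %s: expected %s, got %s (%d bytes)",
                  key, expected_hex, actual, len(data))
        return False
    return True

data = b"object body"
ok = check_md5("rwtest/qlzaf", hashlib.md5(data).hexdigest(), data)
print(ok)  # prints True
```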
Updated by Yehuda Sadeh over 9 years ago
There's a chance this one is the same as #9307.
Updated by Yehuda Sadeh over 9 years ago
- Priority changed from Urgent to High
This might have been fixed; downgrading it for now until that is proved or disproved.
Updated by Sage Weil over 9 years ago
- Status changed from 12 to Duplicate
This appears to be resolved by #9307.