Project

General

Profile

Actions

Bug #8676

closed

md5sum check failed during readwrite.py

Added by Sage Weil almost 10 years ago. Updated over 9 years ago.

Status:
Duplicate
Priority:
High
Assignee:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/teuthology-2014-06-26_07:41:24-rgw-next-testing-basic-plana/327692

2014-06-26T10:29:04.947 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,awoudpindbhywuzluauyr>
2014-06-26T10:29:04.991 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,awxddekzhsdkqwen>
2014-06-26T10:29:05.158 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,bkswcduovun>
2014-06-26T10:29:05.332 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,iftymguhvilngr>
2014-06-26T10:29:05.472 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,oqfipdargbkbqhmlgonfl>
2014-06-26T10:29:05.572 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,qlzaf>
2014-06-26T10:29:05.733 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,rtfltmhvglg>
2014-06-26T10:29:05.842 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,uajlowdmsshitnrmwkde>
2014-06-26T10:29:06.229 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,ykwzii>
2014-06-26T10:29:06.368 INFO:teuthology.orchestra.run.plana49.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,yriilkltiksmdmyplulfdlnnl>
2014-06-26T10:29:06.660 INFO:teuthology.orchestra.run.plana49.stderr:Traceback (most recent call last):
2014-06-26T10:29:06.660 INFO:teuthology.orchestra.run.plana49.stderr:  File "/home/ubuntu/cephtest/s3-tests/virtualenv/bin/s3tests-test-readwrite", line 9, in <module>
2014-06-26T10:29:06.660 INFO:teuthology.orchestra.run.plana49.stderr:    load_entry_point('s3tests==0.0.1', 'console_scripts', 's3tests-test-readwrite')()
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:  File "/home/ubuntu/cephtest/s3-tests/s3tests/readwrite.py", line 255, in main
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:    trace=temp_dict['error']['traceback'])
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:Exception: exception:
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:   md5sum check failed
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:   None
2014-06-26T10:29:06.661 INFO:teuthology.orchestra.run.plana49.stderr:
2014-06-26T10:29:06.662 INFO:teuthology.orchestra.run.plana49.stderr:Exception KeyError: KeyError(35399792,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored
Actions #1

Updated by Sage Weil almost 10 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-07-19_13:59:16-rgw-wip-8701-testing-basic-plana/370152

Actions #2

Updated by Sage Weil almost 10 years ago

ubuntu@teuthology:/a/sage-2014-07-25_15:29:13-rgw-wip-msgr-testing-basic-plana/377978

Actions #3

Updated by Yehuda Sadeh over 9 years ago

I don't see anything wrong in the logs other than this:

2014-06-26 10:29:04.641601 7ffe417fa700  0 RGWObjManifest::operator++(): result: ofs=1574768 stripe_ofs=1574768 part_ofs=0 rule->part_size=0
2014-06-26 10:29:04.646731 7ffe55fc3700  0 ERROR: s->cio->send_status() returned err=-1
2014-06-26 10:29:04.646751 7ffe55fc3700  0 ERROR: s->cio->print() returned err=-1
2014-06-26 10:29:04.646754 7ffe55fc3700  0 ERROR: s->cio->print() returned err=-1
2014-06-26 10:29:04.646756 7ffe55fc3700  0 ERROR: s->cio->print() returned err=-1
2014-06-26 10:29:04.646757 7ffe55fc3700  0 ERROR: s->cio->complete_header() returned err=-1
2014-06-26 10:29:04.646880 7ffe55fc3700  1 ====== req done req=0x7ffe74057260 http_status=403 ======

but it's hard to tell whether it's the cause or the consequence. It usually means that the client hung up, but that might be due to the client having encountered an error previously.

Actions #4

Updated by Sage Weil over 9 years ago

ubuntu@teuthology:/a/sage-2014-08-04_11:34:19-rgw-next-testing-basic-vps/397522

Actions #5

Updated by Yehuda Sadeh over 9 years ago

Something I noticed now:

10.214.132.29 l - [26/Jun/2014:10:29:03 -0700] "GET /rwtest/qlzaf HTTP/1.1" 200 1949710 "{Referer}i" "Boto/2.29.1 Python/2.7.3 Linux/3.16.0-rc2-ceph-00016-g2172e93" 
10.214.132.29 l - [26/Jun/2014:10:29:03 -0700] "GET /rwtest/qlzaf HTTP/1.1" 200 1949710 "{Referer}i" "Boto/2.29.1 Python/2.7.3 Linux/3.16.0-rc2-ceph-00016-g2172e93" 
10.214.132.29 l - [26/Jun/2014:10:29:04 -0700] "GET /rwtest/qlzaf HTTP/1.1" 200 1949710 "{Referer}i" "Boto/2.29.1 Python/2.7.3 Linux/3.16.0-rc2-ceph-00016-g2172e93" 
10.214.132.29 l - [26/Jun/2014:10:29:03 -0700] "PUT /rwtest/qlzaf HTTP/1.1" 200 - "{Referer}i" "Boto/2.29.1 Python/2.7.3 Linux/3.16.0-rc2-ceph-00016-g2172e93" 
10.214.132.29 l - [26/Jun/2014:10:29:04 -0700] "GET /rwtest/qlzaf HTTP/1.1" 103 24576 "{Referer}i" "Boto/2.29.1 Python/2.7.3 Linux/3.16.0-rc2-ceph-00016-g2172e93" 

Client sends get, get, put, get, get. However, put ends up completing longer, and the get after the put completion fails with 103, which means that the client hung up. This might imply that the problem is that the client's md5 check is racy. There's a window in which we can't be sure whether we'd get the md5 value of the current, or of the previous object.

Actions #6

Updated by Yehuda Sadeh over 9 years ago

Well, it seems that the md5 digest is embedded in the object's data, so even if writes are racing (as they are), it shouldn't matter.

Actions #7

Updated by Yehuda Sadeh over 9 years ago

After some more research: The 103 is part of the normal shutdown sequence. Still might be a client teardown handling issue, but the logs don't help. The run would continue as normal, even if there's some issue in the middle and we'd only find out about it at the end of the run. Pushed a change that adds some logging during the run in case of md5 error, we'll need to wait for next occurance.

Actions #8

Updated by Yehuda Sadeh over 9 years ago

There's a chance this one is the same as #9307

Actions #9

Updated by Yehuda Sadeh over 9 years ago

  • Priority changed from Urgent to High

This might have been fixed, downgrading it for now until it's dis/proved.

Actions #10

Updated by Sage Weil over 9 years ago

  • Status changed from 12 to Duplicate

this appears to be resolved by #9307

Actions

Also available in: Atom PDF