Bug #15727
EIO during readwrite test
Status: Closed
Description
2016-05-04 08:08:27.295583 7fb935ff3700 1 civetweb: 0x7fb99c0008e0: 172.21.15.67 - - [04/May/2016:08:08:27 +0000] "PUT /rwtest/sxpbzfkbuplsnbgnck HTTP/1.1" 200 0 - Boto/2.40.0 Python/2.7.6 Linux/3.13.0-85-generic
2016-05-04 08:08:27.299134 7fb931feb700 1 ====== starting new request req=0x7fb931fe57d0 =====
2016-05-04 08:08:27.300350 7fb931feb700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2016-05-04 08:08:27.300369 7fb931feb700 0 ERROR: s->cio->send_content_length() returned err=-5
2016-05-04 08:08:27.300372 7fb931feb700 0 ERROR: s->cio->print() returned err=-5
2016-05-04 08:08:27.300373 7fb931feb700 0 ERROR: STREAM_IO(s)->print() returned err=-5
2016-05-04 08:08:27.300377 7fb931feb700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
teuthology fails like so:
2016-05-04T01:08:29.974 INFO:teuthology.orchestra.run.smithi067.stdout: worker: 1}
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:{bucket: rwtest, duration: 31138897, key: zurtbyfljkhier, start: 1462349138.280538,
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout: type: r, worker: 7}
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:{bucket: rwtest, duration: 16851902, key: abjsffwivpbq, start: 1462349138.30514, type: r,
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout: worker: 1}
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:{bucket: rwtest, duration: 14437199, key: adssaryckkj, start: 1462349138.316177, type: r,
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout: worker: 7}
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:{bucket: rwtest, duration: 19171953, key: hjzv, start: 1462349138.333237, type: r,
2016-05-04T01:08:29.976 INFO:teuthology.orchestra.run.smithi067.stdout: worker: 7}
2016-05-04T01:08:29.976 INFO:teuthology.orchestra.run.smithi067.stdout:{bucket: rwtest, duration: 17049789, key: nfgyaikfhe, start: 1462349138.357968, type: r,
2016-05-04T01:08:29.976 INFO:teuthology.orchestra.run.smithi067.stdout: worker: 7}
2016-05-04T01:08:29.976 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,abjsffwivpbq>
2016-05-04T01:08:29.976 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,adihtqwcrtnwnxsjke>
2016-05-04T01:08:29.976 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 0, 3] out_osds: [5, 4] dead_osds: [] live_osds: [4, 0, 1, 3, 5, 2]
2016-05-04T01:08:29.977 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2016-05-04T01:08:29.977 INFO:tasks.thrashosds.thrasher:fixing pg num pool rbd
2016-05-04T01:08:29.977 INFO:teuthology.orchestra.run.smithi067:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph pg dump --format=json'
2016-05-04T01:08:29.978 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,adssaryckkj>
2016-05-04T01:08:29.978 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,ddsxbmwmbfiejywsq>
2016-05-04T01:08:29.978 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,hjzv>
2016-05-04T01:08:29.978 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,mluirfqd>
2016-05-04T01:08:29.992 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,nfgyaikfhe>
2016-05-04T01:08:30.008 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,odobxwoxcllsrjtoryn>
2016-05-04T01:08:30.023 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,sxpbzfkbuplsnbgnck>
2016-05-04T01:08:30.037 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,zurtbyfljkhier>
2016-05-04T01:08:30.109 INFO:teuthology.orchestra.run.smithi067.stderr:Traceback (most recent call last):
2016-05-04T01:08:30.110 INFO:teuthology.orchestra.run.smithi067.stderr:  File "/home/ubuntu/cephtest/s3-tests/virtualenv/bin/s3tests-test-readwrite", line 9, in <module>
2016-05-04T01:08:30.110 INFO:teuthology.orchestra.run.smithi067.stderr:    load_entry_point('s3tests', 'console_scripts', 's3tests-test-readwrite')()
2016-05-04T01:08:30.110 INFO:teuthology.orchestra.run.smithi067.stderr:  File "/home/ubuntu/cephtest/s3-tests/s3tests/readwrite.py", line 256, in main
2016-05-04T01:08:30.111 INFO:teuthology.orchestra.run.smithi067.stderr:    trace=temp_dict['error']['traceback'])
2016-05-04T01:08:30.111 INFO:teuthology.orchestra.run.smithi067.stderr:Exception: exception:
2016-05-04T01:08:30.111 INFO:teuthology.orchestra.run.smithi067.stderr:  timed out ...
/a/sage-2016-05-03_05:21:18-rados-wip-sage-testing2---basic-smithi/165816
I've seen several of these in the last week.
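The err=-5 values in the RGW log above are negative errnos; errno 5 is EIO, which matches the err_no=5 in the set_req_state_err warning. A quick stdlib check (Python, added here purely for illustration):

```python
import errno
import os

# err=-5 in the RGW log is a negative errno; 5 is EIO.
err = -5
assert -err == errno.EIO
# The strerror text is platform-dependent; on Linux it is "Input/output error".
print(os.strerror(-err))
```

This is why the gateway "resorts to 500": it maps the low-level I/O failure to a generic server error.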
Updated by Sage Weil almost 8 years ago
/a/yuriw-2016-05-05_09:55:50-rados-wip-yuri-testing2---basic-smithi/172261
Updated by Sage Weil almost 8 years ago
enabled more debug in the rados task (rgw 20, ms 1)
Updated by Sage Weil almost 8 years ago
/a/yuriw-2016-05-07_09:11:04-rados-wip-yuri-testing---basic-smithi/176703
with logs
Updated by Yehuda Sadeh almost 8 years ago
some really slow rados operations there:
2016-05-07 18:45:36.853308 7fdaa97fa700 1 -- 172.21.15.37:0/1566506183 >> 172.21.15.37:6804/4767 conn(0x7fdb6005aa00 sd=21 :-1 s=STATE_OPEN pgs=104 cs=1 l=1). == tx == 0x7fda7801f030 osd_op(unknown.0.0:9429 12.7f38311 0f0af729-3187-4c0b-b2d8-d27660b0e8ad.4119.1__shadow_.6ZzI3BvQ3SywWN45rCX5JNoQljbjjAm_1 [write 1048576~333192] snapc 0=[] ack+ondisk+write+known_if_redirected e89) v7
...
2016-05-07 18:46:27.802345 7fdb6dd7d700 1 -- 172.21.15.37:0/1566506183 >> 172.21.15.37:6804/4767 conn(0x7fdb5c639b80 sd=-1 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=1). == tx == 0x7fdb5c0da080 osd_op(unknown.0.0:9429 12.7f38311 0f0af729-3187-4c0b-b2d8-d27660b0e8ad.4119.1__shadow_.6ZzI3BvQ3SywWN45rCX5JNoQljbjjAm_1 [write 1048576~333192] snapc 0=[] RETRY=1 ack+ondisk+retry+write+known_if_redirected e117) v7
...
2016-05-07 18:46:27.812301 7fdb6d57c700 1 -- 172.21.15.37:0/1566506183 >> 172.21.15.37:6804/4767 conn(0x7fdb50035f60 sd=21 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=1). == tx == 0x7fdb50023660 osd_op(unknown.0.0:9429 12.7f38311 0f0af729-3187-4c0b-b2d8-d27660b0e8ad.4119.1__shadow_.6ZzI3BvQ3SywWN45rCX5JNoQljbjjAm_1 [write 1048576~333192] snapc 0=[] RETRY=2 ack+ondisk+retry+write+known_if_redirected e117) v7
...
2016-05-07 18:47:48.085324 7fdb6cd7b700 1 -- 172.21.15.37:0/1566506183 >> 172.21.15.37:6804/4767 conn(0x7fdb50035f60 sd=21 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=109 cs=1 l=1). == rx == osd.1 seq 125 0x7fdb60368010 osd_op_reply(9429 0f0af729-3187-4c0b-b2d8-d27660b0e8ad.4119.1__shadow_.6ZzI3BvQ3SywWN45rCX5JNoQljbjjAm_1 [write 1048576~333192] v89'156 uv156 ondisk = 0) v7
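To put a number on "really slow": the gap between the first transmission of osd_op 9429 and its eventual reply in the excerpt above is over two minutes. A quick computation from the logged timestamps (illustrative only):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
# Timestamps taken from the excerpt above: first tx of osd_op 9429,
# and the osd_op_reply that finally arrived after two retries.
first_tx = datetime.strptime("2016-05-07 18:45:36.853308", FMT)
reply = datetime.strptime("2016-05-07 18:47:48.085324", FMT)
latency = (reply - first_tx).total_seconds()
print(round(latency, 1))  # 131.2 seconds for a single ~333 KB write
```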
Updated by Samuel Just almost 8 years ago
Probably also /a/samuelj-2016-05-06_13:08:25-rados-wip-sam-testing-jewel-distro-basic-smithi/174652
Updated by Piotr Nowosielski almost 8 years ago
This is a serious problem. When it occurs, load on the cluster increases dramatically. (My configuration: two Hammer clusters, 100 OSDs each.)
Updated by Orit Wasserman almost 8 years ago
Did it reproduce again on master? What version?
Updated by Peter Sabaini almost 8 years ago
We may have seen something similar: an episode of about 20 minutes. Redacted log excerpt:
2016-06-16 11:13:02.939470 7f5da67fc700 1 ====== req done req=0x7f5da67f67d0 op status=1900 http_status=201 ======
2016-06-16 11:13:02.939500 7f5da67fc700 1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:05:35 +0000] "PUT /swift/xxx HTTP/1.1" 201 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.939548 7f5da67fc700 1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.940345 7f5da67fc700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.940356 7f5da67fc700 1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.940372 7f5da67fc700 1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.940425 7f5da67fc700 1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.940965 7f5da67fc700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.940976 7f5da67fc700 1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.940990 7f5da67fc700 1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.941021 7f5da67fc700 1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.941641 7f5da67fc700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.941652 7f5da67fc700 1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.941668 7f5da67fc700 1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.941699 7f5da67fc700 1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.942308 7f5da67fc700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.942319 7f5da67fc700 1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.942335 7f5da67fc700 1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.942367 7f5da67fc700 1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.943020 7f5da67fc700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.943031 7f5da67fc700 1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.943047 7f5da67fc700 1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.943081 7f5da67fc700 1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.943646 7f5da67fc700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.943655 7f5da67fc700 1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.943669 7f5da67fc700 1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.943699 7f5da67fc700 1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.944206 7f5da67fc700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.944216 7f5da67fc700 1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.944229 7f5da67fc700 1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.944271 7f5da67fc700 1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.962268 7f5da67fc700 0 validated token: services:ceilometer expires: 1466078787
2016-06-16 11:13:02.965641 7f5da67fc700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.965658 7f5da67fc700 1 ====== req done req=0x7f5da67f67d0 op status=0 http_status=404 ======
Updated by Peter Sabaini almost 8 years ago
The above ^^ is from a 200-OSD cluster running 10.2.0, FWIW.
Updated by Jarek Owsiewski almost 8 years ago
Piotr Nowosielski wrote:
This is a serious problem. When it occurs, load on the cluster increases dramatically. (My configuration: two Hammer clusters, 100 OSDs each.)
Now we have 153 OSDs in each cluster. The situation hasn't changed:
2016-06-17 10:33:07.740353 7efc78f41700 0 ERROR: s->cio->print() returned err=-5
2016-06-17 10:33:07.740357 7efc78f41700 0 ERROR: s->cio->print() returned err=-5
2016-06-17 10:33:07.740360 7efc78f41700 0 ERROR: s->cio->print() returned err=-5
2016-06-17 10:33:07.740371 7efc78f41700 0 ERROR: s->cio->complete_header() returned err=-5
The issue occurs mostly during deep-scrub operations and when throughput is high, around 400k PUT/HEAD requests per hour. GET requests are not affected in this situation.
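For scale, 400k PUT/HEAD per hour works out to roughly 111 requests per second (simple arithmetic, not a figure from the tracker):

```python
# Convert the reported write-path request rate to requests per second.
requests_per_hour = 400_000
requests_per_sec = requests_per_hour / 3600
print(round(requests_per_sec))  # 111
```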
Updated by Yehuda Sadeh almost 8 years ago
- Assignee set to Samuel Just
Seems like OSD-side slowness.
Updated by Samuel Just over 7 years ago
- Assignee deleted (Samuel Just)
- Priority changed from Urgent to High
Yehuda: I don't quite follow, does RGW report -EIO if the OSD is slow?
Updated by Sage Weil over 7 years ago
- Status changed from New to Fix Under Review
- Priority changed from High to Urgent
The problem is simple: we are thrashing with min_size=size=2, and the readwrite test doesn't tolerate really slow rados ops, which can stall for as long as the thrasher chooses to keep more than 0 OSDs down at any one time.
I think we should just remove the readwrite test from this part of the test matrix.
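A toy model of the failure mode described above (all values hypothetical; the real s3tests readwrite timeouts differ): any client that enforces a fixed per-op timeout will report failure whenever a thrasher-induced stall outlasts that timeout, no matter how healthy the op eventually completes.

```python
# Hypothetical client-side limit; the actual s3tests value differs.
CLIENT_TIMEOUT_S = 30.0

def op_outcome(cluster_delay_s: float) -> str:
    """Classify an op by whether the cluster-side stall exceeds the timeout."""
    return "ok" if cluster_delay_s <= CLIENT_TIMEOUT_S else "timed out"

print(op_outcome(2.0))    # a fast op on a healthy cluster
print(op_outcome(131.0))  # an op stalled while the thrasher kept OSDs down
```

With min_size == size == 2, a write blocks whenever a replica's OSD is down, so such stalls are an expected consequence of the thrashing configuration rather than an RGW bug.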
Updated by Sage Weil over 7 years ago
- Status changed from Fix Under Review to Resolved