
Bug #15727

EIO during readwrite test

Added by Sage Weil over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
Start date:
05/04/2016
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

2016-05-04 08:08:27.295583 7fb935ff3700  1 civetweb: 0x7fb99c0008e0: 172.21.15.67 - - [04/May/2016:08:08:27 +0000] "PUT /rwtest/sxpbzfkbuplsnbgnck HTTP/1.1" 200 0 - Boto/2.40.0 Python/2.7.6 Linux/3.13.0-85-generic
2016-05-04 08:08:27.299134 7fb931feb700  1 ====== starting new request req=0x7fb931fe57d0 =====
2016-05-04 08:08:27.300350 7fb931feb700  0 WARNING: set_req_state_err err_no=5 resorting to 500
2016-05-04 08:08:27.300369 7fb931feb700  0 ERROR: s->cio->send_content_length() returned err=-5
2016-05-04 08:08:27.300372 7fb931feb700  0 ERROR: s->cio->print() returned err=-5
2016-05-04 08:08:27.300373 7fb931feb700  0 ERROR: STREAM_IO(s)->print() returned err=-5
2016-05-04 08:08:27.300377 7fb931feb700  0 ERROR: STREAM_IO(s)->complete_header() returned err=-5

teuthology fails like so:
2016-05-04T01:08:29.974 INFO:teuthology.orchestra.run.smithi067.stdout:  worker: 1}
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:{bucket: rwtest, duration: 31138897, key: zurtbyfljkhier, start: 1462349138.280538,
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:  type: r, worker: 7}
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:{bucket: rwtest, duration: 16851902, key: abjsffwivpbq, start: 1462349138.30514, type: r,
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:  worker: 1}
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:{bucket: rwtest, duration: 14437199, key: adssaryckkj, start: 1462349138.316177, type: r,
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:  worker: 7}
2016-05-04T01:08:29.975 INFO:teuthology.orchestra.run.smithi067.stdout:{bucket: rwtest, duration: 19171953, key: hjzv, start: 1462349138.333237, type: r,
2016-05-04T01:08:29.976 INFO:teuthology.orchestra.run.smithi067.stdout:  worker: 7}
2016-05-04T01:08:29.976 INFO:teuthology.orchestra.run.smithi067.stdout:{bucket: rwtest, duration: 17049789, key: nfgyaikfhe, start: 1462349138.357968, type: r,
2016-05-04T01:08:29.976 INFO:teuthology.orchestra.run.smithi067.stdout:  worker: 7}
2016-05-04T01:08:29.976 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,abjsffwivpbq>
2016-05-04T01:08:29.976 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,adihtqwcrtnwnxsjke>
2016-05-04T01:08:29.976 INFO:tasks.thrashosds.thrasher:in_osds:  [1, 2, 0, 3] out_osds:  [5, 4] dead_osds:  [] live_osds:  [4, 0, 1, 3, 5, 2]
2016-05-04T01:08:29.977 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2016-05-04T01:08:29.977 INFO:tasks.thrashosds.thrasher:fixing pg num pool rbd
2016-05-04T01:08:29.977 INFO:teuthology.orchestra.run.smithi067:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph pg dump --format=json'
2016-05-04T01:08:29.978 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,adssaryckkj>
2016-05-04T01:08:29.978 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,ddsxbmwmbfiejywsq>
2016-05-04T01:08:29.978 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,hjzv>
2016-05-04T01:08:29.978 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,mluirfqd>
2016-05-04T01:08:29.992 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,nfgyaikfhe>
2016-05-04T01:08:30.008 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,odobxwoxcllsrjtoryn>
2016-05-04T01:08:30.023 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,sxpbzfkbuplsnbgnck>
2016-05-04T01:08:30.037 INFO:teuthology.orchestra.run.smithi067.stderr:Cleaning bucket <Bucket: rwtest> key <Key: rwtest,zurtbyfljkhier>
2016-05-04T01:08:30.109 INFO:teuthology.orchestra.run.smithi067.stderr:Traceback (most recent call last):
2016-05-04T01:08:30.110 INFO:teuthology.orchestra.run.smithi067.stderr:  File "/home/ubuntu/cephtest/s3-tests/virtualenv/bin/s3tests-test-readwrite", line 9, in <module>
2016-05-04T01:08:30.110 INFO:teuthology.orchestra.run.smithi067.stderr:    load_entry_point('s3tests', 'console_scripts', 's3tests-test-readwrite')()
2016-05-04T01:08:30.110 INFO:teuthology.orchestra.run.smithi067.stderr:  File "/home/ubuntu/cephtest/s3-tests/s3tests/readwrite.py", line 256, in main
2016-05-04T01:08:30.111 INFO:teuthology.orchestra.run.smithi067.stderr:    trace=temp_dict['error']['traceback'])
2016-05-04T01:08:30.111 INFO:teuthology.orchestra.run.smithi067.stderr:Exception: exception:
2016-05-04T01:08:30.111 INFO:teuthology.orchestra.run.smithi067.stderr: timed out
...

/a/sage-2016-05-03_05:21:18-rados-wip-sage-testing2---basic-smithi/165816

I've seen several of these in the last week.
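The WARNING line above (`set_req_state_err err_no=5 resorting to 500`) describes RGW's fallback behavior: an errno it has no specific HTTP mapping for (here EIO, errno 5) is reported to the client as 500. A minimal sketch of that fallback; the mapping table here is illustrative, not RGW's actual (much larger) table in `rgw_common.cc`:

```python
import errno

# Illustrative errno -> HTTP status mapping. Unknown errors fall back
# to 500, which is what "resorting to 500" in the log line means.
ERRNO_TO_HTTP = {
    errno.ENOENT: 404,
    errno.EACCES: 403,
    errno.EEXIST: 409,
}

def set_req_state_err(err_no):
    status = ERRNO_TO_HTTP.get(err_no)
    if status is None:
        # e.g. EIO (errno 5) bubbling up from a failed/stalled rados op
        return 500
    return status
```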

History

#1 Updated by Sage Weil over 3 years ago

/a/yuriw-2016-05-05_09:55:50-rados-wip-yuri-testing2---basic-smithi/172261

#2 Updated by Sage Weil over 3 years ago

enabled more debug in the rados task (rgw 20, ms 1)

#3 Updated by Sage Weil over 3 years ago

/a/yuriw-2016-05-07_09:11:04-rados-wip-yuri-testing---basic-smithi/176703

with logs

#4 Updated by Yehuda Sadeh over 3 years ago

some really slow rados operations there:

2016-05-07 18:45:36.853308 7fdaa97fa700  1 -- 172.21.15.37:0/1566506183 >> 172.21.15.37:6804/4767 conn(0x7fdb6005aa00 sd=21 :-1 s=STATE_OPEN pgs=104 cs=1 l=1). == tx == 0x7fda7801f030 osd_op(unknown.0.0:9429 12.7f38311 0f0af729-3187-4c0b-b2d8-d27660b0e8ad.4119.1__shadow_.6ZzI3BvQ3SywWN45rCX5JNoQljbjjAm_1 [write 1048576~333192] snapc 0=[] ack+ondisk+write+known_if_redirected e89) v7
...
2016-05-07 18:46:27.802345 7fdb6dd7d700  1 -- 172.21.15.37:0/1566506183 >> 172.21.15.37:6804/4767 conn(0x7fdb5c639b80 sd=-1 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=1). == tx == 0x7fdb5c0da080 osd_op(unknown.0.0:9429 12.7f38311 0f0af729-3187-4c0b-b2d8-d27660b0e8ad.4119.1__shadow_.6ZzI3BvQ3SywWN45rCX5JNoQljbjjAm_1 [write 1048576~333192] snapc 0=[] RETRY=1 ack+ondisk+retry+write+known_if_redirected e117) v7
...
2016-05-07 18:46:27.812301 7fdb6d57c700  1 -- 172.21.15.37:0/1566506183 >> 172.21.15.37:6804/4767 conn(0x7fdb50035f60 sd=21 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=1). == tx == 0x7fdb50023660 osd_op(unknown.0.0:9429 12.7f38311 0f0af729-3187-4c0b-b2d8-d27660b0e8ad.4119.1__shadow_.6ZzI3BvQ3SywWN45rCX5JNoQljbjjAm_1 [write 1048576~333192] snapc 0=[] RETRY=2 ack+ondisk+retry+write+known_if_redirected e117) v7
...
2016-05-07 18:47:48.085324 7fdb6cd7b700  1 -- 172.21.15.37:0/1566506183 >> 172.21.15.37:6804/4767 conn(0x7fdb50035f60 sd=21 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=109 cs=1 l=1). == rx == osd.1 seq 125 0x7fdb60368010 osd_op_reply(9429 0f0af729-3187-4c0b-b2d8-d27660b0e8ad.4119.1__shadow_.6ZzI3BvQ3SywWN45rCX5JNoQljbjjAm_1 [write 1048576~333192] v89'156 uv156 ondisk = 0) v7
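The retry trail above runs from the first transmit of op 9429 at 18:45:36.853308 to its `osd_op_reply` at 18:47:48.085324. A quick way to confirm how long that single write was in flight (timestamps copied from the log lines):

```python
from datetime import datetime

def op_latency(first_tx, reply, fmt="%H:%M:%S.%f"):
    """Elapsed seconds between two same-day log timestamps."""
    return (datetime.strptime(reply, fmt)
            - datetime.strptime(first_tx, fmt)).total_seconds()

# First tx of osd_op(...:9429 ...) vs. its osd_op_reply:
print(op_latency("18:45:36.853308", "18:47:48.085324"))  # 131.232016
```

Over two minutes for one 333 KB shadow-object write, well past what the readwrite test will wait for.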

#5 Updated by Samuel Just over 3 years ago

Probably also /a/samuelj-2016-05-06_13:08:25-rados-wip-sam-testing-jewel-distro-basic-smithi/174652

#6 Updated by Piotr Nowosielski over 3 years ago

This is a serious problem. When it occurs, operation times on the cluster increase dramatically. (My configuration: two clusters running Hammer, 100 OSDs each.)

#7 Updated by Orit Wasserman over 3 years ago

Did it reproduce again in master? What version?

#8 Updated by Peter Sabaini over 3 years ago

We might have had something similar, an episode of about 20min. Redacted log excerpt:

2016-06-16 11:13:02.939470 7f5da67fc700  1 ====== req done req=0x7f5da67f67d0 op status=1900 http_status=201 ======
2016-06-16 11:13:02.939500 7f5da67fc700  1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:05:35 +0000] "PUT /swift/xxx HTTP/1.1" 201 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.939548 7f5da67fc700  1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.940345 7f5da67fc700  0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.940356 7f5da67fc700  1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.940372 7f5da67fc700  1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.940425 7f5da67fc700  1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.940965 7f5da67fc700  0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.940976 7f5da67fc700  1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.940990 7f5da67fc700  1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.941021 7f5da67fc700  1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.941641 7f5da67fc700  0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.941652 7f5da67fc700  1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.941668 7f5da67fc700  1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.941699 7f5da67fc700  1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.942308 7f5da67fc700  0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.942319 7f5da67fc700  1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.942335 7f5da67fc700  1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.942367 7f5da67fc700  1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.943020 7f5da67fc700  0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.943031 7f5da67fc700  1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.943047 7f5da67fc700  1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.943081 7f5da67fc700  1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.943646 7f5da67fc700  0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.943655 7f5da67fc700  1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.943669 7f5da67fc700  1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.943699 7f5da67fc700  1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.944206 7f5da67fc700  0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.944216 7f5da67fc700  1 ====== req done req=0x7f5da67f67d0 op status=-2010 http_status=400 ======
2016-06-16 11:13:02.944229 7f5da67fc700  1 civetweb: 0x7f5d880008c0: 10.101.140.222 - - [16/Jun/2016:11:13:02 +0000] "PUT /swift/xxx HTTP/1.1" 400 0 - jclouds/1.9.2 java/1.8.0_91
2016-06-16 11:13:02.944271 7f5da67fc700  1 ====== starting new request req=0x7f5da67f67d0 =====
2016-06-16 11:13:02.962268 7f5da67fc700  0 validated token: services:ceilometer expires: 1466078787
2016-06-16 11:13:02.965641 7f5da67fc700  0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
2016-06-16 11:13:02.965658 7f5da67fc700  1 ====== req done req=0x7f5da67f67d0 op status=0 http_status=404 ======

#9 Updated by Peter Sabaini over 3 years ago

The above is from a 200-OSD cluster running 10.2.0, FWIW.

#10 Updated by Jarek Owsiewski over 3 years ago

Piotr Nowosielski wrote:

This is a serious problem. When it occurs, operation times on the cluster increase dramatically. (My configuration: two clusters running Hammer, 100 OSDs each.)

Now we have 153 OSDs in each cluster; the situation hasn't changed:

"2016-06-17 10:33:07.740353 7efc78f41700 0 ERROR: s->cio->print() returned err=-5
2016-06-17 10:33:07.740357 7efc78f41700 0 ERROR: s->cio->print() returned err=-5
2016-06-17 10:33:07.740360 7efc78f41700 0 ERROR: s->cio->print() returned err=-5
2016-06-17 10:33:07.740371 7efc78f41700 0 ERROR: s->cio->complete_header() returned err=-5"

The issue occurs mostly during deep-scrub operations and when throughput is high (around 400k PUT/HEAD requests per hour); GET requests are not affected.

#11 Updated by Yehuda Sadeh over 3 years ago

  • Assignee set to Samuel Just

Seems like OSD-side slowness.

#12 Updated by Yehuda Sadeh about 3 years ago

  • Project changed from rgw to Ceph

#13 Updated by Samuel Just about 3 years ago

  • Assignee deleted (Samuel Just)
  • Priority changed from Urgent to High

Yehuda: I don't quite follow, does RGW report -EIO if the OSD is slow?

#14 Updated by Sage Weil about 3 years ago

  • Status changed from New to Need Review
  • Priority changed from High to Urgent

The problem is simple: we are thrashing with min_size=size=2, and the readwrite test doesn't tolerate really slow rados ops, which can stall for as long as the thrasher decides to keep more than 0 OSDs down at any one time.

I think we should just remove the readwrite test from this part of the test matrix.

https://github.com/ceph/ceph-qa-suite/pull/1206
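The failure mode described above can be sketched as a simple condition check (a simplification of Ceph's actual peering logic; the function name is illustrative):

```python
def pg_accepts_writes(acting_replicas, min_size):
    """Simplified: a PG serves writes only while its acting set
    holds at least min_size replicas."""
    return acting_replicas >= min_size

# With size = min_size = 2, a single down replica stalls writes to
# that PG for as long as the thrasher keeps the OSD down -- the
# client op just hangs, and the readwrite test eventually times out.
print(pg_accepts_writes(2, 2))  # True  -- healthy
print(pg_accepts_writes(1, 2))  # False -- writes blocked
```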

#15 Updated by Sage Weil about 3 years ago

  • Status changed from Need Review to Resolved
