Actions
Bug #19901
closedLibRadosMiscConnectFailure.ConnectFailure hang
Status:
Resolved
Priority:
Immediate
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
jewel,kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
I have been seeing test.sh failures for a while where the only clue (that I see) is
2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ wait 269197 2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ for t in '"${!pids[@]}"' 2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ pid=269342 2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ wait 269342 2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ for t in '"${!pids[@]}"' 2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ pid=269169 2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ wait 269169 2017-05-10T08:05:35.700 INFO:tasks.workunit.client.0.smithi183.stderr:++ cleanup 2017-05-10T08:05:35.840 INFO:tasks.workunit.client.0.smithi183.stderr:++ pkill -P 269150 2017-05-10T08:05:35.844 INFO:tasks.workunit.client.0.smithi183.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 9: 269169 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$f.log | sed \ "s/^/$r: /\"" 2017-05-10T08:05:35.846 INFO:tasks.workunit.client.0.smithi183.stderr:++ true 2017-05-10T08:05:35.849 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0... 2017-05-10T08:05:35.852 INFO:teuthology.orchestra.run.smithi183:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0' 2017-05-10T08:05:35.972 ERROR:teuthology.run_tasks:Saw exception from tasks. Traceback (most recent call last): File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks manager = run_one_task(taskname, ctx=ctx, config=config) File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task return task(**kwargs) File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-sage-testing2/qa/tasks/workunit.py", line 176, in task config.get('env'), timeout=timeout) File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__ for result in self:
I can't tell which child is being killed. This is been happening for a while but I haven't been tracking it. Time to start!
/a/sage-2017-05-10_03:08:19-rados-wip-sage-testing2---basic-smithi/1119773
Updated by Nathan Cutler almost 7 years ago
I see it's timing out. Could it be related to #19540 ? I've seen it time out in jewel runs, too.
Updated by Sage Weil almost 7 years ago
it's LibRadosMiscConnectFailure.ConnectFailure
2017-05-10T05:05:35.819 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-05-10 05:05:35.757006 7f44637fe700 10 monclient: dump: 2017-05-10T05:05:35.819 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: epoch 2 2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: fsid 8533721c-4fc0-4293-b3e4-20b21ab38cda 2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: last_changed 2017-05-10 05:05:20.546842 2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: created 2017-05-10 05:04:53.471303 2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 0: 172.21.15.155:6789/0 mon.b 2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 1: 172.21.15.183:6789/0 mon.a 2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2: 172.21.15.183:6790/0 mon.c 2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-05-10 05:05:35.757189 7f44845cd800 1 librados: init done 2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-05-10 05:05:35.757289 7f44637fe700 1 -- 172.21.15.183:0/18590574 <== mon.1 172.21.15.183:6789/0 6 ==== mgrmap(e 4) v1 ==== 90+0+0 (3499765727 0 0) 0x7f445c001a20 con 0x56539fdf4550 2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-05-10 05:05:35.757496 7f44845cd800 10 librados: watch_flush enter 2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-05-10 05:05:35.758028 7f44637fe700 10 client.4129.objecter ms_handle_connect 0x7f445800c550 2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-05-10 05:05:35.758041 7f44637fe700 1 -- 172.21.15.183:0/18590574 <== mon.1 172.21.15.183:6789/0 7 ==== osd_map(9..9 src has 1..9) v3 ==== 2484+0+0 (4220760141 0 0) 0x7f445c0010a0 con 0x56539fdf4550 2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-05-10 05:05:35.758058 7f44637fe700 10 client.4129.objecter ms_dispatch 0x56539fd12210 osd_map(9..9 src has 1..9) v3 2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-05-10 05:05:35.758072 7f44637fe700 3 client.4129.objecter handle_osd_map got epochs [9,9] > 0 2017-05-10T05:05:35.822 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-05-10 05:05:35.758075 7f44637fe700 3 client.4129.objecter handle_osd_map decoding full epoch 9 2017-05-10T05:05:35.822 INFO:tasks.workunit.client.0.smithi183.stdout: api_misc: 2017-test api_c_write_operations on pid 269316
Updated by Sage Weil almost 7 years ago
- Subject changed from mystery test.sh failure to LibRadosMiscConnectFailure.ConnectFailure hang
- Status changed from New to 12
Updated by Sage Weil almost 7 years ago
- Status changed from 12 to Fix Under Review
Updated by Kefu Chai almost 7 years ago
- Status changed from Fix Under Review to Resolved
Updated by Kefu Chai almost 7 years ago
- Status changed from Resolved to Pending Backport
Updated by Nathan Cutler almost 7 years ago
- Copied to Backport #20270: jewel: LibRadosMiscConnectFailure.ConnectFailure hang added
Updated by Nathan Cutler almost 7 years ago
- Copied to Backport #20271: kraken: LibRadosMiscConnectFailure.ConnectFailure hang added
Updated by Nathan Cutler over 6 years ago
- Status changed from Pending Backport to Resolved
Actions