Project

General

Profile

Bug #19901

LibRadosMiscConnectFailure.ConnectFailure hang

Added by Sage Weil 7 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
-
Category:
-
Target version:
-
Start date:
05/10/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
jewel,kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

I have been seeing test.sh failures for a while where the only clue (that I see) is

2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ wait 269197
2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ for t in '"${!pids[@]}"'
2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ pid=269342
2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ wait 269342
2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ for t in '"${!pids[@]}"'
2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ pid=269169
2017-05-10T05:15:33.027 INFO:tasks.workunit.client.0.smithi183.stderr:+ wait 269169
2017-05-10T08:05:35.700 INFO:tasks.workunit.client.0.smithi183.stderr:++ cleanup
2017-05-10T08:05:35.840 INFO:tasks.workunit.client.0.smithi183.stderr:++ pkill -P 269150
2017-05-10T08:05:35.844 INFO:tasks.workunit.client.0.smithi183.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 9: 269169 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$f.log | sed \
"s/^/$r: /\"" 
2017-05-10T08:05:35.846 INFO:tasks.workunit.client.0.smithi183.stderr:++ true
2017-05-10T08:05:35.849 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2017-05-10T08:05:35.852 INFO:teuthology.orchestra.run.smithi183:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
2017-05-10T08:05:35.972 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-sage-testing2/qa/tasks/workunit.py", line 176, in task
    config.get('env'), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:

I can't tell which child is being killed. This is been happening for a while but I haven't been tracking it. Time to start!

/a/sage-2017-05-10_03:08:19-rados-wip-sage-testing2---basic-smithi/1119773


Related issues

Copied to Ceph - Backport #20270: jewel: LibRadosMiscConnectFailure.ConnectFailure hang Resolved
Copied to Ceph - Backport #20271: kraken: LibRadosMiscConnectFailure.ConnectFailure hang Resolved

History

#1 Updated by Nathan Cutler 7 months ago

  • Backport set to jewel,kraken

#2 Updated by Nathan Cutler 7 months ago

I see it's timing out. Could it be related to #19540 ? I've seen it time out in jewel runs, too.

#3 Updated by Sage Weil 6 months ago

it's LibRadosMiscConnectFailure.ConnectFailure

2017-05-10T05:05:35.819 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2017-05-10 05:05:35.757006 7f44637fe700 10 monclient: dump:
2017-05-10T05:05:35.819 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: epoch 2
2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: fsid 8533721c-4fc0-4293-b3e4-20b21ab38cda
2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: last_changed 2017-05-10 05:05:20.546842
2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: created 2017-05-10 05:04:53.471303
2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 0: 172.21.15.155:6789/0 mon.b
2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 1: 172.21.15.183:6789/0 mon.a
2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2: 172.21.15.183:6790/0 mon.c
2017-05-10T05:05:35.820 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc:
2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2017-05-10 05:05:35.757189 7f44845cd800  1 librados: init done
2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2017-05-10 05:05:35.757289 7f44637fe700  1 -- 172.21.15.183:0/18590574 <== mon.1 172.21.15.183:6789/0 6 ==== mgrmap(e 4) v1 ==== 90+0+0 (3499765727 0 0) 0x7f445c001a20 con 0x56539fdf4550
2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2017-05-10 05:05:35.757496 7f44845cd800 10 librados: watch_flush enter
2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2017-05-10 05:05:35.758028 7f44637fe700 10 client.4129.objecter ms_handle_connect 0x7f445800c550
2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2017-05-10 05:05:35.758041 7f44637fe700  1 -- 172.21.15.183:0/18590574 <== mon.1 172.21.15.183:6789/0 7 ==== osd_map(9..9 src has 1..9) v3 ==== 2484+0+0 (4220760141 0 0) 0x7f445c0010a0 con 0x56539fdf4550
2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2017-05-10 05:05:35.758058 7f44637fe700 10 client.4129.objecter ms_dispatch 0x56539fd12210 osd_map(9..9 src has 1..9) v3
2017-05-10T05:05:35.821 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2017-05-10 05:05:35.758072 7f44637fe700  3 client.4129.objecter handle_osd_map got epochs [9,9] > 0
2017-05-10T05:05:35.822 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2017-05-10 05:05:35.758075 7f44637fe700  3 client.4129.objecter handle_osd_map decoding full epoch 9
2017-05-10T05:05:35.822 INFO:tasks.workunit.client.0.smithi183.stdout:                 api_misc: 2017-test api_c_write_operations on pid 269316

#4 Updated by Sage Weil 6 months ago

  • Subject changed from mystery test.sh failure to LibRadosMiscConnectFailure.ConnectFailure hang
  • Status changed from New to Verified

#5 Updated by Sage Weil 6 months ago

  • Status changed from Verified to Need Review

#6 Updated by Kefu Chai 6 months ago

  • Status changed from Need Review to Resolved

#7 Updated by Kefu Chai 6 months ago

  • Status changed from Resolved to Pending Backport

#8 Updated by Nathan Cutler 6 months ago

  • Copied to Backport #20270: jewel: LibRadosMiscConnectFailure.ConnectFailure hang added

#9 Updated by Nathan Cutler 6 months ago

  • Copied to Backport #20271: kraken: LibRadosMiscConnectFailure.ConnectFailure hang added

#10 Updated by Nathan Cutler 3 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF