Bug #15351

closed

librbd: resize timed out

Added by Josh Durgin about 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: High
Assignee: Jason Dillaman
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/teuthology-2016-03-30_12:01:01-rbd-jewel-distro-basic-smithi/97490/teuthology.log

2016-03-31T16:41:40.793 INFO:teuthology.orchestra.run.smithi059.stdout:4034 trunc       from 0x6e11185 to 0x63ae4fd
2016-03-31T16:41:42.362 INFO:tasks.thrashosds.thrasher:in_osds:  [4, 0, 1, 5, 3] out_osds:  [2] dead_osds:  [] live_osds:  [2, 5, 0, 1, 4, 3]
2016-03-31T16:41:42.362 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2016-03-31T16:41:42.362 INFO:tasks.thrashosds.thrasher:inject_pause on 3
2016-03-31T16:41:42.363 INFO:tasks.thrashosds.thrasher:Testing filestore_inject_stall pause injection for duration 3
2016-03-31T16:41:42.363 INFO:tasks.thrashosds.thrasher:Checking after 0, should_be_down=False
2016-03-31T16:41:42.363 INFO:teuthology.orchestra.run.smithi032:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok config set filestore_inject_stall 3'
2016-03-31T16:41:45.845 INFO:teuthology.orchestra.run.smithi059.stdout:rbd_resize(104523005) failed
2016-03-31T16:41:45.845 INFO:teuthology.orchestra.run.smithi059.stdout:dotruncate: ops->resize: Connection timed out
2016-03-31T16:41:45.845 INFO:teuthology.orchestra.run.smithi059.stdout:LOG DUMP (4034 total operations):
Actions #1

Updated by Jason Dillaman about 8 years ago

  • Status changed from New to In Progress
  • Assignee set to Jason Dillaman
Actions #2

Updated by Jason Dillaman about 8 years ago

Having trouble figuring out where the ETIMEDOUT error came from. In theory, this error should not bubble up from the Objecter unless timeouts were configured (which they weren't). ERESTART would make more sense to me, since the watch timed out, which would cause the resize to be canceled with ERESTART.

Actions #3

Updated by Josh Durgin about 8 years ago

Not sure where it's coming from either. Seems like it would have to be one of the notifies or something watch related.

Actions #4

Updated by Jason Dillaman about 8 years ago

  • Priority changed from Urgent to High
Actions #6

Updated by Jason Dillaman about 8 years ago

Pretty consistent that it happens after a "Testing filestore_inject_stall pause injection for duration 3" thrasher operation.
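
For anyone checking the same correlation in other runs, here is a minimal sketch of how the log could be scanned. It is only an illustration: the sample lines are copied from the excerpt in the description, and the helper name `last_injection_before_failures` and both regular expressions are assumptions of this sketch, not part of teuthology.

```python
import re

# Sample lines copied from the log excerpt in the description.
SAMPLE_LOG = """\
2016-03-31T16:41:42.362 INFO:tasks.thrashosds.thrasher:inject_pause on 3
2016-03-31T16:41:42.363 INFO:tasks.thrashosds.thrasher:Testing filestore_inject_stall pause injection for duration 3
2016-03-31T16:41:42.363 INFO:tasks.thrashosds.thrasher:Checking after 0, should_be_down=False
2016-03-31T16:41:45.845 INFO:teuthology.orchestra.run.smithi059.stdout:rbd_resize(104523005) failed
"""

# Hypothetical patterns, written to match the lines above.
THRASHER_ACTION = re.compile(r"tasks\.thrashosds\.thrasher:(Testing .*)$")
RESIZE_FAILED = re.compile(r"rbd_resize\(\d+\) failed")


def last_injection_before_failures(lines):
    """For each resize failure, return the most recent 'Testing ...' thrasher line seen before it."""
    results = []
    last_action = None
    for line in lines:
        match = THRASHER_ACTION.search(line)
        if match:
            last_action = match.group(1)
        elif RESIZE_FAILED.search(line):
            results.append(last_action)
    return results


print(last_injection_before_failures(SAMPLE_LOG.splitlines()))
```

Running this over a full teuthology.log (one line per list element) would show which injection preceded each failed resize.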

Actions #8

Updated by Jason Dillaman about 8 years ago

  • Status changed from In Progress to Fix Under Review
Actions #9

Updated by Jason Dillaman about 8 years ago

That was a subtle one: I couldn't reproduce it by killing OSDs, hanging OSDs, etc. It needed a notify timeout during the final header update notification.
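
To illustrate the failure mode described above, here is a toy model of how a timeout in a final notification step could become the return value of an otherwise successful resize. This is not librbd code and does not claim to show the actual fix; the functions `notify_header_update` and `resize` and the flag `propagate_notify_error` are hypothetical names invented for this sketch.

```python
import errno


def notify_header_update(notify_times_out):
    """Hypothetical stand-in for the final header update notification."""
    return -errno.ETIMEDOUT if notify_times_out else 0


def resize(propagate_notify_error, notify_times_out):
    """Toy resize: the size change itself is assumed to succeed, then peers are notified."""
    r = 0  # assumption: the header was already updated successfully
    notify_r = notify_header_update(notify_times_out)
    if propagate_notify_error and notify_r < 0:
        # The caller sees a timeout even though the resize itself worked.
        return notify_r
    return r


# Propagating the notification result surfaces ETIMEDOUT to the caller.
print(resize(propagate_notify_error=True, notify_times_out=True) == -errno.ETIMEDOUT)
# Treating the notification as best-effort reports the resize result instead.
print(resize(propagate_notify_error=False, notify_times_out=True) == 0)
```

In this model the error only appears when the notification happens to time out at the very end of the operation, which would be consistent with it being hard to reproduce by killing or hanging OSDs.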

Actions #10

Updated by Jason Dillaman about 8 years ago

  • Status changed from Fix Under Review to Resolved