Bug #3895

librados test hang during mon thrashing

Added by Sage Weil about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: Joao Eduardo Luis
Category: Monitor
Target version: -
% Done: 0%
Source: Development
Severity: 3 - minor

Description

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-01-21_19:00:03-regression-master-testing-gcov/2929

The job was:

kernel:
  kdb: true
  sha1: e0b49868d3629708eda593b6739cb78f33ab238a
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 3399860de2724281ee024b52f461b60f769ee0ee
  s3tests:
    branch: master
  workunit:
    sha1: 3399860de2724281ee024b52f461b60f769ee0ee
roles:
- - mon.a
  - mon.b
  - osd.0
  - osd.1
  - osd.2
- - mon.c
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- mon_thrash:
    revive_delay: 20
    thrash_delay: 1
- ceph-fuse: null
- workunit:
    clients:
      client.0:
      - rados/test.sh
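
For context, the mon_thrash task repeatedly kills a random monitor, waits, and revives it while the workload runs, forcing repeated elections. A minimal sketch of that loop under the revive_delay/thrash_delay settings above (kill_mon/revive_mon are hypothetical stand-ins, not teuthology's actual code):

import random
import time

# Hypothetical helpers standing in for teuthology's daemon control;
# this is a sketch of the thrashing loop, not the real mon_thrash task.
def kill_mon(mon):
    print('killing %s' % mon)

def revive_mon(mon):
    print('reviving %s' % mon)

def thrash_mons(mons, revive_delay=20, thrash_delay=1, rounds=5):
    for _ in range(rounds):
        victim = random.choice(mons)
        kill_mon(victim)          # surviving mons must call an election
        time.sleep(revive_delay)  # revive_delay: 20 in the job above
        revive_mon(victim)
        time.sleep(thrash_delay)  # thrash_delay: 1 in the job above

thrash_mons(['mon.a', 'mon.b', 'mon.c'])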

and the test is hung on:

2013-01-21T20:42:38.430 INFO:teuthology.task.workunit.client.0.out:[ RUN      ] LibRadosAio.IsSafePP
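
For context, LibRadosAio.IsSafePP issues an async write and blocks until it is acknowledged as safe (committed to disk), so a cluster that cannot make progress during mon elections hangs exactly there. A rough equivalent using the librados Python bindings (a sketch of what the C++ gtest waits on, not the test itself):

import rados

# Sketch of the AIO "safe" wait the hung test performs; pool and
# object names are arbitrary placeholders.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

comp = ioctx.aio_write('foo', b'bar')
comp.wait_for_complete()  # acked: in memory on all replicas
comp.wait_for_safe()      # safe: committed to disk; this is the hang point

ioctx.close()
cluster.shutdown()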

Note that I also saw an ENOENT on a just-created pool the other day, so there are probably several similar bugs (or, hopefully, the same pattern of bug) triggered by the mon thrashing.

yay testing!


Files

teuthology-2.log (7.26 MB) teuthology-2.log Sam Lang, 02/07/2013 12:49 PM
teuthology.log (12 MB) teuthology.log Sam Lang, 02/07/2013 12:49 PM
mon.a.log (6.05 MB) mon.a.log Sam Lang, 02/07/2013 01:07 PM
mon.b.log (5.69 MB) mon.b.log Sam Lang, 02/07/2013 01:07 PM
mon.c.log (9.58 MB) mon.c.log Sam Lang, 02/07/2013 01:07 PM

Actions #1

Updated by Sam Lang about 11 years ago

Attached log files from the hung runs (librados and kernel untar).

Actions #2

Updated by Sam Lang about 11 years ago

Attached mon logs from a recent run where the rados test seemed to hang for a while (100 mon elections or so). The logs are with debug mon = 20, debug ms = 1.
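
For anyone reproducing this, those debug levels can be set with the same overrides convention the job yaml above uses, e.g.:

overrides:
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1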

Actions #3

Updated by Sage Weil about 11 years ago

  • Status changed from 12 to Fix Under Review
  • Priority changed from High to Urgent

Tracked this down; see wip-mon-eagain.

A QA run against the rados API tests seems to confirm that this fixes it (it was easily reproduced previously).
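
For illustration only: the branch name suggests the monitors can return EAGAIN while a quorum is forming, and callers need to retry rather than fail or hang. A generic retry-on-EAGAIN pattern (send_mon_command here is a hypothetical stand-in, not the actual fix in wip-mon-eagain):

import errno
import time

# Hypothetical illustration of retrying a monitor request that fails
# with EAGAIN during an election; not the actual wip-mon-eagain change.
def send_with_retry(send_mon_command, cmd, retries=10, delay=1.0):
    for _ in range(retries):
        try:
            return send_mon_command(cmd)
        except OSError as e:
            if e.errno != errno.EAGAIN:
                raise
            time.sleep(delay)  # no quorum yet; wait and retry
    raise RuntimeError('mon command still failing with EAGAIN: %r' % (cmd,))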

Actions #4

Updated by Joao Eduardo Luis about 11 years ago

wip-mon-eagain looks good

Actions #5

Updated by Sage Weil about 11 years ago

  • Status changed from Fix Under Review to Resolved