Bug #3675


osd: hang during initial peering

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: bobtail
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

the initial wait-for-healthy blocked on 2 pgs. ms inject socket failures = 500. everything was up.

no logs, so it's unclear why.

there were lots of messages like

2012-12-21 14:15:45.285022 7fc27eca0700  0 -- 10.214.132.10:6807/5960 >> 10.214.132.10:6802/5955 pipe(0x302cb40 sd=33 :6807 pgs=8 cs=3 l=0).reader got old message 1 <= 3 0x2fb4c40 pg_query(2.13 epoch 4) v2, discarding
2012-12-21 14:15:45.285233 7fc27eca0700  0 -- 10.214.132.10:6807/5960 >> 10.214.132.10:6802/5955 pipe(0x302cb40 sd=33 :6807 pgs=8 cs=3 l=0).reader got old message 2 <= 3 0x2fb4c40 pg_notify(2.14 epoch 4) v4, discarding
2012-12-21 14:15:45.285342 7fc27eca0700  0 -- 10.214.132.10:6807/5960 >> 10.214.132.10:6802/5955 pipe(0x302cb40 sd=33 :6807 pgs=8 cs=3 l=0).reader got old message 3 <= 3 0x2fb4c40 pg_notify(0.16 epoch 4) v4, discarding

but no messages related to the hung pgs.
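The "got old message N <= M ..., discarding" lines are the messenger's replay suppression after a socket-failure reconnect: the reader remembers the highest message sequence it has already delivered on that pipe and drops any resent message at or below it, so the discards themselves are expected with `ms inject socket failures` on. A minimal sketch of that discard rule (hypothetical names, not the actual Pipe reader code):

```python
def deliver(in_seq, incoming):
    """Apply the old-message discard rule from the log excerpt.

    in_seq   -- highest sequence number already delivered on this pipe
    incoming -- iterable of (seq, msg) pairs resent after a reconnect

    Returns the updated in_seq and the messages that were actually
    delivered; anything with seq <= in_seq is a replay and is dropped,
    mirroring "reader got old message 1 <= 3 ..., discarding".
    """
    delivered = []
    for seq, msg in incoming:
        if seq <= in_seq:
            # old message: already seen before the connection reset
            continue
        in_seq = seq
        delivered.append(msg)
    return in_seq, delivered
```

With in_seq = 3 (as in the log), resent messages 1..3 are dropped and only a new message 4 would get through, which is why the discards are benign; the open question in this bug is why nothing at all was logged for the two stuck pgs.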

job was

ubuntu@teuthology:/a/sage-2012-12-21_13:39:03-regression-next-testing-basic/19821$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: ec18aeecd4de479601363849d489668d8f12410c
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: accce830514c6b099eb0e00a8ae34396d14565a3
  s3tests:
    branch: next
  workunit:
    sha1: accce830514c6b099eb0e00a8ae34396d14565a3
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      client.0:
      - rados/test_python.sh

but this was before the workunit ran, so just doing the cluster setup/teardown should reproduce eventually.


Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #4271: osdc/ObjectCacher.cc: 834: FAILED assert(ob->last_commit_tid < tid) (Resolved, Sage Weil, 02/26/2013)
