Bug #1617

pgs stuck down and peering with only one osd down and out

Added by Josh Durgin about 8 years ago. Updated almost 8 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: OSD
Target version:
Start date: 10/13/2011
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

From teuthology:~teuthworker/archive/nightly_coverage_2011-10-13/491/teuthology.log:

2011-10-13T15:46:33.996 INFO:teuthology.task.thrashosds.ceph_manager:2011-10-13 15:40:58.190910    pg v1117: 144 pgs: 141 active+clean, 3 down+peering; 27400 MB data, 105 GB used, 719 GB / 869 GB avail
2011-10-13 15:40:58.191662   mds e5: 1/1/1 up {0=0=up:active}
2011-10-13 15:40:58.191713   osd e125: 8 osds: 7 up, 7 in
2011-10-13 15:40:58.191783   log 2011-10-13 12:39:15.188577 mon.0 10.3.14.194:6791/0 63 : [INF] osd.3 out (down for 300.693807)
2011-10-13 15:40:58.191857   mon e1: 3 mons at {0=10.3.14.194:6791/0,1=10.3.14.198:6789/0,2=10.3.14.184:6790/0}
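
For triaging logs like the one above, a small script can pull the per-state PG counts out of the `pg v...` summary line and flag anything that is not `active+clean`. This is only a sketch: the helper name `parse_pg_summary` is hypothetical (it is not part of teuthology or ceph), and the regular expression assumes the exact line layout shown in this log.

```python
import re


def parse_pg_summary(line):
    """Return (total_pgs, {state: count}) from a 'pg vN: ...' status line.

    Hypothetical helper; assumes the format seen in this ticket's log.
    """
    m = re.search(r"pg v\d+: (\d+) pgs: ([^;]+);", line)
    if m is None:
        raise ValueError("no pg summary found in line")
    total = int(m.group(1))
    states = {}
    # The state list looks like "141 active+clean, 3 down+peering".
    for part in m.group(2).split(","):
        count, state = part.strip().split(" ", 1)
        states[state] = int(count)
    return total, states


line = (
    "2011-10-13 15:40:58.190910    pg v1117: 144 pgs: 141 active+clean, "
    "3 down+peering; 27400 MB data, 105 GB used, 719 GB / 869 GB avail"
)
total, states = parse_pg_summary(line)
# Everything that is not active+clean is worth a closer look.
not_clean = {s: n for s, n in states.items() if s != "active+clean"}
print(total, not_clean)
```

For the line quoted in the description, this reports 144 PGs in total, 3 of them in `down+peering`.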

History

#1 Updated by Josh Durgin about 8 years ago

Happened in run 494 as well. These were both rados bench with thrashing.

#2 Updated by Sage Weil about 8 years ago

  • Status changed from New to Rejected

Non-specific, and it predates the prior set refactor.

#3 Updated by Josh Durgin about 8 years ago

  • Status changed from Rejected to New
  • Target version changed from v0.38 to v0.39

Happened again today in teuthology:~teuthworker/archive/nightly_coverage_2011-11-03/1433:

$ LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -c /tmp/cephtest/ceph.conf -s
2011-11-03 14:02:00.316911    pg v6925: 144 pgs: 142 active+clean, 2 down+peering; 126 MB data, 15506 MB used, 3141 GB / 3172 GB avail
2011-11-03 14:02:00.317602   mds e5: 1/1/1 up {0=0=up:active}
2011-11-03 14:02:00.317658   osd e1645: 8 osds: 7 up, 7 in
2011-11-03 14:02:00.317768   log 2011-11-03 14:01:52.600845 osd.6 10.3.14.191:6803/8573 460 : [INF] 0.0p6 scrub ok
2011-11-03 14:02:00.317852   mon e1: 3 mons at {0=10.3.14.133:6791/0,1=10.3.14.167:6789/0,2=10.3.14.170:6790/0}
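
To confirm from such output that only a single OSD is down and out, the `osd e...` line can be checked mechanically. The sketch below uses a hypothetical helper, `parse_osd_summary`, and assumes the `N osds: N up, N in` layout shown in this ticket.

```python
import re


def parse_osd_summary(line):
    """Return epoch, total, up and in counts from an 'osd eN: ...' line.

    Hypothetical helper; assumes the format seen in this ticket's log.
    """
    m = re.search(r"osd e(\d+): (\d+) osds: (\d+) up, (\d+) in", line)
    if m is None:
        raise ValueError("no osd summary found in line")
    epoch, total, up, in_ = (int(g) for g in m.groups())
    return {"epoch": epoch, "total": total, "up": up, "in": in_}


line = "2011-11-03 14:02:00.317658   osd e1645: 8 osds: 7 up, 7 in"
osd = parse_osd_summary(line)
down = osd["total"] - osd["up"]   # OSDs not marked up
out = osd["total"] - osd["in"]    # OSDs not marked in
print(down, out)
```

For the line above, both differences come out to 1, which matches the ticket title: one OSD down and out.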

#4 Updated by Sage Weil almost 8 years ago

  • Target version changed from v0.39 to v0.40

#5 Updated by Sage Weil almost 8 years ago

  • Status changed from New to Won't Fix

The new code will have an explicit 'incomplete' state for PGs when peering fails, instead of leaving them 'stuck'. Let's ignore this and see how the new code fares.

