Project

General

Profile

Actions

Bug #2860

closed

osd: stuck waiting for pg acting set to change

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
argonaut
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


2012-07-27 19:24:47.581730 7f9e4d22c700  1 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary
2012-07-27 19:24:47.581751 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] _calc_past_interval_range: already have past intervals back to 102505
2012-07-27 19:24:47.581770 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering]  PriorSet: build_prior interval(103334-103337 [66,332,262]/[66] maybe_went_rw)
2012-07-27 19:24:47.581787 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering]  PriorSet: build_prior interval(102724-103333 [66,332,262]/[174,66] maybe_went_rw)
2012-07-27 19:24:47.581804 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering]  PriorSet: build_prior  prior osd.174 is down
2012-07-27 19:24:47.581817 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering]  PriorSet: build_prior interval(102722-102723 [66,332,262]/[174,332] maybe_went_rw)
2012-07-27 19:24:47.581832 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering]  PriorSet: build_prior final: probe 66,262,332 down 174 blocked_by {}
2012-07-27 19:24:47.581845 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] up_thru 103336 < same_since 103338, must notify monitor
2012-07-27 19:24:47.581881 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>:  querying info from osd.262
2012-07-27 19:24:47.581901 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>:  querying info from osd.332
2012-07-27 19:24:47.780280 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] handle_notify 5.27d0( DNE empty n=0 ec=0 les/c 0/0 0/0/0) from osd.262
2012-07-27 19:24:47.780308 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering]  got osd.262 5.27d0( DNE empty n=0 ec=0 les/c 0/0 0/0/0)
2012-07-27 19:24:47.780337 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] update_heartbeat_peers 66,262,332 unchanged
2012-07-27 19:24:47.813217 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] handle_notify 5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102701/102505 102722/103338/103334) from osd.332
2012-07-27 19:24:47.813252 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering]  got osd.332 5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102701/102505 102722/103338/103334)
2012-07-27 19:24:47.813296 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] update_heartbeat_peers 66,262,332 unchanged
2012-07-27 19:24:47.813315 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>:  last maybe_went_rw interval was interval(103334-103337 [66,332,262]/[66] maybe_went_rw)
2012-07-27 19:24:47.813348 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] calc_acting osd.66 5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334)
2012-07-27 19:24:47.813365 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] calc_acting osd.262 5.27d0( DNE empty n=0 ec=0 les/c 0/0 0/0/0)
2012-07-27 19:24:47.813379 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] calc_acting osd.332 5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102701/102505 102722/103338/103334)
2012-07-27 19:24:47.813396 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] calc_acting no suitable info found (incomplete backfills?)
2012-07-27 19:24:47.813453 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 peering] choose_acting failed, marking pg down
2012-07-27 19:24:47.813490 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 down+peering] state<Started/Primary/Peering>: Leaving Peering
2012-07-27 19:24:47.848085 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 down] handle_notify 5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102701/102505 102722/103338/103334) from osd.332
2012-07-27 19:24:47.848124 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 down] state<Started/Primary/Peering/WaitActingChange>: In WaitActingChange, ignoring MNotifyRec
2012-07-27 19:24:48.487392 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 down] handle_notify 5.27d0( empty n=0 ec=22869 les/c 102505/102505 102722/103338/103334) from osd.278
2012-07-27 19:24:48.487427 7f9e4d22c700 10 osd.66 103338 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 down] state<Started/Primary/Peering/WaitActingChange>: In WaitActingChange, ignoring MNotifyRec
2012-07-27 19:24:51.668571 7f9e4d22c700 10 osd.66 103339 Scanning pg pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 down]

... some times goes by ...

2012-07-27 19:25:18.958835 7f9e4d22c700 10 osd.66 103348 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 down] handle_notify 5.27d0( v 71349'1407 (35421'407,71349'1407] n=48 ec=22869 les/c 102725/102505 102722/103338/103334) from osd.174
2012-07-27 19:25:18.958923 7f9e4d22c700 10 osd.66 103348 pg[5.27d0( v 71349'1407 (35421'407,71349'1407] lb 0//0//-1 n=0 ec=22869 les/c 102725/102505 102722/103338/103334) [66,332,262] r=0 lpr=103338 pi=87591-103337/65 lcod 0'0 mlcod 0'0 down] state<Started/Primary/Peering/WaitActingChange>: In WaitActingChange, ignoring MNotifyRec

this is 0.48argonaut-54-g9db7809, congress osd.66

Actions #1

Updated by Sage Weil over 11 years ago

i can reproduce this with:

CEPH_NUM_OSD=4 CEPH_NUM_MDS=1 CEPH_NUM_MON=1 ./vstart.sh  -d -n -x    -o 'osd min pg log entries = 10'
./ceph osd out 0 1
./rados -p rbd bench 30 write -b 1024
./ceph osd out 2 3
./ceph osd in 0 1
sleep 10
./init-ceph stop osd.0
./init-ceph stop osd.1
./ceph osd down 0 1
grep 2\\.6\( out/osd.0.log
Actions #2

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to Fix Under Review
Actions #3

Updated by Sage Weil over 11 years ago

  • Status changed from Fix Under Review to Resolved
  • Backport set to argonaut
Actions

Also available in: Atom PDF