Bug #35924

closed

choose_acting picked want > pool size

Added by Sage Weil over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-09-10 21:28:37.713 7f3d9523e700  5 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] enter Started/Primary/Peering/GetLog
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_acting all_info osd.0 4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_acting all_info osd.1 4.c( v 154'691 lc 123'110 (0'0,154'691] local-lis/les=144/145 n=691 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_acting all_info osd.4 4.c( v 159'745 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] find_best_info prefer osd.4 because it is complete while best has missing
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] calc_replicated_acting newest update on osd.4 with 4.c( v 159'745 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
calc_replicated_acting primary is osd.4 with 4.c( v 159'745 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
 osd.0 (up) accepted 4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
 osd.1 (up) accepted 4.c( v 154'691 lc 123'110 (0'0,154'691] local-lis/les=144/145 n=691 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
2018-09-10 21:28:37.713 7f3d9523e700 20 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_async_recovery_replicated candidates by cost are: 
2018-09-10 21:28:37.713 7f3d9523e700 20 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_async_recovery_replicated result want=[4,0,1] async_recovery=
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_acting want [4,0,1] != acting [0,1], requesting pg_temp change

But in that epoch the pool size is 2, so want=[4,0,1] has one more OSD than the pool allows:
pool 4 'unique_pool_2' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 118 flags hashpspool,creating stripe_width 0 application rados

/a/sage-2018-09-10_17:11:45-rados-wip-sage-testing-2018-09-10-0917-distro-basic-smithi/3002911

This leads to the PG getting stuck because the mon now rejects pg_temps that are > the pool size.
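
To make the mismatch concrete, here is a minimal Python sketch of the size check implied above, using the values from the log and the pool line. The function name `exceeds_pool_size` is hypothetical and only illustrates the condition; it is not the mon's or the OSD's actual code, and it does not represent the fix that was merged.

```python
def exceeds_pool_size(osds, pool_size):
    """Return True when a proposed acting set lists more OSDs than the pool size.

    Hypothetical helper for illustration only; not Ceph code.
    """
    return len(osds) > pool_size


# Values copied from the log and the pool line in the description.
want = [4, 0, 1]   # "choose_async_recovery_replicated result want=[4,0,1]"
acting = [0, 1]    # "want [4,0,1] != acting [0,1]"
pool_size = 2      # "pool 4 'unique_pool_2' replicated size 2"

# The requested pg_temp is oversized, while the current acting set is not.
want_too_big = exceeds_pool_size(want, pool_size)
acting_too_big = exceeds_pool_size(acting, pool_size)
print(want_too_big, acting_too_big)
```

With these inputs the requested set has three entries against a pool size of two, which is the condition under which the description says the mon rejects the pg_temp.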


Related issues 3 (0 open, 3 closed)

Related to RADOS - Bug #42577: acting_recovery_backfill won't catch all up peers (Rejected, xie xingguo)
Copied to RADOS - Backport #35962: luminous: choose_acting picked want > pool size (Resolved, Prashant D)
Copied to RADOS - Backport #35963: mimic: choose_acting picked want > pool size (Resolved, Prashant D)

Actions #1

Updated by Sage Weil over 5 years ago

  • Status changed from 12 to Fix Under Review
  • Priority changed from Immediate to Urgent
Actions #2

Updated by Sage Weil over 5 years ago

  • Backport set to mimic,luminous
Actions #3

Updated by Josh Durgin over 5 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #4

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #35962: luminous: choose_acting picked want > pool size added
Actions #5

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #35963: mimic: choose_acting picked want > pool size added
Actions #6

Updated by Nathan Cutler over 5 years ago

  • Status changed from Pending Backport to Resolved
Actions #7

Updated by Nathan Cutler over 4 years ago

  • Related to Bug #42577: acting_recovery_backfill won't catch all up peers added