Bug #35924

choose_acting picked want > pool size

Added by Sage Weil 3 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
Start date:
09/11/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

2018-09-10 21:28:37.713 7f3d9523e700  5 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] enter Started/Primary/Peering/GetLog
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_acting all_info osd.0 4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_acting all_info osd.1 4.c( v 154'691 lc 123'110 (0'0,154'691] local-lis/les=144/145 n=691 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_acting all_info osd.4 4.c( v 159'745 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] find_best_info prefer osd.4 because it is complete while best has missing
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] calc_replicated_acting newest update on osd.4 with 4.c( v 159'745 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
calc_replicated_acting primary is osd.4 with 4.c( v 159'745 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
 osd.0 (up) accepted 4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
 osd.1 (up) accepted 4.c( v 154'691 lc 123'110 (0'0,154'691] local-lis/les=144/145 n=691 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161)
2018-09-10 21:28:37.713 7f3d9523e700 20 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_async_recovery_replicated candidates by cost are: 
2018-09-10 21:28:37.713 7f3d9523e700 20 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_async_recovery_replicated result want=[4,0,1] async_recovery=
2018-09-10 21:28:37.713 7f3d9523e700 10 osd.0 pg_epoch: 161 pg[4.c( v 159'745 lc 146'579 (0'0,159'745] local-lis/les=156/157 n=745 ec=117/117 lis/c 156/117 les/c/f 157/118/0 160/161/161) [0,1] r=0 lpr=161 pi=[117,161)/2 crt=159'745 lcod 146'578 mlcod 0'0 peering m=112 mbc={}] choose_acting want [4,0,1] != acting [0,1], requesting pg_temp change

But in that epoch the pool is replicated with size 2, so a want of three OSDs is too large:
pool 4 'unique_pool_2' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 118 flags hashpspool,creating stripe_width 0 application rados

/a/sage-2018-09-10_17:11:45-rados-wip-sage-testing-2018-09-10-0917-distro-basic-smithi/3002911

This leaves the PG stuck, because the mon now rejects pg_temp mappings that are larger than the pool size.
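
The mismatch can be checked mechanically from the two lines quoted above. The following is a minimal sketch only: the helper names (`parse_want_acting`, `parse_pool_size`, `want_exceeds_pool_size`) are hypothetical and not part of Ceph, and the regular expressions assume the log and pool-dump formats look exactly like the excerpts in this report.

```python
import re

# Hypothetical helpers for illustration; none of these names exist in Ceph.
# The patterns assume the exact line shapes quoted in this bug report.
WANT_RE = re.compile(r"choose_acting want \[([\d,]*)\] != acting \[([\d,]*)\]")
POOL_RE = re.compile(r"^pool (\d+) '([^']*)' replicated size (\d+) min_size (\d+)")


def parse_osd_list(text):
    """Turn '4,0,1' into [4, 0, 1]; an empty string gives []."""
    return [int(item) for item in text.split(",") if item]


def parse_want_acting(log_line):
    """Return (want, acting) from a choose_acting log line, or None."""
    match = WANT_RE.search(log_line)
    if match is None:
        return None
    return parse_osd_list(match.group(1)), parse_osd_list(match.group(2))


def parse_pool_size(pool_line):
    """Return the replicated size from a pool dump line, or None."""
    match = POOL_RE.search(pool_line)
    if match is None:
        return None
    return int(match.group(3))


def want_exceeds_pool_size(want, pool_size):
    """True when the requested set has more OSDs than the pool size."""
    return len(want) > pool_size


# Tail of the last log line and the start of the pool line from this report.
log_tail = "choose_acting want [4,0,1] != acting [0,1], requesting pg_temp change"
pool_line = "pool 4 'unique_pool_2' replicated size 2 min_size 2 crush_rule 0"

want, acting = parse_want_acting(log_tail)
size = parse_pool_size(pool_line)
print(want, acting, size, want_exceeds_pool_size(want, size))
```

For the lines in this report, want has three OSDs while the pool size is 2, so the check reports the condition that the mon rejects. The sketch only detects the mismatch; it says nothing about how choose_acting should pick a smaller set.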


Related issues

Copied to RADOS - Backport #35962: luminous: choose_acting picked want > pool size Resolved
Copied to RADOS - Backport #35963: mimic: choose_acting picked want > pool size Resolved

History

#1 Updated by Sage Weil 3 months ago

  • Status changed from Verified to Need Review
  • Priority changed from Immediate to Urgent

#2 Updated by Sage Weil 3 months ago

  • Backport set to mimic,luminous

#3 Updated by Josh Durgin 3 months ago

  • Status changed from Need Review to Pending Backport

#4 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #35962: luminous: choose_acting picked want > pool size added

#5 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #35963: mimic: choose_acting picked want > pool size added

#6 Updated by Nathan Cutler 2 months ago

  • Status changed from Pending Backport to Resolved
