Bug #19770
closed
qa, mon: min_size = size for ec pools
Description
this test
description: rados/thrash-erasure-code-isa/{arch/x86_64.yaml clusters/{fixed-2.yaml openstack.yaml} leveldb.yaml msgr-failures/fastclose.yaml objectstore/filestore-btrfs.yaml rados.yaml supported/ubuntu_latest.yaml thrashers/mapgap.yaml workloads/ec-rados-plugin=isa-k=2-m=1.yaml}
sets size = min_size = 3 and then thrashes with
- thrashosds:
    chance_pgnum_grow: 0.25
    chance_pgpnum_fix: 0.25
    chance_test_map_discontinuity: 2
    timeout: 1800
which tries to go clean with one osd down.
/a/sage-2017-04-25_02:25:56-rados-wip-sage-testing---basic-smithi/1065942
Updated by Kefu Chai almost 7 years ago
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
1.9 9 0 0 0 0 10199040 1062 1062 active+clean 2017-04-25 04:12:25.754999 820'1062 1407:1572 [5,0,2] 5 [5,0,2] 5 820'1062 2017-04-25 04:12:25.754861 820'1062 2017-04-25 03:57:07.484966
1.18 12 0 0 0 0 15982592 2951 2951 incomplete 2017-04-25 03:42:43.680738 794'3333 1407:3118 [5,2,0] 2 [5,2,0] 2 738'3249 2017-04-25 03:41:43.618851 133'657 2017-04-25 03:29:13.363153
1.8 47 0 0 0 0 64380928 3088 3088 incomplete 2017-04-25 03:42:43.680380 794'4362 1407:3994 [5,2,0] 2 [5,2,0] 2 729'3940 2017-04-25 03:41:21.639397 133'657 2017-04-25 03:29:13.363153
the two incomplete PGs each had three OSDs in their acting sets, and the last chosen thrash action was
2017-04-25T03:41:49.663 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
so at least 3 OSDs were in. i cannot read the osd log (the "remote" dir is missing under /a/sage-2017-04-25_02:25:56-rados-wip-sage-testing---basic-smithi/1065942), so i am not able to tell why these two PGs were incomplete.
Updated by Sage Weil almost 7 years ago
the logs are in the base test directory (i copied them manually)
Updated by Kefu Chai almost 7 years ago
2017-04-25 03:42:43.680650 7fd86f583700 10 osd.2 pg_epoch: 797 pg[1.18s1( v 794'3333 (77'382,794'3333] local-les=748 n=12 ec=508 les/c/f 748/748/0 797/797/747) [5,2,0] r=1 lpr=797 pi=747-796/2 crt=794'3333 lcod 794'3331 mlcod 0'0 peering]
For position 0: selecting up[i]: 5(0)
For position 1: selecting up[i]: 2(1)
For position 2: backfilling up[i]: 0(2) and failed to fill position 2
2017-04-25 03:42:43.680665 7fd86f583700 10 osd.2 pg_epoch: 797 pg[1.18s1( v 794'3333 (77'382,794'3333] local-les=748 n=12 ec=508 les/c/f 748/748/0 797/797/747) [5,2,0] r=1 lpr=797 pi=747-796/2 crt=794'3333 lcod 794'3331 mlcod 0'0 peering] choose_acting failed, below min size
osd.2 removed pg1.18s2 from osd.0 after it completed recovery. that's why osd.0 was not able to offer its pg log when osd.2 queried it.
2017-04-25 03:41:43.471424 7fd87fb20700 10 osd.2 pg_epoch: 748 pg[1.18s1( v 738'3249 (77'382,738'3249] local-les=748 n=10 ec=508 les/c/f 748/748/0 743/747/747) [5,2,3] r=1 lpr=747 crt=738'3249 lcod 735'3247 mlcod 0'0 active+clean+snaptrim_wait snaptrimq=[2b6~1]] purge_strays 0(2)
2017-04-25 03:41:43.471429 7fd87fb20700 10 osd.2 pg_epoch: 748 pg[1.18s1( v 738'3249 (77'382,738'3249] local-les=748 n=10 ec=508 les/c/f 748/748/0 743/747/747) [5,2,3] r=1 lpr=747 crt=738'3249 lcod 735'3247 mlcod 0'0 active+clean+snaptrim_wait snaptrimq=[2b6~1]] sending PGRemove to osd.0(2)
2017-04-25 03:41:43.471437 7fd87fb20700 1 -- 172.21.15.82:6809/664745 --> 172.21.15.82:6801/676928 -- osd pg remove(epoch 748; pg1.18s2; ) v3 -- 0x563a14dcda00 con 0
the min_size of isa(k=2,m=1) is 3. at that moment,
- osd.5 is up and has shard.0,
- osd.2 is up and has shard.1
- osd.0 should have shard.2, but it does not have enough pg log, so it's a backfill target
then the pg refused to recover itself, because the number of acting osds that we want (2) is below min_size, even though isa(k=2,m=1) should be able to decode the data from shard.0 and shard.1 alone.
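the mismatch between the two thresholds can be sketched as below. this is only an illustration of the arithmetic, not the OSD peering code: the function names `can_decode` and `passes_min_size` are hypothetical, and the sketch assumes an MDS code where any k shards are enough to decode.

```python
def can_decode(available_shards: int, k: int) -> bool:
    """Assumes an MDS code: any k of the k+m shards are enough to decode."""
    return available_shards >= k


def passes_min_size(available_shards: int, min_size: int) -> bool:
    """Simplified stand-in for the 'below min size' check seen in the log."""
    return available_shards >= min_size


k, m = 2, 1
min_size = k + m   # the test set size = min_size = 3
available = 2      # shard.0 on osd.5 and shard.1 on osd.2; osd.0 needs backfill

print(can_decode(available, k))              # True
print(passes_min_size(available, min_size))  # False
```

the data is decodable from the two surviving shards, but the min_size check still blocks the pg.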
Updated by Kefu Chai almost 7 years ago
if an erasure pool has "min_size = size = 3", it cannot survive any down OSD serving any of its PGs. so we should set its min_size to 2, as long as 2 is greater than or equal to its k, which is 2 in this case.
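a minimal sketch of the reasoning above, assuming the simplified view that a PG can lose at most size - min_size shards and still go active. the helper `tolerated_down_osds` is hypothetical and is not part of ceph.

```python
def tolerated_down_osds(size: int, min_size: int, k: int) -> int:
    """How many shards may be missing before the PG drops below min_size.

    min_size must stay at or above k, otherwise the data cannot be decoded.
    """
    if not k <= min_size <= size:
        raise ValueError("expected k <= min_size <= size")
    return size - min_size


k, m = 2, 1
size = k + m

print(tolerated_down_osds(size, 3, k))  # 0
print(tolerated_down_osds(size, 2, k))  # 1
```

with min_size = size = 3 there is no room for a down OSD, while min_size = 2 allows one and still satisfies min_size >= k.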
Updated by Kefu Chai almost 7 years ago
- Status changed from New to Fix Under Review
- Assignee set to Kefu Chai
Updated by Kefu Chai almost 7 years ago
- Status changed from Fix Under Review to Resolved