2017-04-25 03:42:43.680650 7fd86f583700 10 osd.2 pg_epoch: 797 pg[1.18s1( v 794'3333 (77'382,794'3333] local-les=748 n=12 ec=508 les/c/f 748/748/0 797/797/747) [5,2,0] r=1 lpr=797 pi=747-796/2 crt=794'3333 lcod 794'3331 mlcod 0'0 peering]
For position 0: selecting up[i]: 5(0)
For position 1: selecting up[i]: 2(1)
For position 2: backfilling up[i]: 0(2) and failed to fill position 2
2017-04-25 03:42:43.680665 7fd86f583700 10 osd.2 pg_epoch: 797 pg[1.18s1( v 794'3333 (77'382,794'3333] local-les=748 n=12 ec=508 les/c/f 748/748/0 797/797/747) [5,2,0] r=1 lpr=797 pi=747-796/2 crt=794'3333 lcod 794'3331 mlcod 0'0 peering] choose_acting failed, below min size
osd.2 had removed pg 1.18s2 from osd.0 about a minute earlier, once the PG went active+clean on [5,2,3] and osd.0's copy became a stray. That is why osd.0 had no pg log to offer when osd.2 queried it:
2017-04-25 03:41:43.471424 7fd87fb20700 10 osd.2 pg_epoch: 748 pg[1.18s1( v 738'3249 (77'382,738'3249] local-les=748 n=10 ec=508 les/c/f 748/748/0 743/747/747) [5,2,3] r=1 lpr=747 crt=738'3249 lcod 735'3247 mlcod 0'0 active+clean+snaptrim_wait snaptrimq=[2b6~1]] purge_strays 0(2)
2017-04-25 03:41:43.471429 7fd87fb20700 10 osd.2 pg_epoch: 748 pg[1.18s1( v 738'3249 (77'382,738'3249] local-les=748 n=10 ec=508 les/c/f 748/748/0 743/747/747) [5,2,3] r=1 lpr=747 crt=738'3249 lcod 735'3247 mlcod 0'0 active+clean+snaptrim_wait snaptrimq=[2b6~1]] sending PGRemove to osd.0(2)
2017-04-25 03:41:43.471437 7fd87fb20700 1 -- 172.21.15.82:6809/664745 --> 172.21.15.82:6801/676928 -- osd pg remove(epoch 748; pg1.18s2; ) v3 -- 0x563a14dcda00 con 0
The min_size of the isa(k=2,m=1) pool is 3. At that moment:
- osd.5 is up and has shard.0,
- osd.2 is up and has shard.1,
- osd.0 should have shard.2, but it does not have enough pg log, so it is a backfill target.
The PG then refused to recover itself, because the number of OSDs we want in the acting set (2) is below min_size (3), even though isa(k=2,m=1) should be able to decode the data from shard.0 and shard.1 alone.
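
The decision described above can be modelled in a few lines. This is only a rough sketch under stated assumptions, not Ceph's actual choose_acting code: the `Shard` class, the `choose_acting` function and its return fields are hypothetical names made up for illustration. It just shows that the two usable shards satisfy k=2 (so the data is decodable) while still failing the min_size=3 check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Hypothetical stand-in for one entry of the up set."""
    osd: int
    position: int
    has_usable_log: bool  # False means the shard has to be backfilled


def choose_acting(up, k, min_size):
    """Simplified model of the check; NOT the real Ceph implementation."""
    # Shards with a usable pg log are wanted in the acting set,
    # the rest become backfill targets.
    acting = [s for s in up if s.has_usable_log]
    backfill = [s for s in up if not s.has_usable_log]
    return {
        "acting": [s.osd for s in acting],
        "backfill": [s.osd for s in backfill],
        # k shards are enough for the erasure code to decode the data.
        "decodable": len(acting) >= k,
        # The check that produced "choose_acting failed, below min size".
        "passes_min_size": len(acting) >= min_size,
    }


# State at epoch 797, taken from the log lines above.
up = [
    Shard(osd=5, position=0, has_usable_log=True),
    Shard(osd=2, position=1, has_usable_log=True),
    Shard(osd=0, position=2, has_usable_log=False),  # PG was purged from osd.0
]
result = choose_acting(up, k=2, min_size=3)
print(result)
```

In this model `decodable` is true while `passes_min_size` is false, which is exactly the mismatch reported here: peering gives up although enough shards exist to reconstruct the data.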