Bug #20844
peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
Description
2017-07-28T05:53:34.334 INFO:tasks.thrashosds.thrasher:Waiting for clean again
but
pgs:     1.852% pgs not active
         52 active+clean
         1  incomplete
         1  active+clean+scrubbing
query is
"recovery_state": [ { "name": "Started/Primary/Peering/Incomplete", "enter_time": "2017-07-28 05:53:34.326369", "comment": "not enough complete instances of this PG" }, { "name": "Started/Primary/Peering", "enter_time": "2017-07-28 05:53:34.325201", "past_intervals": [ { "first": "672", "last": "763", "all_participants": [ { "osd": 0, "shard": 2 }, { "osd": 1, "shard": 0 }, { "osd": 3, "shard": 1 } ], "intervals": [ { "first": "672", "last": "761", "acting": "0(2),1(0),3(1)" } ] } ], "probing_osds": [ "0(2)", "3(1)", "4(0)" ], "down_osds_we_would_probe": [ 1 ], "peering_blocked_by": [], "peering_blocked_by_detail": [ { "detail": "peering_blocked_by_history_les_bound" } ] }, { "name": "Started", "enter_time": "2017-07-28 05:53:34.325176" } ],
/a/sage-2017-07-28_04:13:20-rados-wip-sage-testing-distro-basic-smithi/1455398
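Not part of the original report, but a minimal sketch of how the blocking reason in output like the above can be pulled out programmatically. The JSON fragment is an abbreviated stand-in shaped like `ceph pg <pgid> query` output, and `blocking_details` is a hypothetical helper, not a Ceph API:

```python
import json

# Abbreviated stand-in for the pg-query output pasted above.
pg_query = json.loads("""
{
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/Incomplete",
            "comment": "not enough complete instances of this PG"
        },
        {
            "name": "Started/Primary/Peering",
            "down_osds_we_would_probe": [1],
            "peering_blocked_by_detail": [
                {"detail": "peering_blocked_by_history_les_bound"}
            ]
        }
    ]
}
""")

def blocking_details(query):
    """Collect peering_blocked_by_detail strings from every recovery state."""
    return [
        d["detail"]
        for state in query.get("recovery_state", [])
        for d in state.get("peering_blocked_by_detail", [])
    ]

print(blocking_details(pg_query))
# ['peering_blocked_by_history_les_bound']
```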
History
#1 Updated by Sage Weil over 6 years ago
/a/sage-2017-08-01_15:32:10-rados-wip-sage-testing-distro-basic-smithi/1469176
rados/thrash-erasure-code/{ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-end.yaml fast/fast.yaml leveldb.yaml msgr-failures/osd-delay.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/mapgap.yaml thrashosds-health.yaml workloads/ec-rados-plugin=jerasure-k=3-m=1.yaml}
#2 Updated by Sage Weil over 6 years ago
root@smithi200:~# ceph tell 2.1b query
{
    "state": "incomplete",
    "snap_trimq": "[]",
    "epoch": 3426,
    "up": [ 2, 3, 0, 5 ],
    "acting": [ 2, 3, 0, 5 ],
    ...
    "probing_osds": [ "0(2)", "2(0)", "3(1)", "5(3)" ],
    "down_osds_we_would_probe": [ 1 ],
    "peering_blocked_by": [],
    "peering_blocked_by_detail": [
        { "detail": "peering_blocked_by_history_les_bound" }
    ]
},
{ "name": "Started", "enter_time": "2017-08-01 16:36:39.822172" } ],
"agent_state": {}
}
#3 Updated by Sage Weil over 6 years ago
/a/sage-2017-08-02_01:58:49-rados-wip-sage-testing-distro-basic-smithi/1470073
pg 2.d on [5,1,4]
#4 Updated by Sage Weil over 6 years ago
- Priority changed from Urgent to Immediate
#5 Updated by Sage Weil over 6 years ago
This appears to be a test problem:
- the thrashosds has 'chance_test_map_discontinuity: 0.5', which will mark an osd down, wait for things to go clean, and then bring it back up.
- the workload creates an ec pool with the teuthologyprofile profile, which is apparently k=2 m=1. That means min_size=2+1=3, and any osd down will usually prevent us from going clean.
I'm not sure why we're seeing this now when we weren't before. It seems like the fix is to use something like k=2 m=2, though?
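The arithmetic behind the k=2 m=2 suggestion can be sketched as below. This is illustrative only, assuming the min_size=k+1 rule stated in the comment above; `ec_pool_availability` is a hypothetical helper, not part of Ceph:

```python
def ec_pool_availability(k, m, min_size=None):
    """Return (size, min_size, number of OSDs that may be down
    while the PG can still go active)."""
    size = k + m            # total shards per object
    if min_size is None:
        min_size = k + 1    # assumed rule, per the comment above
    return size, min_size, size - min_size

# k=2 m=1, as the teuthologyprofile profile apparently uses:
print(ec_pool_availability(2, 1))   # (3, 3, 0) -> no OSD may be down
# k=2 m=2, the suggested fix:
print(ec_pool_availability(2, 2))   # (4, 3, 1) -> one OSD may be down
```

With k=2 m=1 every shard is needed to satisfy min_size, so marking any one osd down (as chance_test_map_discontinuity does) blocks going clean; with k=2 m=2 one down osd is tolerated.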
#6 Updated by Sage Weil over 6 years ago
- Status changed from 12 to Fix Under Review
#7 Updated by Sage Weil over 6 years ago
- Status changed from Fix Under Review to Resolved