Bug #20844

peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml

Added by Sage Weil 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Target version: -
Start date: 07/28/2017
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No
Component(RADOS):

Description

2017-07-28T05:53:34.334 INFO:tasks.thrashosds.thrasher:Waiting for clean again

but
    pgs:     1.852% pgs not active
             52 active+clean
             1  incomplete
             1  active+clean+scrubbing

query is
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/Incomplete",
            "enter_time": "2017-07-28 05:53:34.326369",
            "comment": "not enough complete instances of this PG" 
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2017-07-28 05:53:34.325201",
            "past_intervals": [
                {
                    "first": "672",
                    "last": "763",
                    "all_participants": [
                        {
                            "osd": 0,
                            "shard": 2
                        },
                        {
                            "osd": 1,
                            "shard": 0
                        },
                        {
                            "osd": 3,
                            "shard": 1
                        }
                    ],
                    "intervals": [
                        {
                            "first": "672",
                            "last": "761",
                            "acting": "0(2),1(0),3(1)" 
                        }
                    ]
                }
            ],
            "probing_osds": [
                "0(2)",
                "3(1)",
                "4(0)" 
            ],
            "down_osds_we_would_probe": [
                1
            ],
            "peering_blocked_by": [],
            "peering_blocked_by_detail": [
                {
                    "detail": "peering_blocked_by_history_les_bound" 
                }
            ]
        },
        {
            "name": "Started",
            "enter_time": "2017-07-28 05:53:34.325176" 
        }
    ],

/a/sage-2017-07-28_04:13:20-rados-wip-sage-testing-distro-basic-smithi/1455398
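
The blocked state above can be pulled straight out of the pg query output. A minimal sketch, assuming jq is available on the node; the pg id below is a placeholder, not the one from this run:

    # list pgs stuck inactive (the incomplete pg shows up here)
    ceph pg dump_stuck inactive

    # query a pg and pull out the peering block reason from recovery_state
    ceph pg 1.0 query | \
        jq '.recovery_state[] | select(.peering_blocked_by_detail) | .peering_blocked_by_detail'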

History

#1 Updated by Sage Weil 3 months ago

/a/sage-2017-08-01_15:32:10-rados-wip-sage-testing-distro-basic-smithi/1469176

rados/thrash-erasure-code/{ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-end.yaml fast/fast.yaml leveldb.yaml msgr-failures/osd-delay.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/mapgap.yaml thrashosds-health.yaml workloads/ec-rados-plugin=jerasure-k=3-m=1.yaml}

#2 Updated by Sage Weil 3 months ago

root@smithi200:~# ceph tell 2.1b query 
{
    "state": "incomplete",
    "snap_trimq": "[]",
    "epoch": 3426,
    "up": [
        2,
        3,
        0,
        5
    ],
    "acting": [
        2,
        3,
        0,
        5
    ],
...
            "probing_osds": [
                "0(2)",
                "2(0)",
                "3(1)",
                "5(3)" 
            ],
            "down_osds_we_would_probe": [
                1
            ],
            "peering_blocked_by": [],
            "peering_blocked_by_detail": [
                {
                    "detail": "peering_blocked_by_history_les_bound" 
                }
            ]
        },
        {
            "name": "Started",
            "enter_time": "2017-08-01 16:36:39.822172" 
        }
    ],
    "agent_state": {}
}
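
For cross-checking, the osd listed in down_osds_we_would_probe can be compared against the cluster's view of osd state; a small sketch using standard commands (not output from this run):

    # up/acting mapping for the pg in question
    ceph pg map 2.1b

    # check which osds are marked down (osd.1 is the one the pg would probe)
    ceph osd tree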

#3 Updated by Sage Weil 3 months ago

/a/sage-2017-08-02_01:58:49-rados-wip-sage-testing-distro-basic-smithi/1470073

pg 2.d on [5,1,4]

#4 Updated by Sage Weil 3 months ago

  • Priority changed from Urgent to Immediate

#5 Updated by Sage Weil 3 months ago

This appears to be a test problem:

- the thrashosds task has 'chance_test_map_discontinuity: 0.5', which will mark an osd down, wait for things to go clean, and then bring it back up.
- the workload creates an ec pool with the teuthologyprofile profile, which is apparently k=2 m=1. That means min_size=2+1=3, and any osd being down will usually prevent us from going clean.

I'm not sure why we're seeing this now when we weren't before. It seems like the fix is to do something like k=2 m=2, though?
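
A rough sketch of what a 2+2 profile would look like; the profile and pool names here are placeholders, not the actual teuthology change:

    # with k=2 m=2, min_size = k+1 = 3 of 4 shards, so the pg stays active
    # with one osd down and can remap/backfill to go clean; failure domain
    # osd so all 4 shards fit on a small test cluster
    ceph osd erasure-code-profile set testprofile k=2 m=2 crush-failure-domain=osd

    # create an ec pool using that profile
    ceph osd pool create ecpool 16 16 erasure testprofile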

#6 Updated by Sage Weil 3 months ago

  • Status changed from Verified to Need Review

#7 Updated by Sage Weil 3 months ago

  • Status changed from Need Review to Resolved
