
Bug #20844

peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml

Added by Sage Weil over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Immediate
% Done: 0%
Regression: No
Severity: 3 - minor

Description

2017-07-28T05:53:34.334 INFO:tasks.thrashosds.thrasher:Waiting for clean again

but the cluster status shows:
    pgs:     1.852% pgs not active
             52 active+clean
             1  incomplete
             1  active+clean+scrubbing

The pg query shows:
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/Incomplete",
            "enter_time": "2017-07-28 05:53:34.326369",
            "comment": "not enough complete instances of this PG" 
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2017-07-28 05:53:34.325201",
            "past_intervals": [
                {
                    "first": "672",
                    "last": "763",
                    "all_participants": [
                        {
                            "osd": 0,
                            "shard": 2
                        },
                        {
                            "osd": 1,
                            "shard": 0
                        },
                        {
                            "osd": 3,
                            "shard": 1
                        }
                    ],
                    "intervals": [
                        {
                            "first": "672",
                            "last": "761",
                            "acting": "0(2),1(0),3(1)" 
                        }
                    ]
                }
            ],
            "probing_osds": [
                "0(2)",
                "3(1)",
                "4(0)" 
            ],
            "down_osds_we_would_probe": [
                1
            ],
            "peering_blocked_by": [],
            "peering_blocked_by_detail": [
                {
                    "detail": "peering_blocked_by_history_les_bound" 
                }
            ]
        },
        {
            "name": "Started",
            "enter_time": "2017-07-28 05:53:34.325176" 
        }
    ],

/a/sage-2017-07-28_04:13:20-rados-wip-sage-testing-distro-basic-smithi/1455398

History

#1 Updated by Sage Weil over 6 years ago

/a/sage-2017-08-01_15:32:10-rados-wip-sage-testing-distro-basic-smithi/1469176

rados/thrash-erasure-code/{ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-require-luminous/at-end.yaml fast/fast.yaml leveldb.yaml msgr-failures/osd-delay.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/mapgap.yaml thrashosds-health.yaml workloads/ec-rados-plugin=jerasure-k=3-m=1.yaml}

#2 Updated by Sage Weil over 6 years ago

root@smithi200:~# ceph tell 2.1b query 
{
    "state": "incomplete",
    "snap_trimq": "[]",
    "epoch": 3426,
    "up": [
        2,
        3,
        0,
        5
    ],
    "acting": [
        2,
        3,
        0,
        5
    ],
...
            "probing_osds": [
                "0(2)",
                "2(0)",
                "3(1)",
                "5(3)" 
            ],
            "down_osds_we_would_probe": [
                1
            ],
            "peering_blocked_by": [],
            "peering_blocked_by_detail": [
                {
                    "detail": "peering_blocked_by_history_les_bound" 
                }
            ]
        },
        {
            "name": "Started",
            "enter_time": "2017-08-01 16:36:39.822172" 
        }
    ],
    "agent_state": {}
}

#3 Updated by Sage Weil over 6 years ago

/a/sage-2017-08-02_01:58:49-rados-wip-sage-testing-distro-basic-smithi/1470073

pg 2.d on [5,1,4]

#4 Updated by Sage Weil over 6 years ago

  • Priority changed from Urgent to Immediate

#5 Updated by Sage Weil over 6 years ago

This appears to be a test problem:

- the thrashosds task has 'chance_test_map_discontinuity: 0.5', which will mark an osd down, wait for things to go clean, and then bring it back up.
- the workload creates an ec pool with the teuthologyprofile profile, which is apparently k=2 m=1. That means min_size = k+1 = 3, so with any osd down we usually cannot go clean.

I'm not sure why we're seeing this now and weren't before. It seems like the fix is to use something like k=2 m=2, though?
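The availability arithmetic behind the second bullet can be sketched as follows. This is a minimal illustration, not Ceph code; it only models the shard counts for a single PG, assuming the default EC min_size of k+1 described in the note above:

```python
# Sketch of why an EC pool with k=2 m=1 cannot go active+clean while
# one of its OSDs is down: size = k + m shards per PG, and the PG
# needs at least min_size = k + 1 shards up to go active.

def can_go_clean(k: int, m: int, osds_down: int) -> bool:
    size = k + m                 # total shards per PG
    min_size = k + 1             # default EC min_size in Ceph
    shards_up = size - osds_down # each down OSD takes one shard with it
    return shards_up >= min_size

# k=2 m=1: one OSD down leaves 2 shards, below min_size=3 -> blocked.
print(can_go_clean(2, 1, 1))     # False
# k=2 m=2: one OSD down leaves 3 shards, meets min_size=3 -> can go clean.
print(can_go_clean(2, 2, 1))     # True
```

With k=2 m=2 the pool tolerates the single down OSD that chance_test_map_discontinuity induces, which matches the suggested fix.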

#6 Updated by Sage Weil over 6 years ago

  • Status changed from 12 to Fix Under Review

#7 Updated by Sage Weil over 6 years ago

  • Status changed from Fix Under Review to Resolved
