Bug #22561

open

PG stuck during recovery, requires OSD restart

Added by Paul Emmerich over 6 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We sometimes see PGs getting stuck during recovery.

For example, we ran stress tests with BlueStore and erasure coding (EC) on our test cluster, randomly swapping disks between servers and turning off servers under heavy load. This left the cluster in quite a bad state, and a few PGs got stuck in the recovery_wait state.

The affected PGs reported unfound objects and the recovery state looked like this:

    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2018-01-04 00:31:01.884650",
            "might_have_unfound": [
                {
                    "osd": "5(2)",
                    "status": "already probed" 
                },
                {
                    "osd": "5(5)",
                    "status": "already probed" 
                },
                {
                    "osd": "9(3)",
                    "status": "already probed" 
                },
                {
                    "osd": "23(2)",
                    "status": "already probed" 
                },
                {
                    "osd": "26(5)",
                    "status": "already probed" 
                },
                {
                    "osd": "28(5)",
                    "status": "already probed" 
                },
                {
                    "osd": "31(1)",
                    "status": "already probed" 
                },
                {
                    "osd": "32(2)",
                    "status": "querying" 
                },
                {
                    "osd": "41(4)",
                    "status": "already probed" 
                },
                {
                    "osd": "51(1)",
                    "status": "already probed" 
                },
                {
                    "osd": "59(2)",
                    "status": "already probed" 
                },
                {
                    "osd": "71(5)",
                    "status": "already probed" 
                }
            ],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "recovery_ops": [],
                    "read_ops": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "0",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.seed": 0,
                "scrubber.waiting_on": 0,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2018-01-04 00:30:16.663165" 
        }
    ]

The PG never started recovery, and it apparently never tried to contact OSD 32 (which had the missing objects and is the only peer still listed as "querying" above). We couldn't find anything useful in the logs, and unfortunately we didn't save them.
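To spot this condition across many PGs, the `might_have_unfound` list can be scanned for peers that remain in the "querying" state. The following is only a minimal sketch: it assumes the JSON printed by `ceph pg <pgid> query` has been loaded into a dict with the layout shown above, and the helper names `parse_peer` and `querying_peers` are hypothetical, not part of any Ceph tooling.

```python
import json
import re

# Matches peer strings such as "32(2)": OSD id, then the EC shard in parentheses.
# A bare "32" (no shard) is accepted as well; that case is an assumption.
_PEER_RE = re.compile(r"^(\d+)(?:\((\d+)\))?$")


def parse_peer(peer):
    """Split an "osd" field like "32(2)" into (osd_id, shard); shard may be None."""
    match = _PEER_RE.match(peer.strip())
    if match is None:
        raise ValueError(f"unrecognized peer string: {peer!r}")
    osd_id, shard = match.groups()
    return int(osd_id), (int(shard) if shard is not None else None)


def querying_peers(pg_query):
    """Return (osd_id, shard) for each might_have_unfound entry still "querying"."""
    stuck = []
    for state in pg_query.get("recovery_state", []):
        for peer in state.get("might_have_unfound", []):
            if peer.get("status", "").strip() == "querying":
                stuck.append(parse_peer(peer["osd"]))
    return stuck


# Abbreviated sample modeled on the recovery_state excerpt in this report.
SAMPLE = json.loads("""
{
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "might_have_unfound": [
        {"osd": "31(1)", "status": "already probed"},
        {"osd": "32(2)", "status": "querying"},
        {"osd": "41(4)", "status": "already probed"}
      ]
    },
    {"name": "Started"}
  ]
}
""")

print(querying_peers(SAMPLE))  # [(32, 2)]
```

Applied to the full query output attached to this report, a check like this would be expected to single out OSD 32, the one we ended up restarting. It only reads the reported state; it does not explain why the peer stays in "querying".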

The only way to get recovery started was to restart OSD 32, which had the unfound objects. The PG recovered fine after that.
This affected ~20 PGs out of ~5000.
For one of the PGs we waited more than 30 minutes, and its state did not change even with the cluster completely idle.

We had to restart the OSDs that the PG was trying to contact, not the PG's primary.

We have also seen this happen in a production cluster, but during backfill and with no unfound objects. That incident followed a larger network outage.

Ceph version: 12.2.2


Files

ceph pg query.json (62.8 KB) Paul Emmerich, 01/04/2018 01:06 AM