Bug #22561
PG stuck during recovery, requires OSD restart
Status: Open · % Done: 0%
Description
We are sometimes encountering issues with PGs getting stuck in recovery.
For example, we ran stress tests with BlueStore and EC pools on our test cluster: we randomly swapped disks between servers and powered servers off under heavy load. This left the cluster in a fairly bad state, and a few PGs got stuck in the recovery_wait state.
The affected PGs reported unfound objects and the recovery state looked like this:
"recovery_state": [
    {
        "name": "Started/Primary/Active",
        "enter_time": "2018-01-04 00:31:01.884650",
        "might_have_unfound": [
            { "osd": "5(2)",  "status": "already probed" },
            { "osd": "5(5)",  "status": "already probed" },
            { "osd": "9(3)",  "status": "already probed" },
            { "osd": "23(2)", "status": "already probed" },
            { "osd": "26(5)", "status": "already probed" },
            { "osd": "28(5)", "status": "already probed" },
            { "osd": "31(1)", "status": "already probed" },
            { "osd": "32(2)", "status": "querying" },
            { "osd": "41(4)", "status": "already probed" },
            { "osd": "51(1)", "status": "already probed" },
            { "osd": "59(2)", "status": "already probed" },
            { "osd": "71(5)", "status": "already probed" }
        ],
        "recovery_progress": {
            "backfill_targets": [],
            "waiting_on_backfill": [],
            "last_backfill_started": "MIN",
            "backfill_info": {
                "begin": "MIN",
                "end": "MIN",
                "objects": []
            },
            "peer_backfill_info": [],
            "backfills_in_flight": [],
            "recovering": [],
            "pg_backend": {
                "recovery_ops": [],
                "read_ops": []
            }
        },
        "scrub": {
            "scrubber.epoch_start": "0",
            "scrubber.active": false,
            "scrubber.state": "INACTIVE",
            "scrubber.start": "MIN",
            "scrubber.end": "MIN",
            "scrubber.subset_last_update": "0'0",
            "scrubber.deep": false,
            "scrubber.seed": 0,
            "scrubber.waiting_on": 0,
            "scrubber.waiting_on_whom": []
        }
    },
    {
        "name": "Started",
        "enter_time": "2018-01-04 00:30:16.663165"
    }
]
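The stuck peer is visible directly in this output: every entry in might_have_unfound is "already probed" except OSD 32, which stays in "querying" indefinitely. A minimal sketch of how one might pull the stuck peers out of a `ceph pg <pgid> query` dump (the field names come from the output above; the sample data and the helper function here are abbreviated/illustrative, not a Ceph API):

```python
import json

# Abbreviated sample of `ceph pg <pgid> query` output, matching the dump above.
pg_query = json.loads("""
{
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "might_have_unfound": [
        {"osd": "5(2)",  "status": "already probed"},
        {"osd": "32(2)", "status": "querying"},
        {"osd": "41(4)", "status": "already probed"}
      ]
    },
    {"name": "Started"}
  ]
}
""")

def stuck_peers(query):
    """Yield the OSD shards the PG's primary is still querying for unfound objects."""
    for state in query.get("recovery_state", []):
        for peer in state.get("might_have_unfound", []):
            if peer["status"] == "querying":
                yield peer["osd"]

# Shard notation "32(2)" means OSD 32, EC shard 2.
print(list(stuck_peers(pg_query)))  # → ['32(2)']
```

Restarting the OSD listed here (ceph-osd on OSD 32) is what unstuck recovery in our case.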
Recovery never started, and the primary apparently never actually contacted OSD 32 (which had the missing objects). We couldn't find anything useful in the logs, but unfortunately we didn't save them.
The only way to get recovery started was to restart OSD 32, which held the unfound objects. The PG recovered fine after that.
This affected ~20 PGs out of ~5000.
We waited more than 30 minutes on one of the PGs with the cluster completely idle, and the state never changed.
Note that we had to restart the OSDs the primary was trying to contact, not the primary of the PG itself.
We have also seen this happen in a production cluster after a larger network outage, though there it occurred during backfill and with no unfound objects.
ceph 12.2.2