Bug #22561


PG stuck during recovery, requires OSD restart

Added by Paul Emmerich over 6 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are sometimes encountering issues with PGs getting stuck in recovery.

For example, we ran some stress tests with bluestore and EC on our test cluster, randomly swapping disks between servers and turning off servers under heavy load. This left the cluster in quite a bad state, and a few PGs got stuck in the recovery_wait state.

The affected PGs reported unfound objects and the recovery state looked like this:

    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2018-01-04 00:31:01.884650",
            "might_have_unfound": [
                {
                    "osd": "5(2)",
                    "status": "already probed" 
                },
                {
                    "osd": "5(5)",
                    "status": "already probed" 
                },
                {
                    "osd": "9(3)",
                    "status": "already probed" 
                },
                {
                    "osd": "23(2)",
                    "status": "already probed" 
                },
                {
                    "osd": "26(5)",
                    "status": "already probed" 
                },
                {
                    "osd": "28(5)",
                    "status": "already probed" 
                },
                {
                    "osd": "31(1)",
                    "status": "already probed" 
                },
                {
                    "osd": "32(2)",
                    "status": "querying" 
                },
                {
                    "osd": "41(4)",
                    "status": "already probed" 
                },
                {
                    "osd": "51(1)",
                    "status": "already probed" 
                },
                {
                    "osd": "59(2)",
                    "status": "already probed" 
                },
                {
                    "osd": "71(5)",
                    "status": "already probed" 
                }
            ],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "recovery_ops": [],
                    "read_ops": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "0",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.seed": 0,
                "scrubber.waiting_on": 0,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2018-01-04 00:30:16.663165" 
        }
    ]

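For reference, a minimal sketch of how this kind of state can be gathered; the PG id 2.1a below is just a placeholder (the full output is attached as ceph pg query.json):

    # list PGs that report unfound objects
    ceph health detail | grep unfound
    # dump the recovery_state section shown above for one of them
    ceph pg 2.1a query | jq '.recovery_state'
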
Recovery never started, and the PG apparently never tried to contact OSD 32 (which had the missing objects; note its "querying" status above). We couldn't find anything useful in the logs, but unfortunately we didn't save them. :/

The only way to get recovery started was to restart OSD 32, which held the unfound objects. The PG recovered fine after that.
This affected ~20 PGs out of ~5000.
On one of the PGs we waited for more than 30 minutes with the cluster completely idle, and the state never changed.

We had to restart the OSDs it was trying to contact, not the primary of the PG.
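For what it's worth, a sketch of the workaround we ended up using, assuming a systemd deployment; the PG id and OSD id are placeholders:

    # find the peer the stuck PG is still "querying" (OSD 32 in the dump above)
    ceph pg 2.1a query | jq '.recovery_state[0].might_have_unfound[] | select(.status == "querying")'
    # restart that OSD (not the PG's primary); recovery then proceeds normally
    systemctl restart ceph-osd@32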

We have also seen this happen in a production cluster, but during backfill and with no unfound objects. That occurrence followed a larger network outage.

ceph 12.2.2


Files

ceph pg query.json (62.8 KB), Paul Emmerich, 01/04/2018 01:06 AM
#1

Updated by Patrick Donnelly over 6 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
#2

Updated by Josh Durgin over 6 years ago

Was OSD 32 running at the time? This sounds like correct behavior if OSD 32 was not reachable. It might have been automatically marked down for some reason and stopped itself.

If you want to reproduce, logs with debug ms = 1 and debug osd = 20 would let us figure out what happened.
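A minimal sketch of how those levels can be raised, either on the fly for a running OSD (osd.32 here is just a placeholder) or persistently via ceph.conf:

    # raise log levels on the running daemon without a restart
    ceph tell osd.32 injectargs '--debug-ms 1 --debug-osd 20'
    # or persistently, in the [osd] section of ceph.conf:
    #   debug ms = 1
    #   debug osd = 20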

#3

Updated by Paul Emmerich over 6 years ago

OSD 32 was running and actively serving client IO.

#4

Updated by Paul Emmerich over 5 years ago

I've just encountered this again, with about 20 OSDs being unresponsive like this. Restarting the OSDs in that state helped. As before, the OSDs being queried were up and running, with nothing odd going on.

The 20 OSDs were spread across 3 different servers in a cluster that runs a mix of 12.2.1 and 12.2.0. All affected OSDs were on 12.2.1 and had an uptime of almost exactly 1 year (364 days).
No 12.2.0 OSDs with the same uptime were affected, and no 12.2.1 OSDs with a shorter uptime were affected.
