Support #8310

Most pgs stuck stale, no osds reporting them, repair ineffective

Added by Jeff Bachtel about 6 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Urgent
Assignee: -
Category: OSD
Target version:
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

After trying to resolve an issue with pgs stuck in cleaning, I restarted osds, and most of the pgs in the cluster now report stale, with no osds active. ceph pg query hangs; ceph pg map output is as follows:

  # ceph pg map 11.0
  osdmap e4344 pg 11.0 (11.0) -> up [] acting []
  # du -chs /var/lib/ceph/osd/osd.0/current/11.0_head/
  2.3G    /var/lib/ceph/osd/osd.0/current/11.0_head/
  2.3G    total

Metadata seems intact:

  # attr -l /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\\urepo0\\sApplication\\snoarch\\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
  Attribute "ceph._" has a 294 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
  Attribute "ceph._user.rgw.acl" has a 151 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
  Attribute "selinux" has a 28 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
  [root@compute1 ceph]# attr -g ceph._ -q /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\\urepo0\\sApplication\\snoarch\\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b > /tmp/ceph_
  [root@compute1 ceph]# ceph-dencoder type object_info_t import /tmp/ceph_ decode dump_json
  { "oid": { "oid": "30060.111_repo0\/Application\/noarch\/jboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm",
        "key": "",
        "snapid": -2,
        "hash": 589371368,
        "max": 0,
        "pool": 11,
        "namespace": ""},
    "category": "",
    "version": "156'7478",
    "prior_version": "0'0",
    "last_reqid": "client.30060.0:255989",
    "user_version": 7478,
    "size": 3490,
    "mtime": "2013-07-08 13:57:18.030270",
    "lost": 0,
    "flags": 0,
    "wrlock_by": "unknown.0.0:0",
    "snaps": [],
    "truncate_seq": 0,
    "truncate_size": 0,
    "watchers": {}}
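The same per-object sanity check can be run over an entire PG directory rather than one file at a time. A minimal sketch, assuming attr(1) and ceph-dencoder are installed and that the PG directory path matches the one above (the temp-file name is arbitrary):

```shell
#!/bin/sh
# Sketch: decode object_info_t from the ceph._ xattr of every object
# in a PG directory. Path below is the example PG from this ticket.
PGDIR=/var/lib/ceph/osd/osd.0/current/11.0_head

for obj in "$PGDIR"/*; do
    [ -f "$obj" ] || continue
    # Dump the raw ceph._ xattr, then decode it to JSON; skip objects
    # whose xattr is missing or unreadable.
    attr -g ceph._ -q "$obj" > /tmp/ceph_oi 2>/dev/null || continue
    echo "== $obj"
    ceph-dencoder type object_info_t import /tmp/ceph_oi decode dump_json
done
```

Any object whose xattr fails to decode would be a candidate for deeper inspection; here everything decoded cleanly, which is consistent with the data itself being intact.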

ceph-osd.0.log - ms = 20, osd = 20, first 20k lines after startup (3.29 MB) Jeff Bachtel, 05/08/2014 05:03 AM

History

#1 Updated by Jeff Bachtel about 6 years ago

I mentioned "repair ineffective" without detail. Specifically, I have tried pg repair on all stale pgs, osd scrubs, osd deep-scrubs, and osd repairs. As far as I can tell, there's no movement on any of it. I've bumped up thresholds to try to force scrubs to complete:

osd max scrubs = 200
osd scrub load threshold = 20
osd recovery max active = 300

I've also tried setting noscrub/nodeep-scrub in case scrub jobs were interfering with osd repair, no movement either way.
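For anyone retracing these steps: the thresholds above can also be pushed to running daemons without a restart, and the scrub flags toggled cluster-wide. A sketch, assuming the standard ceph CLI (the values mirror the ceph.conf settings quoted above):

```shell
# Inject the same scrub/recovery thresholds into all running OSDs:
ceph tell osd.* injectargs '--osd-max-scrubs 200 --osd-scrub-load-threshold 20 --osd-recovery-max-active 300'

# Disable scrubbing cluster-wide while testing whether scrub jobs
# interfere with repair:
ceph osd set noscrub
ceph osd set nodeep-scrub

# ...and re-enable afterwards:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```

None of this helps here, though, because (as the next comment shows) the pgs were never mapping to any OSDs in the first place.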

#2 Updated by Greg Farnum about 6 years ago

You'll generally have better luck with stuff like this on the mailing list. But I see that your PGs aren't mapping to any OSDs, so probably your CRUSH map got broken for some reason when restarting. Look at "ceph osd tree" and fix it up.

#3 Updated by Jeff Bachtel about 6 years ago

ceph osd tree revealed that I had used

ceph osd reweight osd# weight#

instead of

ceph osd crush reweight osd.osd# weight#

when I was reweighting for drive space. Sage caught it on ceph-users and set me straight. Cluster's recovering right now. If there are further problems they're unrelated to this particular issue. Closing with thanks to Sage and Greg.
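For reference, the two commands differ in scope: ceph osd reweight sets a temporary override weight in the range 0.0-1.0 on the OSD map, while ceph osd crush reweight changes the OSD's weight in the CRUSH map itself (conventionally the drive's capacity in TiB). A sketch of the distinction (osd id 0 and the 1.82 weight are illustrative values, not from this cluster):

```shell
# Temporary override weight, 0.0-1.0, shown in the REWEIGHT column
# of "ceph osd tree" (this is what was run by mistake here):
ceph osd reweight 0 0.8

# Persistent CRUSH weight, shown in the WEIGHT column (this is the
# command that was actually wanted for drive-space rebalancing):
ceph osd crush reweight osd.0 1.82

# Verify both columns:
ceph osd tree
```

Setting the CRUSH weight to 0 (or near it) via the wrong command is what left the pgs mapping to an empty OSD set, i.e. up [] acting [].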

#4 Updated by Loic Dachary almost 6 years ago

  • Status changed from New to Closed
