Support #8310
Status: Closed
Most pgs stuck stale, no osds reporting them, repair ineffective
Description
After trying to resolve an issue with pgs stuck in cleaning, I restarted osds, and most of the pgs in the cluster now report stale, with no osds active. ceph pg query hangs, and ceph pg map shows the following:
# ceph pg map 11.0
osdmap e4344 pg 11.0 (11.0) -> up [] acting []

# du -chs /var/lib/ceph/osd/osd.0/current/11.0_head/
2.3G /var/lib/ceph/osd/osd.0/current/11.0_head/
2.3G total
Metadata seems intact:
# attr -l /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\\urepo0\\sApplication\\snoarch\\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
Attribute "ceph._" has a 294 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
Attribute "ceph._user.rgw.acl" has a 151 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
Attribute "selinux" has a 28 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
[root@compute1 ceph]# attr -g ceph._ -q /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\\urepo0\\sApplication\\snoarch\\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b > /tmp/ceph_
[root@compute1 ceph]# ceph-dencoder type object_info_t import /tmp/ceph_ decode dump_json
{ "oid": { "oid": "30060.111_repo0\/Application\/noarch\/jboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm",
"key": "",
"snapid": -2,
"hash": 589371368,
"max": 0,
"pool": 11,
"namespace": ""},
"category": "",
"version": "156'7478",
"prior_version": "0'0",
"last_reqid": "client.30060.0:255989",
"user_version": 7478,
"size": 3490,
"mtime": "2013-07-08 13:57:18.030270",
"lost": 0,
"flags": 0,
"wrlock_by": "unknown.0.0:0",
"snaps": [],
"truncate_seq": 0,
"truncate_size": 0,
"watchers": {}}
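As a side note, the on-disk filename is the logical object name run through FileStore's escaping: judging from the listing above against the decoded oid, `\u` stands for `_` and `\s` for `/`, with a `__head_<hash>__<pool>` suffix appended. A minimal sketch of reversing the escaping (illustrative only; the real FileStore scheme has more escape cases than these two, and the suffix is dropped from the input here):

```sh
# Translate a FileStore on-disk object filename (suffix already stripped)
# back to the logical object name: '\s' -> '/', '\u' -> '_'.
decode_name() {
    printf '%s\n' "$1" | sed -e 's/\\s/\//g' -e 's/\\u/_/g'
}

decode_name '30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm'
```

This reproduces the oid shown in the ceph-dencoder dump above.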
Updated by Jeff Bachtel almost 10 years ago
I mentioned "repair ineffective" without detail. Specifically, I have tried pg repair on all stale pgs, osd scrubs, osd deep-scrubs, and osd repairs. As far as I can tell, there's no movement on any of it. I've bumped up thresholds to try to force scrubs to complete:
osd max scrubs = 200
osd scrub load threshold = 20
osd recovery max active = 300
I've also tried setting noscrub/nodeep-scrub in case scrub jobs were interfering with osd repair, no movement either way.
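For reference, a sketch of how the same overrides can be applied at runtime without restarting OSDs (assuming a standard cuttlefish-era ceph CLI; the values are the ones from this ticket):

```sh
# Push the relaxed scrub/recovery thresholds to all running OSDs.
# This is runtime-only: ceph.conf still needs the same settings to
# persist them across restarts.
ceph tell osd.* injectargs '--osd-max-scrubs 200 --osd-scrub-load-threshold 20 --osd-recovery-max-active 300'

# Cluster-wide scrub flags, set/cleared to test whether scheduled
# scrubs interfere with repair:
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```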
Updated by Greg Farnum almost 10 years ago
You'll generally have better luck with stuff like this on the mailing list. But I see that your PGs aren't mapping to any OSDs, so probably your CRUSH map got broken for some reason when restarting. Look at "ceph osd tree" and fix it up.
Updated by Jeff Bachtel almost 10 years ago
ceph osd tree revealed that I had used
ceph osd reweight osd# weight#
instead of
ceph osd crush reweight osd.osd# weight#
when I was reweighting for drive space. Sage caught it on ceph-users and set me straight. Cluster's recovering right now. If there are further problems they're unrelated to this particular issue. Closing with thanks to Sage and Greg.