Support #8310
Status: Closed
Most pgs stuck stale, no osds reporting them, repair ineffective
Description
After trying to resolve an issue with pgs stuck in cleaning, I restarted osds, and most of the pgs in the cluster now report stale, with no osds active. ceph pg query hangs, and ceph pg map shows the following:
# ceph pg map 11.0
osdmap e4344 pg 11.0 (11.0) -> up [] acting []

# du -chs /var/lib/ceph/osd/osd.0/current/11.0_head/
2.3G /var/lib/ceph/osd/osd.0/current/11.0_head/
2.3G total
Metadata seems intact:
# attr -l /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\\urepo0\\sApplication\\snoarch\\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
Attribute "ceph._" has a 294 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
Attribute "ceph._user.rgw.acl" has a 151 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
Attribute "selinux" has a 28 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
[root@compute1 ceph]# attr -g ceph._ -q /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\\urepo0\\sApplication\\snoarch\\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b > /tmp/ceph_
[root@compute1 ceph]# ceph-dencoder type object_info_t import /tmp/ceph_ decode dump_json
{ "oid": { "oid": "30060.111_repo0\/Application\/noarch\/jboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm",
"key": "",
"snapid": -2,
"hash": 589371368,
"max": 0,
"pool": 11,
"namespace": ""},
"category": "",
"version": "156'7478",
"prior_version": "0'0",
"last_reqid": "client.30060.0:255989",
"user_version": 7478,
"size": 3490,
"mtime": "2013-07-08 13:57:18.030270",
"lost": 0,
"flags": 0,
"wrlock_by": "unknown.0.0:0",
"snaps": [],
"truncate_seq": 0,
"truncate_size": 0,
"watchers": {}}
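As a side note, the on-disk filename is the logical object name run through FileStore's escaping: judging from the listing above against the decoded oid, `\u` stands for `_` and `\s` for `/`, with a `__head_<hash>__<pool>` suffix appended. A minimal sketch of reversing the escaping (illustrative only; the real FileStore scheme has more escape cases than these two, and the suffix is dropped from the input here):

```sh
# Translate a FileStore on-disk object filename (suffix already stripped)
# back to the logical object name: '\s' -> '/', '\u' -> '_'.
decode_name() {
    printf '%s\n' "$1" | sed -e 's/\\s/\//g' -e 's/\\u/_/g'
}

decode_name '30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm'
```

This reproduces the oid shown in the ceph-dencoder dump above.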
Updated by Jeff Bachtel almost 10 years ago
I mentioned "repair ineffective" without detail. Specifically, I have tried pg repair on all stale pgs, osd scrubs, osd deep-scrubs, and osd repairs. As far as I can tell, there's no movement on any of it. I've bumped up thresholds to try to force scrubs to complete:
osd max scrubs = 200
osd scrub load threshold = 20
osd recovery max active = 300
I've also tried setting noscrub/nodeep-scrub in case scrub jobs were interfering with osd repair, no movement either way.
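For reference, a sketch of how the same overrides can be applied at runtime without restarting OSDs (assuming a standard cuttlefish-era ceph CLI; the values are the ones from this ticket):

```sh
# Push the relaxed scrub/recovery thresholds to all running OSDs.
# This is runtime-only: ceph.conf still needs the same settings to
# persist them across restarts.
ceph tell osd.* injectargs '--osd-max-scrubs 200 --osd-scrub-load-threshold 20 --osd-recovery-max-active 300'

# Cluster-wide scrub flags, set/cleared to test whether scheduled
# scrubs interfere with repair:
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```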
Updated by Greg Farnum almost 10 years ago
You'll generally have better luck with stuff like this on the mailing list. But I see that your PGs aren't mapping to any OSDs, so probably your CRUSH map got broken for some reason when restarting. Look at "ceph osd tree" and fix it up.
Updated by Jeff Bachtel almost 10 years ago
ceph osd tree revealed that I had used
ceph osd reweight osd# weight#
instead of
ceph osd crush reweight osd.osd# weight#
when I was reweighting for drive space. Sage caught it on ceph-users and set me straight. Cluster's recovering right now. If there are further problems they're unrelated to this particular issue. Closing with thanks to Sage and Greg.