Support #8310
Most pgs stuck stale, no osds reporting them, repair ineffective
Description
After trying to resolve an issue with pgs stuck in cleaning, I restarted osds, and most of the pgs in the cluster now report stale with no osds active. ceph pg query hangs, and ceph pg map shows empty up and acting sets:
- ceph pg map 11.0
osdmap e4344 pg 11.0 (11.0) -> up [] acting []
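For context, a cluster-wide view of the same symptom could be pulled with something like the following (a sketch; only pg 11.0 above is taken from this report):
# summarize cluster health, including counts of stale/stuck pgs
ceph health detail
# list every pg currently stuck in the stale state
ceph pg dump_stuck stale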
- du -chs /var/lib/ceph/osd/osd.0/current/11.0_head/
2.3G /var/lib/ceph/osd/osd.0/current/11.0_head/
2.3G total
Metadata seems intact
- attr -l /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\\urepo0\\sApplication\\snoarch\\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
Attribute "ceph._" has a 294 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
Attribute "ceph._user.rgw.acl" has a 151 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
Attribute "selinux" has a 28 byte value for /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\urepo0\sApplication\snoarch\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b
[root@compute1 ceph]# attr -g ceph._ -q /var/lib/ceph/osd/osd.0/current/11.0_head/30060.111\\urepo0\\sApplication\\snoarch\\sjboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm__head_232117E8__b > /tmp/ceph_
[root@compute1 ceph]# ceph-dencoder type object_info_t import /tmp/ceph_ decode dump_json
{ "oid": { "oid": "30060.111_repo0\/Application\/noarch\/jboss-mod-evr-server-1.3.0-SNAPSHOT20130701200858.noarch.rpm",
"key": "",
"snapid": -2,
"hash": 589371368,
"max": 0,
"pool": 11,
"namespace": ""},
"category": "",
"version": "156'7478",
"prior_version": "0'0",
"last_reqid": "client.30060.0:255989",
"user_version": 7478,
"size": 3490,
"mtime": "2013-07-08 13:57:18.030270",
"lost": 0,
"flags": 0,
"wrlock_by": "unknown.0.0:0",
"snaps": [],
"truncate_seq": 0,
"truncate_size": 0,
"watchers": {}}
History
#1 Updated by Jeff Bachtel almost 9 years ago
I mentioned "repair ineffective" without detail. Specifically, I have tried pg repair on all stale pgs, osd scrubs, osd deep-scrubs, and osd repairs. As far as I can tell, none of them make any progress. I've bumped up thresholds to try to force scrubs to complete:
osd max scrubs = 200
osd scrub load threshold = 20
osd recovery max active = 300
I've also tried setting noscrub/nodeep-scrub in case scheduled scrub jobs were interfering with osd repair; that made no difference either way.
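For reference, a rough sketch of how those overrides and flags can be pushed to running osds (the injectargs form here is an assumption about how the values were applied; the values themselves are the ones listed above):
# inject the raised scrub/recovery limits into all running osds
ceph tell osd.* injectargs '--osd-max-scrubs 200 --osd-scrub-load-threshold 20 --osd-recovery-max-active 300'
# pause scheduled scrubbing so manual repairs are not queued behind it
ceph osd set noscrub
ceph osd set nodeep-scrub
# clear the flags again once done
ceph osd unset noscrub
ceph osd unset nodeep-scrub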
#2 Updated by Greg Farnum almost 9 years ago
You'll generally have better luck with stuff like this on the mailing list. But I see that your PGs aren't mapping to any OSDs, so probably your CRUSH map got broken for some reason when restarting. Look at "ceph osd tree" and fix it up.
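A quick way to act on that suggestion (a sketch; the thing to look for is an osd with a CRUSH weight of 0 or missing from the tree entirely):
# show the CRUSH hierarchy with each osd's crush WEIGHT and REWEIGHT columns
ceph osd tree
# the osdmap also shows the per-osd override (reweight) values
ceph osd dump | grep ^osd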
#3 Updated by Jeff Bachtel almost 9 years ago
ceph osd tree revealed that I had used
ceph osd reweight <id> <weight>
instead of
ceph osd crush reweight osd.<id> <weight>
when I was reweighting for drive space. Sage caught it on ceph-users and set me straight. The cluster is recovering right now. If there are further problems, they're unrelated to this particular issue. Closing with thanks to Sage and Greg.
#4 Updated by Loïc Dachary over 8 years ago
- Status changed from New to Closed