Bug #13862
closed
pgs stuck inconsistent after infernalis upgrade
Description
During my infernalis upgrade I ran mixed OSD versions for about 48 hours while doing a rolling chown/upgrade across ~30 hosts and ~200 OSDs. During the upgrade I noticed that some pgs were going inconsistent. After the upgrade had finished on all OSDs, about 80 pgs were marked inconsistent, all in the same EC pool.
I did a mass repair across these 80 pgs and many of them completed repair successfully.
However, 36 did not repair and failed with the same error message. These 36 pgs have been stuck inconsistent for about a week now.
The inconsistent pgs:
# ceph health detail
HEALTH_ERR 36 pgs inconsistent; 80 scrub errors; noout flag(s) set
pg 33.f62 is active+clean+inconsistent, acting [143,136,77,39]
pg 33.e6c is active+clean+inconsistent, acting [133,105,74,67]
pg 33.e02 is active+clean+inconsistent, acting [67,114,104,91]
pg 33.d97 is active+clean+inconsistent, acting [108,68,58,101]
pg 33.d62 is active+clean+inconsistent, acting [138,100,46,141]
pg 33.d47 is active+clean+inconsistent, acting [89,23,69,77]
pg 33.caa is active+clean+inconsistent, acting [99,84,79,95]
pg 33.ca1 is active+clean+inconsistent, acting [88,144,108,107]
pg 33.c73 is active+clean+inconsistent, acting [94,66,90,116]
pg 33.c06 is active+clean+inconsistent, acting [135,116,63,138]
pg 33.be0 is active+clean+inconsistent, acting [137,112,102,82]
pg 33.bae is active+clean+inconsistent, acting [133,104,94,101]
pg 33.b48 is active+clean+inconsistent, acting [135,95,77,72]
pg 33.b21 is active+clean+inconsistent, acting [98,67,83,62]
pg 33.a5f is active+clean+inconsistent, acting [134,131,95,96]
pg 33.a46 is active+clean+inconsistent, acting [133,138,105,55]
pg 33.7b5 is active+clean+inconsistent, acting [110,63,75,83]
pg 33.3bf is active+clean+inconsistent, acting [135,98,116,88]
pg 33.329 is active+clean+inconsistent, acting [103,91,108,132]
pg 33.1ac is active+clean+inconsistent, acting [138,34,105,128]
pg 33.12b is active+clean+inconsistent, acting [81,79,100,74]
pg 33.279 is active+clean+inconsistent, acting [95,105,34,68]
pg 33.497 is active+clean+inconsistent, acting [90,63,129,94]
pg 33.537 is active+clean+inconsistent, acting [141,108,24,72]
pg 33.518 is active+clean+inconsistent, acting [95,140,67,79]
pg 33.5a9 is active+clean+inconsistent, acting [116,79,143,138]
pg 33.701 is active+clean+inconsistent, acting [96,40,140,104]
pg 33.594 is active+clean+inconsistent, acting [72,93,81,111]
pg 33.5d1 is active+clean+inconsistent, acting [81,133,77,136]
pg 33.632 is active+clean+inconsistent, acting [114,112,86,18]
pg 33.610 is active+clean+inconsistent, acting [112,142,103,68]
pg 33.6b5 is active+clean+inconsistent, acting [143,135,139,84]
pg 33.768 is active+clean+inconsistent, acting [89,92,108,24]
pg 33.8ab is active+clean+inconsistent, acting [95,138,99,116]
pg 33.8c4 is active+clean+inconsistent, acting [50,111,98,100]
pg 33.913 is active+clean+inconsistent, acting [74,138,113,72]
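For anyone wanting to script the mass repair mentioned above, here is a minimal sketch that pulls the inconsistent pg ids out of saved `ceph health detail` text and prints one `ceph pg repair` command per pg. The function names (`inconsistent_pgs`, `repair_commands`) and the regular expression are my own illustration, not part of Ceph, and the line format is assumed to match the output shown in this report. It only prints commands; it does not run them.

```python
import re

# Assumed line format, taken from the output in this report:
#   pg 33.f62 is active+clean+inconsistent, acting [143,136,77,39]
PG_LINE = re.compile(r"pg (\S+) is (\S+), acting \[([\d,]+)\]")


def inconsistent_pgs(health_detail):
    """Return {pgid: [acting osd ids]} for pgs whose state includes 'inconsistent'."""
    pgs = {}
    for pgid, state, acting in PG_LINE.findall(health_detail):
        if "inconsistent" in state.split("+"):
            pgs[pgid] = [int(osd) for osd in acting.split(",")]
    return pgs


def repair_commands(pgs):
    """Build one 'ceph pg repair <pgid>' command string per pg (not executed)."""
    return ["ceph pg repair " + pgid for pgid in pgs]


# Two lines copied from the report, used here only as sample input.
sample = (
    "HEALTH_ERR 36 pgs inconsistent; 80 scrub errors; noout flag(s) set\n"
    "pg 33.f62 is active+clean+inconsistent, acting [143,136,77,39]\n"
    "pg 33.e6c is active+clean+inconsistent, acting [133,105,74,67]\n"
)

for cmd in repair_commands(inconsistent_pgs(sample)):
    print(cmd)
```

Because the pattern is applied with `findall`, it should also cope with output that has been pasted as a single long line.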
Output from ceph -w when I tell one of these pgs to repair:
2015-11-23 10:26:28.864457 mon.1 [INF] from='client.? 10.10.7.10:0/2688742338' entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "33.f62"}]: dispatch
2015-11-23 10:26:31.383884 osd.143 [INF] 33.f62 repair starts
2015-11-23 10:26:33.209568 osd.143 [ERR] 33.f62s0 shard 4(0): soid failed to pick suitable auth object
2015-11-23 10:26:33.209861 osd.143 [ERR] repair 33.f62s0 -1/00000000/temp_33.f62s0_0_93345312_70/head no 'snapset' attr
I have also tried the following process to resolve this:
1) Stopped the OSD that fails the repair with this error
2) Deleted the inconsistent pg from that OSD
3) Restarted the OSD
4) Repaired the pg again; the repair usually succeeds at this point
However, after several hours the pg is marked inconsistent again.
This is a dev pool on older drives, older hosts, etc. so I cannot rule out some host issue causing this, but I notice that the inconsistent pgs do not appear to have any particular osds in common as I would usually see with host issues in the past. I would be happy to perform additional debugging, but I need a little guidance on what would be useful.
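One way to check the "no OSDs in common" observation more systematically is to count how often each OSD id appears in the acting sets of the inconsistent pgs. The sketch below does that over saved `ceph health detail` text. The helper name `osd_counts` and the regular expression are assumptions of mine, and the three sample lines are copied from the listing above purely to exercise the code, so they are not representative of all 36 pgs.

```python
import re
from collections import Counter

# Assumed line format, as in the 'ceph health detail' output in this report.
ACTING = re.compile(r"pg \S+ is \S+, acting \[([\d,]+)\]")


def osd_counts(health_detail):
    """Count how many listed pgs each OSD id appears in (via the acting set)."""
    counts = Counter()
    for acting in ACTING.findall(health_detail):
        counts.update(int(osd) for osd in acting.split(","))
    return counts


# Three lines from the report, hand-picked as sample input only.
sample = (
    "pg 33.f62 is active+clean+inconsistent, acting [143,136,77,39]\n"
    "pg 33.d47 is active+clean+inconsistent, acting [89,23,69,77]\n"
    "pg 33.5d1 is active+clean+inconsistent, acting [81,133,77,136]\n"
)

# Print the OSDs that appear most often in the sample.
for osd, n in osd_counts(sample).most_common(2):
    print("osd.%d appears in %d pgs" % (osd, n))
```

Run against the full listing, a strongly skewed count would point at a specific OSD or host, while a flat distribution would support the impression that no single OSD is responsible.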