Bug #13862 (closed)

pgs stuck inconsistent after infernalis upgrade

Added by Logan V over 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%
Source:
other
Tags:
Backport:
infernalis
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During my infernalis upgrade I was running mismatched OSD versions for about 48 hours while I did a rolling chown/upgrade across ~30 hosts and ~200 OSDs. During the upgrade process I noticed that some pgs were going inconsistent. After finishing the upgrade across all OSDs, I had about 80 pgs marked inconsistent, all in the same EC pool.
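
For context, the chown step is needed because infernalis switches the daemons from running as root to running as the ceph user. A rough per-OSD sketch of that rolling step, assuming systemd units and the default /var/lib/ceph layout (unit names and paths may differ in your environment):

# stop the OSD, hand its data directory over to the ceph user, then bring it back up
systemctl stop ceph-osd@$ID
chown -R ceph:ceph /var/lib/ceph/osd/ceph-$ID
systemctl start ceph-osd@$ID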

I did a mass repair across these 80 pgs and many of them completed repair successfully.
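
A mass repair like this can be driven directly from the health output shown below; a minimal sketch, assuming the pgid is the second field of each inconsistent line:

# issue a repair for every pg currently reported as inconsistent
ceph health detail | awk '$1 == "pg" && /inconsistent/ {print $2}' | while read pg; do
    ceph pg repair "$pg"
done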

However, 36 did not repair and failed with the same error message. These 36 pgs have been stuck inconsistent for about a week now.

The inconsistent pgs:

# ceph health detail
HEALTH_ERR 36 pgs inconsistent; 80 scrub errors; noout flag(s) set
pg 33.f62 is active+clean+inconsistent, acting [143,136,77,39]
pg 33.e6c is active+clean+inconsistent, acting [133,105,74,67]
pg 33.e02 is active+clean+inconsistent, acting [67,114,104,91]
pg 33.d97 is active+clean+inconsistent, acting [108,68,58,101]
pg 33.d62 is active+clean+inconsistent, acting [138,100,46,141]
pg 33.d47 is active+clean+inconsistent, acting [89,23,69,77]
pg 33.caa is active+clean+inconsistent, acting [99,84,79,95]
pg 33.ca1 is active+clean+inconsistent, acting [88,144,108,107]
pg 33.c73 is active+clean+inconsistent, acting [94,66,90,116]
pg 33.c06 is active+clean+inconsistent, acting [135,116,63,138]
pg 33.be0 is active+clean+inconsistent, acting [137,112,102,82]
pg 33.bae is active+clean+inconsistent, acting [133,104,94,101]
pg 33.b48 is active+clean+inconsistent, acting [135,95,77,72]
pg 33.b21 is active+clean+inconsistent, acting [98,67,83,62]
pg 33.a5f is active+clean+inconsistent, acting [134,131,95,96]
pg 33.a46 is active+clean+inconsistent, acting [133,138,105,55]
pg 33.7b5 is active+clean+inconsistent, acting [110,63,75,83]
pg 33.3bf is active+clean+inconsistent, acting [135,98,116,88]
pg 33.329 is active+clean+inconsistent, acting [103,91,108,132]
pg 33.1ac is active+clean+inconsistent, acting [138,34,105,128]
pg 33.12b is active+clean+inconsistent, acting [81,79,100,74]
pg 33.279 is active+clean+inconsistent, acting [95,105,34,68]
pg 33.497 is active+clean+inconsistent, acting [90,63,129,94]
pg 33.537 is active+clean+inconsistent, acting [141,108,24,72]
pg 33.518 is active+clean+inconsistent, acting [95,140,67,79]
pg 33.5a9 is active+clean+inconsistent, acting [116,79,143,138]
pg 33.701 is active+clean+inconsistent, acting [96,40,140,104]
pg 33.594 is active+clean+inconsistent, acting [72,93,81,111]
pg 33.5d1 is active+clean+inconsistent, acting [81,133,77,136]
pg 33.632 is active+clean+inconsistent, acting [114,112,86,18]
pg 33.610 is active+clean+inconsistent, acting [112,142,103,68]
pg 33.6b5 is active+clean+inconsistent, acting [143,135,139,84]
pg 33.768 is active+clean+inconsistent, acting [89,92,108,24]
pg 33.8ab is active+clean+inconsistent, acting [95,138,99,116]
pg 33.8c4 is active+clean+inconsistent, acting [50,111,98,100]
pg 33.913 is active+clean+inconsistent, acting [74,138,113,72]

Output from ceph -w when I tell one of these pgs to repair:

2015-11-23 10:26:28.864457 mon.1 [INF] from='client.? 10.10.7.10:0/2688742338' entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "33.f62"}]: dispatch
2015-11-23 10:26:31.383884 osd.143 [INF] 33.f62 repair starts
2015-11-23 10:26:33.209568 osd.143 [ERR] 33.f62s0 shard 4(0): soid failed to pick suitable auth object
2015-11-23 10:26:33.209861 osd.143 [ERR] repair 33.f62s0 -1/00000000/temp_33.f62s0_0_93345312_70/head no 'snapset' attr
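
The temp object named in the error can be examined offline with ceph-objectstore-tool while the OSD is stopped; a minimal diagnostic sketch, assuming osd.143 with the default data and journal paths (option names may vary between releases):

# with osd.143 stopped, list the objects in the shard, then dump the attrs of one entry
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-143 \
    --journal-path /var/lib/ceph/osd/ceph-143/journal \
    --pgid 33.f62s0 --op list
# '<json-object-spec>' is one of the lines printed by --op list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-143 \
    --journal-path /var/lib/ceph/osd/ceph-143/journal \
    '<json-object-spec>' list-attrs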

I have also tried the following process to resolve this (a rough sketch of the commands appears after the list):
1) Stop the OSD that is failing the repair with this error
2) Delete the inconsistent pg from that OSD
3) Restart the OSD
4) Repair the pg again; the repair usually succeeds at this point
However, after waiting several hours the pg is marked inconsistent again.
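
A rough sketch of those four steps, assuming systemd units, osd.143, and pg shard 33.f62s0 (exact ceph-objectstore-tool options may differ by release; the removed shard is backfilled from the remaining copies after restart):

# 1) stop the OSD that fails the repair
systemctl stop ceph-osd@143
# 2) delete the inconsistent pg shard from that OSD's store
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-143 \
    --journal-path /var/lib/ceph/osd/ceph-143/journal \
    --pgid 33.f62s0 --op remove
# 3) restart the OSD and let recovery rebuild the shard
systemctl start ceph-osd@143
# 4) repair the pg again
ceph pg repair 33.f62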

This is a dev pool on older drives and older hosts, so I cannot rule out some host issue causing this, but the inconsistent pgs do not appear to have any particular osds in common, as I would usually see with host issues in the past. I would be happy to perform additional debugging, but I need a little guidance on what would be useful.


Related issues: 2 (0 open, 2 closed)

Related to Ceph - Bug #13381: osd/SnapMapper.cc: 282: FAILED assert(check(oid)) on hammer->jewel upgrade (status: Won't Fix, assignee: Sage Weil)

Copied to Ceph - Backport #14494: infernalis: pgs stuck inconsistent after infernalis upgrade (status: Resolved, assignee: Abhishek Varshney)