Support #24980

Pg Inconsistent - failed to pick suitable auth object

Added by Alon Avrahami almost 6 years ago. Updated almost 6 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
Reviewed:
Affected Versions:
Component(RADOS):
Pull request ID:

Description

Hi,

We have a Ceph cluster installed with Luminous 12.2.2 using BlueStore.
All nodes are Intel servers with 1.6TB Intel SSDs running Ubuntu 16.04.3/16.04.4.

While adding new OSDs to the cluster (one by one), we noticed PG inconsistent alerts involving all of the new OSDs.
We tried to figure out why it happened and decided to remove them from the cluster until we find a solution (though they may be the problem: all 4 out of 4 new nodes had inconsistent PGs on them).
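
For reference, the affected PGs were enumerated with roughly the following commands (the pool name is just a placeholder; the "ceph health detail" output is pasted further down):

# show the scrub errors and the list of damaged PGs
ceph health detail

# list the PGs flagged inconsistent in a given pool (pool name is an example)
rados list-inconsistent-pg <pool-name>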

The cluster status:

root@ecprdbcph10-opens:~# ceph -s
cluster:
id: 220fe94c-bd2a-475b-aeca-6ba8baaf67f3
health: HEALTH_ERR
2933 scrub errors
Possible data damage: 17 pgs inconsistent

services:
mon: 3 daemons, quorum ecprdbcph10-opens,ecprdbcph11-opens,ecprdbcph12-opens
mgr: ecprdbcph12-opens(active), standbys: ecprdbcph10-opens, ecprdbcph11-opens
osd: 320 osds: 319 up, 319 in
data:
pools: 5 pools, 8292 pgs
objects: 17894k objects, 71465 GB
usage: 205 TB used, 258 TB / 464 TB avail
pgs: 8274 active+clean
17 active+clean+inconsistent
1 active+clean+scrubbing
io:
client: 136 MB/s rd, 262 MB/s wr, 2567 op/s rd, 8593 op/s wr

ceph health detail:

HEALTH_ERR 2936 scrub errors; Possible data damage: 17 pgs inconsistent
OSD_SCRUB_ERRORS 2936 scrub errors
PG_DAMAGED Possible data damage: 17 pgs inconsistent
pg 2.199 is active+clean+inconsistent, acting [194,86,137]
pg 2.411 is active+clean+inconsistent, acting [16,283,294]
pg 2.442 is active+clean+inconsistent, acting [120,310,51]
pg 2.537 is active+clean+inconsistent, acting [235,119,40]
pg 2.732 is active+clean+inconsistent, acting [132,33,50]
pg 2.734 is active+clean+inconsistent, acting [88,316,237]
pg 2.7fc is active+clean+inconsistent, acting [126,134,7]
pg 2.923 is active+clean+inconsistent, acting [102,272,253]
pg 2.95b is active+clean+inconsistent, acting [102,149,119]
pg 2.966 is active+clean+inconsistent, acting [162,88,36]
pg 2.9b1 is active+clean+inconsistent, acting [5,42,303]
pg 2.9d9 is active+clean+inconsistent, acting [307,282,116]
pg 2.a10 is active+clean+inconsistent, acting [157,203,62]
pg 2.bb7 is active+clean+inconsistent, acting [12,152,104]
pg 2.c16 is active+clean+inconsistent, acting [154,164,81]
pg 2.cd7 is active+clean+inconsistent, acting [267,67,154]
pg 2.fb9 is active+clean+inconsistent, acting [242,132,7]

Tried to run "ceph pg repair 2.fb9", but it was unsuccessful:
it failed with a "failed to pick suitable auth object" error.
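
For reference, the sequence tried was roughly the following (standard Luminous commands, nothing custom):

# re-run a deep scrub on the affected PG first
ceph pg deep-scrub 2.fb9

# then attempt the repair; this is what produces the errors below
ceph pg repair 2.fb9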

log from /var/log/ceph/ceph.log:

2018-07-18 08:07:04.726417 osd.242 [ERR] 2.fb9 shard 7: soid 2:9dfee6ba:::rbd_data.5e55711aaec419.000000000000e5c6:head data_digest 0x3d09131b != data_digest 0xd489ac9f from auth oi 2:9dfee6ba:::rbd_data.5e55711aaec419.000000000000e5c6:head(226987'10189895 client.392080848.0:105233287 dirty|data_digest|omap_digest s 4194304 uv 10189895 dd d489ac9f od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:04.726423 osd.242 [ERR] 2.fb9 shard 132: soid 2:9dfee6ba:::rbd_data.5e55711aaec419.000000000000e5c6:head data_digest 0x3d09131b != data_digest 0xd489ac9f from auth oi 2:9dfee6ba:::rbd_data.5e55711aaec419.000000000000e5c6:head(226987'10189895 client.392080848.0:105233287 dirty|data_digest|omap_digest s 4194304 uv 10189895 dd d489ac9f od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:04.726429 osd.242 [ERR] 2.fb9 shard 242: soid 2:9dfee6ba:::rbd_data.5e55711aaec419.000000000000e5c6:head data_digest 0x3d09131b != data_digest 0xd489ac9f from auth oi 2:9dfee6ba:::rbd_data.5e55711aaec419.000000000000e5c6:head(226987'10189895 client.392080848.0:105233287 dirty|data_digest|omap_digest s 4194304 uv 10189895 dd d489ac9f od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:04.726433 osd.242 [ERR] 2.fb9 soid 2:9dfee6ba:::rbd_data.5e55711aaec419.000000000000e5c6:head: failed to pick suitable auth object
2018-07-18 08:07:16.607945 mon.ecprdbcph10-opens [ERR] Health check update: 2933 scrub errors (OSD_SCRUB_ERRORS)
2018-07-18 08:07:16.608023 mon.ecprdbcph10-opens [ERR] Health check update: Possible data damage: 17 pgs inconsistent (PG_DAMAGED)
2018-07-18 08:07:09.800852 osd.242 [ERR] 2.fb9 shard 7: soid 2:9dffa823:::rbd_data.a3b4982ad78e9f.0000000000001209:head data_digest 0x892117c8 != data_digest 0xcd09aede from auth oi 2:9dffa823:::rbd_data.a3b4982ad78e9f.0000000000001209:head(227188'10663601 client.377751309.0:125366300 dirty|data_digest|omap_digest s 4194304 uv 10663601 dd cd09aede od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:09.800856 osd.242 [ERR] 2.fb9 shard 132: soid 2:9dffa823:::rbd_data.a3b4982ad78e9f.0000000000001209:head data_digest 0x892117c8 != data_digest 0xcd09aede from auth oi 2:9dffa823:::rbd_data.a3b4982ad78e9f.0000000000001209:head(227188'10663601 client.377751309.0:125366300 dirty|data_digest|omap_digest s 4194304 uv 10663601 dd cd09aede od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:09.800861 osd.242 [ERR] 2.fb9 shard 242: soid 2:9dffa823:::rbd_data.a3b4982ad78e9f.0000000000001209:head data_digest 0x892117c8 != data_digest 0xcd09aede from auth oi 2:9dffa823:::rbd_data.a3b4982ad78e9f.0000000000001209:head(227188'10663601 client.377751309.0:125366300 dirty|data_digest|omap_digest s 4194304 uv 10663601 dd cd09aede od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:09.800864 osd.242 [ERR] 2.fb9 soid 2:9dffa823:::rbd_data.a3b4982ad78e9f.0000000000001209:head: failed to pick suitable auth object
2018-07-18 08:07:11.444986 osd.242 [ERR] 2.fb9 shard 7: soid 2:9dffd85e:::rbd_data.8417691f2d7d97.0000000000000328:head data_digest 0x394343cd != data_digest 0x20969b49 from auth oi 2:9dffd85e:::rbd_data.8417691f2d7d97.0000000000000328:head(227178'10553559 client.480517048.0:161698532 dirty|data_digest|omap_digest s 4194304 uv 10553559 dd 20969b49 od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:11.444998 osd.242 [ERR] 2.fb9 shard 132: soid 2:9dffd85e:::rbd_data.8417691f2d7d97.0000000000000328:head data_digest 0x394343cd != data_digest 0x20969b49 from auth oi 2:9dffd85e:::rbd_data.8417691f2d7d97.0000000000000328:head(227178'10553559 client.480517048.0:161698532 dirty|data_digest|omap_digest s 4194304 uv 10553559 dd 20969b49 od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:11.445004 osd.242 [ERR] 2.fb9 shard 242: soid 2:9dffd85e:::rbd_data.8417691f2d7d97.0000000000000328:head data_digest 0x394343cd != data_digest 0x20969b49 from auth oi 2:9dffd85e:::rbd_data.8417691f2d7d97.0000000000000328:head(227178'10553559 client.480517048.0:161698532 dirty|data_digest|omap_digest s 4194304 uv 10553559 dd 20969b49 od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:11.445009 osd.242 [ERR] 2.fb9 soid 2:9dffd85e:::rbd_data.8417691f2d7d97.0000000000000328:head: failed to pick suitable auth object
2018-07-18 08:07:11.445012 osd.242 [ERR] 2.fb9 shard 7: soid 2:9dffdf2a:::rbd_data.a3ccc8231bf6f4.0000000000006bf3:head data_digest 0xbae16ae3 != data_digest 0xd0e18ad6 from auth oi 2:9dffdf2a:::rbd_data.a3ccc8231bf6f4.0000000000006bf3:head(226987'10352362 client.361453256.0:153220873 dirty|data_digest|omap_digest s 4194304 uv 10352362 dd d0e18ad6 od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:11.445017 osd.242 [ERR] 2.fb9 shard 132: soid 2:9dffdf2a:::rbd_data.a3ccc8231bf6f4.0000000000006bf3:head data_digest 0xbae16ae3 != data_digest 0xd0e18ad6 from auth oi 2:9dffdf2a:::rbd_data.a3ccc8231bf6f4.0000000000006bf3:head(226987'10352362 client.361453256.0:153220873 dirty|data_digest|omap_digest s 4194304 uv 10352362 dd d0e18ad6 od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:11.445024 osd.242 [ERR] 2.fb9 shard 242: soid 2:9dffdf2a:::rbd_data.a3ccc8231bf6f4.0000000000006bf3:head data_digest 0xbae16ae3 != data_digest 0xd0e18ad6 from auth oi 2:9dffdf2a:::rbd_data.a3ccc8231bf6f4.0000000000006bf3:head(226987'10352362 client.361453256.0:153220873 dirty|data_digest|omap_digest s 4194304 uv 10352362 dd d0e18ad6 od ffffffff alloc_hint [4194304 4194304 0])
2018-07-18 08:07:11.445029 osd.242 [ERR] 2.fb9 soid 2:9dffdf2a:::rbd_data.a3ccc8231bf6f4.0000000000006bf3:head: failed to pick suitable auth object
2018-07-18 08:07:11.776308 osd.242 [ERR] 2.fb9 repair 213 errors, 0 fixed

Please help: how can we fix this problem?
We have tried a lot of suggestions from the net, but much of the information is not relevant to RBD/BlueStore on Luminous 12.2.2.

If needed, I will upload all the relevant logs for this case on request.

Thanks a lot.

#1

Updated by Alon Avrahami almost 6 years ago

After running a deep scrub on each inconsistent PG, I could get some information about the shards of each PG.
It seems the data digest is the same on all 3 copies on different OSDs, so there is no inconsistency between the digests of the replicas.
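
The deep scrub pass was roughly a loop like this (PG ids taken from the "ceph health detail" output above); the list-inconsistent-obj call is what produced the output linked below:

# deep-scrub every PG currently flagged inconsistent
for pg in 2.199 2.411 2.442 2.537 2.732 2.734 2.7fc 2.923 2.95b 2.966 2.9b1 2.9d9 2.a10 2.bb7 2.c16 2.cd7 2.fb9; do
    ceph pg deep-scrub $pg
done

# once the deep scrubs have finished, dump the per-shard details for one PG
rados list-inconsistent-obj 2.fb9 --format=json-pretty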

For example, here is the output of "rados list-inconsistent-obj 2.fb9":
https://pastebin.com/w8pGHRkQ

Could this be a metadata issue or bug?

#2

Updated by Patrick Donnelly almost 6 years ago

  • Project changed from Ceph to RADOS
  • Status changed from New to Rejected

Please seek assistance for these kinds of issues on ceph-users mailing list.

#3

Updated by Alon Avrahami almost 6 years ago

Patrick Donnelly wrote:

Please seek assistance for these kinds of issues on ceph-users mailing list.

Hi Patrick,

Thanks for the response.
I already talked with the people on the IRC channel.
I think it's better to mark this problem as a bug (solved) rather than as support. It happened due to the installation of version 12.2.6.

To solve it, I had to upgrade to version 12.2.7, which has the fix related to the data_digest errors.
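
Roughly the flow, for anyone hitting the same thing (this is just a sketch; the exact recovery guidance for the 12.2.6 digest regression is in the 12.2.7 release notes):

# confirm all daemons are running 12.2.7 after the upgrade
ceph versions

# then re-run repair on each PG that is still flagged inconsistent
ceph pg repair 2.fb9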

Just letting other people who may run into this know.
