Bug #21040

bluestore: multiple objects (clones?) referencing same blocks (on all replicas)

Added by Charles Alva over 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi Ceph,

I've been using Ceph Luminous since 12.0.x and it has been running well. But since Ceph Luminous 12.1.3 and 12.1.4 I keep getting 1 or 2 PGs in active+clean+inconsistent. Unfortunately, there is no official documentation on how to repair these on BlueStore. Some of the inconsistencies show read errors on all three peers, while others report a blank/null inconsistency list, as shown below. Out of frustration, and since this is not production, after deleting the objects one by one and still getting inconsistencies, I decided to remove all the inconsistent objects at once.

How do I fix this kind of issue the right way, without losing data? How does Ceph BlueStore handle HDD bad sectors; shouldn't it handle them automatically like the old FileStore did? And does BlueStore periodically trim SSDs at runtime on its own to keep write performance optimal?

# ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
    pg 1.6 is active+clean+inconsistent, acting [3,4,1]
    pg 1.26 is active+clean+inconsistent, acting [1,4,3]

root@ceph:~# rados list-inconsistent-obj 1.6 --format=json-pretty
{
    "epoch": 648,
    "inconsistents": []
}

root@ceph:~# rados list-inconsistent-obj 1.26
{"epoch":648,"inconsistents":[]}root@ceph:~# rados list-inconsistent-obj 1.26 --format=json-pretty
{
    "epoch": 648,
    "inconsistents": []
}

root@ceph:~# ceph pg ls inconsistent
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES      LOG  DISK_LOG STATE                     STATE_STAMP                VERSION    REPORTED    UP      UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           
1.6        1042                  0        0         0       0 4270715526 1523     1523 active+clean+inconsistent 2017-08-18 13:16:02.304114 650'540008 650:1063869 [3,4,1]          3 [3,4,1]              3 650'539996 2017-08-18 13:16:02.304074      644'539753 2017-08-18 10:48:55.007583 
1.26       1082                  0        0         0       0 4424603286 1559     1559 active+clean+inconsistent 2017-08-18 13:15:59.676249 650'463621 650:1073114 [1,4,3]          1 [1,4,3]              1 650'463554 2017-08-18 13:15:59.676214      644'462902 2017-08-18 10:51:48.518118 

root@ceph:~# rados ls -p rbd | grep -i rbd_data.196f8574b0dc51.0000000000000a32
rbd_data.196f8574b0dc51.0000000000000a32

root@ceph:~# rados ls -p rbd | grep -i rbd_data.196f8574b0dc51.0000000000000d71
rbd_data.196f8574b0dc51.0000000000000d71
root@proxmox1:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-1/ --pgid 1.6 --op list rbd_data.196f8574b0dc51.0000000000000a32
["1.6",{"oid":"rbd_data.196f8574b0dc51.0000000000000a32","key":"","snapid":-2,"hash":2399151814,"max":0,"pool":1,"namespace":"","max":0}]
root@proxmox2:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-3/ --pgid 1.6 --op list rbd_data.196f8574b0dc51.0000000000000a32
["1.6",{"oid":"rbd_data.196f8574b0dc51.0000000000000a32","key":"","snapid":-2,"hash":2399151814,"max":0,"pool":1,"namespace":"","max":0}]
root@proxmox3:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-4/ --pgid 1.6 --op list rbd_data.196f8574b0dc51.0000000000000a32
["1.6",{"oid":"rbd_data.196f8574b0dc51.0000000000000a32","key":"","snapid":-2,"hash":2399151814,"max":0,"pool":1,"namespace":"","max":0}]

root@proxmox1:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-1/ --pgid 1.26 --op list rbd_data.196f8574b0dc51.0000000000000d71
["1.26",{"oid":"rbd_data.196f8574b0dc51.0000000000000d71","key":"","snapid":-2,"hash":3960199526,"max":0,"pool":1,"namespace":"","max":0}]
root@proxmox2:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-3/ --pgid 1.26 --op list rbd_data.196f8574b0dc51.0000000000000d71
["1.26",{"oid":"rbd_data.196f8574b0dc51.0000000000000d71","key":"","snapid":-2,"hash":3960199526,"max":0,"pool":1,"namespace":"","max":0}]
root@proxmox3:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-4/ --pgid 1.26 --op list rbd_data.196f8574b0dc51.0000000000000d71
["1.26",{"oid":"rbd_data.196f8574b0dc51.0000000000000d71","key":"","snapid":-2,"hash":3960199526,"max":0,"pool":1,"namespace":"","max":0}]

root@proxmox1:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-1/ --pgid 1.6 rbd_data.196f8574b0dc51.0000000000000a32 removeall
root@proxmox1:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-1/ --pgid 1.26 rbd_data.196f8574b0dc51.0000000000000d71 removeall
root@proxmox2:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-3/ --pgid 1.6 rbd_data.196f8574b0dc51.0000000000000a32 removeall
root@proxmox2:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-3/ --pgid 1.26 rbd_data.196f8574b0dc51.0000000000000d71 removeall
root@proxmox3:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-4/ --pgid 1.6 rbd_data.196f8574b0dc51.0000000000000a32 removeall
root@proxmox3:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-4/ --pgid 1.26 rbd_data.196f8574b0dc51.0000000000000d71 removeall

2017-08-18 07:10:07.277119 7f9b955df700  0 log_channel(cluster) log [DBG] : 1.6 repair starts
2017-08-18 07:11:32.119052 7f9b955df700 -1 log_channel(cluster) log [ERR] : 1.6 repair stat mismatch, got 1042/1043 objects, 16/16 clones, 1042/1043 dirty, 1/1 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 4266668678/4270862982 bytes, 0/0 hit_set_archive bytes.
2017-08-18 07:11:32.119138 7f9b955df700 -1 log_channel(cluster) log [ERR] : 1.6 repair 1 errors, 1 fixed

2017-08-18 07:11:32.321422 7ff8d90c9700  0 log_channel(cluster) log [DBG] : 1.26 repair starts
2017-08-18 07:13:01.834640 7ff8d90c9700 -1 log_channel(cluster) log [ERR] : 1.26 repair stat mismatch, got 1081/1082 objects, 24/24 clones, 1081/1082 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 4420490902/4424685206 bytes, 0/0 hit_set_archive bytes.
2017-08-18 07:13:01.834743 7ff8d90c9700 -1 log_channel(cluster) log [ERR] : 1.26 repair 1 errors, 1 fixed

root@ceph:~# ceph health detail
HEALTH_OK

Kind regards,

Charles Alva

History

#1 Updated by Edward Huyer over 6 years ago

I believe I'm also encountering this issue. Here's the root of the mailing list thread discussing my issue: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-August/020204.html

The output of running ceph-bluestore-tool fsck on the primary OSD: https://pastebin.com/nZ0H5ag3
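
For reference, BlueStore's fsck is normally run with the OSD daemon stopped; a minimal sketch, with the OSD id as a placeholder:

systemctl stop ceph-osd@<N>
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<N>
systemctl start ceph-osd@<N>

A deeper pass that also validates object data can be requested with the tool's --deep option.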

#2 Updated by Brad Hubbard over 6 years ago

  • Project changed from Ceph to RADOS
  • Priority changed from Normal to High
  • Source set to Community (user)
  • Component(RADOS) BlueStore added

#4 Updated by Edward Huyer over 6 years ago

Got another inconsistent PG that may or may not be related. It just cropped up overnight. "rados list-inconsistent-obj" actually gives output on this one (pasted below). Nothing in the logs that I can find. A repair corrected it. I didn't think to run an fsck on one of the OSDs until I had already done the repair.

[root@hydra4 ~]# rados list-inconsistent-obj 9.37 --format=json-pretty
{
    "epoch": 68096,
    "inconsistents": [
        {
            "object": {
                "name": "rbd_data.33992ae8944a.0000000000002001",
                "nspace": "",
                "locator": "",
                "snap": 14,
                "version": 1461996
            },
            "errors": [],
            "union_shard_errors": [
                "read_error" 
            ],
            "selected_object_info": "9:ecce40ee:::rbd_data.33992ae8944a.0000000000002001:e(68025'2454038 osd.9.0:34400 dirty|data_digest|omap_digest s 4194304 uv 1461996 dd 43d61c5d od ffffffff alloc_hint [0 0 0])",
            "shards": [
                {
                    "osd": 42,
                    "errors": [
                        "read_error" 
                    ],
                    "size": 4194304
                },
                {
                    "osd": 51,
                    "errors": [],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x43d61c5d" 
                },
                {
                    "osd": 61,
                    "errors": [
                        "read_error" 
                    ],
                    "size": 4194304
                }
            ]
        }
    ]
}

#5 Updated by Brad Hubbard over 6 years ago

I notice trim mentioned. Are both/any of you somehow manually running trim on these devices?

#6 Updated by Charles Alva over 6 years ago

Brad Hubbard wrote:

I notice trim mentioned. Are both/any of you somehow manually running trim on these devices?

Not in the dev environment (Luminous), but yes, we are using NVMe as the Ceph Jewel journal in production.

I use old HDDs (more than 2 years old) to test the Ceph Luminous RC, and almost all of them show some kind of raw read errors or bad sectors when Ceph Luminous performs a deep scrub.

That's why I'm asking how BlueStore handles disk bad sectors and SSD trim, and whether BlueStore supports periodic fstrim at runtime.

#7 Updated by Edward Huyer over 6 years ago

No trim for me, unless RHEL7 is doing something dumb without my knowledge. All my OSDs are 100% spinning rust. The disks are also brand new.

#8 Updated by Edward Huyer over 6 years ago

Also, another inconsistent pg this morning.

[root@hydra4 ~]# rados list-inconsistent-obj 9.26 --format=json-pretty
{
    "epoch": 68115,
    "inconsistents": [
        {
            "object": {
                "name": "rbd_data.33992ae8944a.0000000000002007",
                "nspace": "",
                "locator": "",
                "snap": 14,
                "version": 0
            },
            "errors": [],
            "union_shard_errors": [
                "read_error" 
            ],
            "shards": [
                {
                    "osd": 29,
                    "errors": [
                        "read_error" 
                    ],
                    "size": 4194304
                },
                {
                    "osd": 47,
                    "errors": [
                        "read_error" 
                    ],
                    "size": 4194304
                },
                {
                    "osd": 64,
                    "errors": [
                        "read_error" 
                    ],
                    "size": 4194304
                }
            ]
        }
    ]
}

ceph-bluestore-tool fsck on OSD 47: https://pastebin.com/dSG6z3eh

The repair was unsuccessful, and after the fsck and repair the output of list-inconsistent-obj is as follows:

[root@hydra4 ~]# rados list-inconsistent-obj 9.26 --format=json-pretty
{
    "epoch": 68132,
    "inconsistents": []
}

#9 Updated by Brad Hubbard over 6 years ago

It looks like this could be a problem with the allocator. I'm looking into whether we can switch allocators as a test.

#10 Updated by Charles Alva over 6 years ago

Brad Hubbard wrote:

It looks like this could be a problem with the allocator. I'm looking into whether we can switch allocators as a test.

Thanks for looking into this, Brad.

Is the data still safe? Or is it really corrupted?

I have experienced some pg inconsistencies with read errors across all OSDs like Edward's as well. Bluestore fsck and ceph pg repair could not fix them.

I had to delete the rbd image which contained the problematic pgs and restore it from backup. Ceph has been running fine since then.

#11 Updated by Brad Hubbard over 6 years ago

Charles Alva wrote:

Is the data still safe? Or is it really corrupted?

Still looking into the answer to this.

#12 Updated by Brad Hubbard over 6 years ago

In order to test whether the bitmap allocator is the culprit here you can add the following to ceph.conf and restart the OSDs.

bluefs_allocator = stupid

This is the default allocator going forward anyway so it would be a good data point to see whether the issue recurs using this allocator.
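
A minimal ceph.conf sketch of that change (placing it under [osd] is an assumption; [global] also works):

[osd]
bluefs_allocator = stupid

followed by restarting each OSD, e.g. systemctl restart ceph-osd@<N> on every node.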

#13 Updated by Charles Alva over 6 years ago

Brad Hubbard wrote:

In order to test whether the bitmap allocator is the culprit here you can add the following to ceph.conf and restart the OSDs.

bluefs_allocator = stupid

This is the default allocator going forward anyway so it would be a good data point to see whether the issue recurs using this allocator.

Noted. Will try this today. Thanks!

#14 Updated by Sage Weil over 6 years ago

  • Subject changed from How to fix Ceph Luminous 12.1.4 pg active+clean+inconsistent on Bluestore? to bluestore: multiple objects (clones?) referencing same blocks (on all replicas)
  • Status changed from New to Need More Info

The fact that the same range on the object had a bad csum on all 3 replicas suggests a (reproducible) bluestore bug.

If you are able to reproduce this, I would love to see a debug bluestore = 20 log of fsck output:

CEPH_ARGS="--log-file c --debug-bluestore 20" ceph-objectstore-tool --op fsck --data-path /var/lib/ceph/osd/ceph-NNN

Thank you!

#15 Updated by Charles Alva over 6 years ago

Sage Weil wrote:

The fact that the same range on the object had a bad csum on all 3 replicas suggests a (reproducible) bluestore bug.

If you are able to reproduce this, I would love to see a debug bluestore = 20 log of fsck output:
[...]
Thank you!

Thanks Sage. Will do if I hit this error again in the future.

Just upgraded to 12.2.0 flawlessly. Congrats!

#16 Updated by Edward Huyer over 6 years ago

I haven't seen any new inconsistencies since I switched to the "stupid" bluefs allocator Sunday night.

I'll see if I can get a debug bluestore = 20 output on one of my existing problem pgs some time soon.

#17 Updated by Brad Hubbard over 6 years ago

  • Assignee set to Brad Hubbard
  • ceph-qa-suite rados added

Excellent Edward, thanks for the update.

#18 Updated by Nicolas Drufin over 6 years ago

I have the same problem; my Ceph cluster has 3 PGs in active+clean+inconsistent. When I try ceph pg repair, the PG status gains scrub+deep+repair, but after about 5 seconds nothing happens.
When I tail -f /var/log/ceph/ceph-osd.4.log:

2017-09-04 14:16:38.709553 7ff71cd01700  0 log_channel(cluster) log [INF] : 6.c4 repair starts
2017-09-04 14:16:38.783564 7ff71cd01700 -1 bluestore(/var/lib/ceph/osd/ceph-4) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xe6d994d3, expected 0xca996b60, device location [0x75a160000~1000], logical extent 0x0~1000, object #6:2301a10f:::rbd_data.5bb7174b0dc51.000000000000da0a:17#
2017-09-04 14:16:39.879037 7ff71cd01700 -1 log_channel(cluster) log [ERR] : 6.c4 soid 6:2301a10f:::rbd_data.5bb7174b0dc51.000000000000da0a:17: failed to pick suitable object info
2017-09-04 14:16:58.636751 7ff71cd01700 -1 log_channel(cluster) log [ERR] : 6.c4 repair 2 errors, 0 fixed

When I run rados list-inconsistent-obj 6.c4 --format=json-pretty:
{
    "epoch": 2295,
    "inconsistents": []
}

I tried adding bluefs_allocator = stupid to /etc/ceph/ceph.conf (in the common section) and restarted all OSDs on all nodes, but nothing changed. I can run a specific fsck command if it helps resolve this bug.

#19 Updated by Brad Hubbard over 6 years ago

Nicolas,

Could you run the command Sage posted in comment #14?

Changing the allocator will not repair an existing issue, AIUI, but it should stop further issues from happening, as there is evidence to suggest the bug is in the bitmap allocator.

#20 Updated by Nicolas Drufin over 6 years ago

I have run the command, but the log file is about 500 MB. What is the easiest way to send it to you?

#21 Updated by Brad Hubbard over 6 years ago

http://docs.ceph.com/docs/master/man/8/ceph-post-file/

Let us know the identifier (tag) when it completes (and be sure to compress it of course).
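
Roughly, with the file name as a placeholder:

gzip osd-fsck.log
ceph-post-file osd-fsck.log.gz

ceph-post-file should print the tag once the upload completes.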

#22 Updated by Nicolas Drufin over 6 years ago

You can find it here: ceph-post-file: 258cf93d-a195-4996-af73-dd88aaf858cf
Thanks for your attention.

#23 Updated by Nicolas Drufin over 6 years ago

Since then I have destroyed the OSD and recreated it with FileStore, but the active+clean+inconsistent state persists. In ceph-osd.<osd-num>.log I see a problem with a missing clone, so I used the following command with the <objectid> found in the log file:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<osd-num>/ <objectid> remove-clone-metadata <cloneid>

Note: I do not know why, but the <cloneid> differs from the one in the log file, so I used the same command with dump instead of remove-clone-metadata <cloneid> to look up the correct id.
The OSD needs to be stopped during the operation, and the PG repaired after the OSD is restarted.
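
A minimal sketch of that flow, with the OSD number, object id, clone id and pg id all as placeholders:

systemctl stop ceph-osd@<osd-num>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<osd-num>/ '<objectid>' dump
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<osd-num>/ '<objectid>' remove-clone-metadata <cloneid>
systemctl start ceph-osd@<osd-num>
ceph pg repair <pgid>

The dump output shows the clone ids actually recorded in the object's snapset, which is where the <cloneid> passed to remove-clone-metadata comes from.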

#24 Updated by Sage Weil over 6 years ago

Ed, Charles, Nicolas: can you please share which version bluestore was initially deployed with, and also describe your general workload? (RBD? replicated pool? with snapshots?)

Thanks!

#25 Updated by Charles Alva over 6 years ago

Sage Weil wrote:

Ed, Charles, Nicolas: can you please share which version bluestore was initially deployed with, and also describe your general workload? (RBD? replicated pool? with snapshots?)

Thanks!

Hi Sage,

BlueStore was initially deployed with the first 12.1.x RC and upgraded through every minor version up to 12.1.4 RC, which produced the errors, then upgraded to the latest 12.2.0 stable.

Mostly an RBD workload from the start, with RBD snapshots invoked by qemu occasionally. Starting from 12.1.4 RC, I also deployed CephFS to store the VMs' image backups and ISO images.

I commented out the "bluefs_allocator = stupid" line in ceph.conf when upgrading to 12.2.0 stable, and I haven't encountered any errors since deleting the RBD image and restoring it from the VM image backup 9 days ago.

#26 Updated by Nicolas Drufin over 6 years ago

Sage Weil wrote:

Ed, Charles, Nicolas: can you please share which version bluestore was initially deployed with, and also describe your general workload? (RBD? replicated pool? with snapshots?)

Thanks!

My Ceph version is 12.1.2 and we use it for RBD to store LXC containers with Proxmox. The incidents occurred when we launched backup tasks on large LXC containers (up to 100 GB) with Proxmox, and this had side effects on the Ceph cluster.

Now that I have removed the clone metadata, the Ceph cluster works well.

#27 Updated by Edward Huyer over 6 years ago

Sage Weil wrote:

Ed, Charles, Nicolas: can you please share which version bluestore was initially deployed with, and also describe your general workload? (RBD? replicated pool? with snapshots?)

I started using Bluestore with Luminous 12.1.0. The workload is 100% RBD on replicated pools, with a mixture of kernel driver access and kvm/libvirt access via Proxmox 5. Proxmox is accessing one pool, while kernel access is to two others. Snapshots are an occasional thing, but not on a routine basis.

A few notes: I migrated a substantial amount of data (>60T) from old filestore OSDs to new Bluestore OSDs with no apparent issues. It was only later that the inconsistencies started cropping up. Further, it seems like the inconsistencies only appeared in the pool Proxmox accesses, though that may have simply been because that was the pool with the most traffic at the time.

#28 Updated by Brad Hubbard over 6 years ago

Ed, Charles, Nicolas,

So all three of you are using Proxmox? Could you let us know if you are currently using, or have used, Proxmox ceph packages (as opposed to packages released by the ceph project itself)?

#29 Updated by Nicolas Drufin over 6 years ago

I use Ceph with the Proxmox 5 packages: ceph version 12.1.2 (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc), with pve-manager/5.0-30/5ab26bc (running kernel: 4.10.17-2-pve).

#30 Updated by Charles Alva over 6 years ago

Brad Hubbard wrote:

Ed, Charles, Nicolas,

So all three of you are using Proxmox? Could you let us know if you are currently using, or have used, Proxmox ceph packages (as opposed to packages released by the ceph project itself)?

Hi Brad,

Yes, I'm using Proxmox VE 5.0 (based on Debian 9.1) but with Ceph upstream packages. Didn't use the Proxmox Ceph packages as we have to manually deploy CephFS and they do not support ceph-deploy.

#31 Updated by Sage Weil over 6 years ago

Ok, unless you've seen new inconsistencies appear since 12.2.0, I think this is from #20983, fixed by d5ba7061ee588c232138af1d880faf09be4adeed, which appeared (backported) in 12.1.4.

We need to build a repair mode for bluestore fsck that can clean up inconsistencies like this (I'm sure we'll get future bugs that do similar damage).

In the meantime, are you folks able to sit tight with the current inconsistencies?

#32 Updated by Charles Alva over 6 years ago

I'm fine with it. Thanks for looking into this, Sage.

#33 Updated by Edward Huyer over 6 years ago

I can sit tight for a while if necessary. That said, if there is a straightforward and reliable way to fix it now (e.g., by destroying and recreating the affected OSDs one at a time), I'd prefer that, just to clear the cluster ERR state and stop the accumulation of old monitor maps.

Either way, thank you all for the help.

#34 Updated by WANG Guoqin over 6 years ago

I get a lot of inconsistent PGs spread across all 19 of my OSDs every time I run a deep scrub. Without a deep scrub, or with only a regular scrub, they are not found. Most of them are read_error, but some others show "nothing inconsistent" in list-inconsistent-obj. (Right now all the errors I can find are read_error.)

There were few at the beginning, and they keep increasing. Doing a deep scrub on all OSDs and repairing all the inconsistent PGs seems to fix the problem, but another deep scrub run immediately afterwards finds other ones, mostly not the same PGs as last time. For example, the last deep scrub I ran half an hour ago found 43 inconsistent objects, and this time there are 14.

ceph-bluestore-tool fsck doesn't fix the problems. And what's worse, sometimes they bring down OSDs. An OSD may suddenly fail to start, with

bluestore(/var/lib/ceph/osd/ceph-13) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, ...

and

osd.13 0 OSD::init() : unable to read osd superblock

ceph-bluestore-tool fsck was not able to fix that either, and the only thing I could do was purge and recreate the OSD. Hopefully two OSDs don't fail at the same time and this doesn't result in data loss, but I'm not sure about that.
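
For the record, the purge-and-recreate path on Luminous looks roughly like this (OSD id and device are placeholders, and ceph-volume is an assumption; ceph-disk deployments differ):

systemctl stop ceph-osd@<N>
ceph osd purge <N> --yes-i-really-mean-it
ceph-volume lvm zap /dev/<device>
ceph-volume lvm create --bluestore --data /dev/<device>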

The cluster was rebuilt after rescuing data from the previous cluster, which had been created quite a long time ago. This kind of error already happened on the previous cluster, so I decided to tar all the files from it, create a new cluster, then extract the files into it. The new cluster was created a week ago with 12.2.1; by that time the stupid allocator should already have been the default, I think.

I'm using cephfs on the cluster, with multiple MDS, and I'm not using Proxmox.

#35 Updated by Brad Hubbard over 6 years ago

Guoqin,

Could you run the command Sage posted in comment #14 as well as the following?

ceph daemon osd.N config show|grep allocator

Where 'N' is the number of one of the osds local to the machine.

#36 Updated by WANG Guoqin over 6 years ago

Brad Hubbard wrote:

Guoqin,

Could you run the command Sage posted in comment #14 as well as the following?

[...]

Where 'N' is the number of one of the osds local to the machine.

$ sudo ceph daemon osd.9 config show |grep allocator
"bluefs_allocator": "stupid",
"bluestore_allocator": "stupid",
"bluestore_bitmapallocator_blocks_per_zone": "1024",
"bluestore_bitmapallocator_span_size": "1024",

#37 Updated by WANG Guoqin over 6 years ago

And some other results,

$ sudo ceph pg dump |grep inconsistent
dumped all
1.3c6       423                  0        0         0       0 19757162 1507     1507 active+clean+inconsistent 2017-11-22 14:12:27.889745  7289'1725   7776:19232  [2,10,16]          2  [2,10,16]              2  7289'1725 2017-11-22 14:12:27.889057       7289'1725 2017-11-22 14:12:27.889057 
1.3a1       467                  0        0         0       0 54700693 1559     1559 active+clean+inconsistent 2017-11-22 14:04:34.226580  7736'1659   7776:10068    [3,6,9]          3    [3,6,9]              3  7736'1659 2017-11-22 14:04:34.224386       7736'1659 2017-11-22 14:04:34.224386 
1.34f       470                  0        0         0       0 19345117 1532     1532 active+clean+inconsistent 2017-11-22 13:59:48.431118  7283'1732   7776:19062  [4,18,13]          4  [4,18,13]              4  7283'1732 2017-11-22 13:59:48.430295       7283'1732 2017-11-22 13:59:48.430295 
1.310       492                  0        0         0       0 43605779 1514     1514 active+clean+inconsistent 2017-11-22 14:11:42.216931  7279'1814   7776:11269   [16,7,4]         16   [16,7,4]             16  7279'1814 2017-11-22 14:11:42.215517       7279'1814 2017-11-22 14:11:42.215517 
1.2f0       523                  0        0         0       0 38966027 1510     1510 active+clean+inconsistent 2017-11-22 14:05:29.599383  7736'1860   7776:15443   [2,11,6]          2   [2,11,6]              2  7736'1860 2017-11-22 14:05:29.598908       7736'1860 2017-11-22 14:05:29.598908 
1.124       464                  0        0         0       0 51081711 1526     1526 active+clean+inconsistent 2017-11-22 14:04:23.364628  7776'1671   7776:16745  [11,8,18]         11  [11,8,18]             11  7776'1671 2017-11-22 14:04:23.363316       7776'1671 2017-11-22 14:04:23.363316 
1.df        470                  0        0         0       0 24753615 1538     1538 active+clean+inconsistent 2017-11-22 14:03:03.474101  7287'1738   7776:25925    [9,4,3]          9    [9,4,3]              9  7287'1738 2017-11-22 14:03:03.473501       7287'1738 2017-11-22 14:03:03.473501 
1.dc        468                  0        0         0       0 50180203 1557     1557 active+clean+inconsistent 2017-11-22 14:06:02.745203  7276'1857   7776:16573    [6,1,4]          6    [6,1,4]              6  7276'1857 2017-11-22 14:06:02.744851       7276'1857 2017-11-22 14:06:02.744851 
1.bd        437                  0        0         0       0 36415780 1501     1501 active+clean+inconsistent 2017-11-22 14:07:47.460952  7771'1701   7776:10279   [14,6,4]         14   [14,6,4]             14  7771'1701 2017-11-22 14:07:47.460633       7771'1701 2017-11-22 14:07:47.460633 
1.80        513                  0        0         0       0 38691491 1587     1587 active+clean+inconsistent 2017-11-22 13:59:24.362883 7287'16287   7776:44532  [15,10,6]         15  [15,10,6]             15 7287'16287 2017-11-22 13:59:24.361918      7287'16287 2017-11-22 13:59:24.361918 
1.233       514                  0        0         0       0 40154691 1512     1512 active+clean+inconsistent 2017-11-22 14:01:26.236240  7776'1883   7776:21383 [13,10,17]         13 [13,10,17]             13  7776'1883 2017-11-22 14:01:26.235788       7776'1883 2017-11-22 14:01:26.235788 
1.254       491                  0        0         0       0 59784845 1592     1592 active+clean+inconsistent 2017-11-22 14:05:07.920558  7736'1792   7776:16344   [0,18,9]          0   [0,18,9]              0  7736'1792 2017-11-22 14:05:07.918659       7736'1792 2017-11-22 14:05:07.918659 

$ for item in $(sudo ceph pg dump |grep inconsistent |awk '{print $1}') ; do sudo rados list-inconsistent-obj $item ; echo -e ; done
dumped all
{"epoch":7702,"inconsistents":[{"object":{"name":"1000004478f.00000000","nspace":"","locator":"","snap":"head","version":545},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:63ea9188:::1000004478f.00000000:head(305'545 mds.0.218:14115 dirty|data_digest|omap_digest s 527649 uv 545 dd a703c9c1 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":2,"primary":true,"errors":[],"size":527649,"omap_digest":"0xffffffff","data_digest":"0xa703c9c1"},{"osd":10,"primary":false,"errors":["read_error"],"size":527649},{"osd":16,"primary":false,"errors":[],"size":527649,"omap_digest":"0xffffffff","data_digest":"0xa703c9c1"}]}]}
{"epoch":7726,"inconsistents":[{"object":{"name":"1000001e697.00000095","nspace":"","locator":"","snap":"head","version":150},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:85de8c0d:::1000001e697.00000095:head(101'150 client.24357.0:124811 dirty|data_digest|omap_digest s 4194304 uv 150 dd 1c823a35 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":3,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x1c823a35"},{"osd":6,"primary":false,"errors":["read_error"],"size":4194304},{"osd":9,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x1c823a35"}]}]}
{"epoch":7746,"inconsistents":[{"object":{"name":"60000004e8f.00000008","nspace":"","locator":"","snap":"head","version":965},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:f2d14bb1:::60000004e8f.00000008:head(346'965 client.104221.0:1798817 dirty|data_digest|omap_digest s 4194304 uv 965 dd 9dfad556 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":4,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x9dfad556"},{"osd":13,"primary":false,"errors":["read_error"],"size":4194304},{"osd":18,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x9dfad556"}]}]}
{"epoch":7758,"inconsistents":[{"object":{"name":"30000006eb5.0000000b","nspace":"","locator":"","snap":"head","version":1180},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:08df7d63:::30000006eb5.0000000b:head(352'1180 client.54211.0:165182 dirty|data_digest|omap_digest s 4194304 uv 1180 dd 5e336f88 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":4,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x5e336f88"},{"osd":7,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x5e336f88"},{"osd":16,"primary":true,"errors":["read_error"],"size":4194304}]}]}
{"epoch":7702,"inconsistents":[{"object":{"name":"10000059133.0000005d","nspace":"","locator":"","snap":"head","version":1860},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:0f763fe2:::10000059133.0000005d:head(7736'1860 client.24755.0:43639 dirty|data_digest|omap_digest s 4194304 uv 1860 dd dfa8c5c8 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":2,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xdfa8c5c8"},{"osd":6,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xdfa8c5c8"},{"osd":11,"primary":false,"errors":["read_error"],"size":4194304}]}]}
{"epoch":7758,"inconsistents":[{"object":{"name":"1000001e6b4.00000054","nspace":"","locator":"","snap":"head","version":180},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:24adc20a:::1000001e6b4.00000054:head(104'180 client.24357.0:127005 dirty|data_digest|omap_digest s 4194304 uv 180 dd 83c1b193 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":8,"primary":false,"errors":["read_error"],"size":4194304},{"osd":11,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x83c1b193"},{"osd":18,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x83c1b193"}]}]}
{"epoch":7726,"inconsistents":[{"object":{"name":"30000031b2a.000000df","nspace":"","locator":"","snap":"head","version":1727},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:fb3795f1:::30000031b2a.000000df:head(5264'1727 client.54247.0:846 dirty|data_digest|omap_digest s 4194304 uv 1727 dd f4b391b0 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":3,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xf4b391b0"},{"osd":4,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xf4b391b0"},{"osd":9,"primary":true,"errors":["read_error"],"size":4194304}]}]}
{"epoch":5746,"inconsistents":[{"object":{"name":"60000004e8f.0000019f","nspace":"","locator":"","snap":"head","version":1080},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:3b3aa48a:::60000004e8f.0000019f:head(346'1080 client.104221.0:1799645 dirty|data_digest|omap_digest s 4194304 uv 1080 dd 5a5d4280 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":1,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x5a5d4280"},{"osd":4,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x5a5d4280"},{"osd":6,"primary":true,"errors":["read_error"],"size":4194304}]}]}
{"epoch":5746,"inconsistents":[{"object":{"name":"10000059120.00000001","nspace":"","locator":"","snap":"head","version":1426},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:bd1fe198:::10000059120.00000001:head(673'1427 osd.14.0:5186 dirty|data_digest|omap_digest s 4194304 uv 1426 dd 95e2dff6 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":4,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x95e2dff6"},{"osd":6,"primary":false,"errors":["read_error"],"size":4194304},{"osd":14,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x95e2dff6"}]}]}
{"epoch":7377,"inconsistents":[{"object":{"name":"1000005911e.00000044","nspace":"","locator":"","snap":"head","version":705},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:01206230:::1000005911e.00000044:head(654'1522 osd.15.0:505 dirty|data_digest|omap_digest s 4194304 uv 705 dd 582b1f11 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":6,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x582b1f11"},{"osd":10,"primary":false,"errors":["read_error"],"size":4194304},{"osd":15,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x582b1f11"}]}]}
{"epoch":7758,"inconsistents":[{"object":{"name":"3000000937a.00000000","nspace":"","locator":"","snap":"head","version":1224},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:cc785be8:::3000000937a.00000000:head(352'1224 mds.2.161:96962 dirty|data_digest|omap_digest s 4194304 uv 1224 dd a97350b1 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":10,"primary":false,"errors":["read_error"],"size":4194304},{"osd":13,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xa97350b1"},{"osd":17,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xa97350b1"}]}]}
{"epoch":7726,"inconsistents":[{"object":{"name":"1000001e6c6.0000005d","nspace":"","locator":"","snap":"head","version":195},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:2a514ce6:::1000001e6c6.0000005d:head(111'195 client.24357.0:128927 dirty|data_digest|omap_digest s 4194304 uv 195 dd af2b24d2 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":0,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xaf2b24d2"},{"osd":9,"primary":false,"errors":["read_error"],"size":4194304},{"osd":18,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xaf2b24d2"}]}]}

#38 Updated by Brad Hubbard over 6 years ago

Guoqin,

Note: Your issue appears to be different in that you don't seem to have any pgs where all replicas are showing read errors.

Could you run the command Sage posted in comment #14 on one of the OSDs showing a read error as well as the following?

$ ceph report

Could you capture deep scrub logs with "debug_osd 20" and "debug-bluestore 20"?

Please upload the results either by compressing them and attaching them here or by using ceph-post-file.
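
One rough way to capture such a log, with the OSD id and pg id as placeholders:

ceph tell osd.<N> injectargs '--debug_osd 20 --debug_bluestore 20'
ceph pg deep-scrub <pgid>
# collect /var/log/ceph/ceph-osd.<N>.log once the scrub finishes, then restore the defaults
ceph tell osd.<N> injectargs '--debug_osd 1/5 --debug_bluestore 1/5'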

#39 Updated by Greg Farnum over 6 years ago

  • Project changed from RADOS to bluestore

#40 Updated by Sage Weil about 6 years ago

  • Status changed from Need More Info to Resolved

The original bug here is fixed. Meanwhile, Igor is working on a repair function for ceph-bluestore-tool that will correct damaged osds; see https://github.com/ceph/ceph/pull/19843
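
Once that repair mode lands, the invocation is expected to look roughly like the existing fsck (a sketch only; the final interface is whatever the PR settles on):

systemctl stop ceph-osd@<N>
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<N>
systemctl start ceph-osd@<N>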
