Bug #21040
bluestore: multiple objects (clones?) referencing same blocks (on all replicas)
Description
Hi Ceph,
I have been using Ceph Luminous since 12.0.x and it has been running well. But since Luminous 12.1.3 and 12.1.4 I keep getting 1 or 2 pgs in active+clean+inconsistent, and unfortunately there is no official documentation on how to repair this on BlueStore. Some of the inconsistent pgs show read errors on all three replicas, while others report an empty/null inconsistency list, as shown below. Out of frustration, and since this cluster is not in production, after deleting objects one by one and still getting new inconsistencies, I decided to remove all of the inconsistent objects at once.
What is the right way to fix this kind of issue without losing data? How does BlueStore handle HDD bad sectors; shouldn't they be handled automatically, as with the old FileStore? And does BlueStore periodically trim SSDs at runtime to keep write performance optimal?
# ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
pg 1.6 is active+clean+inconsistent, acting [3,4,1]
pg 1.26 is active+clean+inconsistent, acting [1,4,3]
root@ceph:~# rados list-inconsistent-obj 1.6 --format=json-pretty
{
"epoch": 648,
"inconsistents": []
}
root@ceph:~# rados list-inconsistent-obj 1.26
{"epoch":648,"inconsistents":[]}root@ceph:~# rados list-inconsistent-obj 1.26 --format=json-pretty
{
"epoch": 648,
"inconsistents": []
}
root@ceph:~# ceph pg ls inconsistent
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
1.6 1042 0 0 0 0 4270715526 1523 1523 active+clean+inconsistent 2017-08-18 13:16:02.304114 650'540008 650:1063869 [3,4,1] 3 [3,4,1] 3 650'539996 2017-08-18 13:16:02.304074 644'539753 2017-08-18 10:48:55.007583
1.26 1082 0 0 0 0 4424603286 1559 1559 active+clean+inconsistent 2017-08-18 13:15:59.676249 650'463621 650:1073114 [1,4,3] 1 [1,4,3] 1 650'463554 2017-08-18 13:15:59.676214 644'462902 2017-08-18 10:51:48.518118
root@ceph:~# rados ls -p rbd | grep -i rbd_data.196f8574b0dc51.0000000000000a32
rbd_data.196f8574b0dc51.0000000000000a32
root@ceph:~# rados ls -p rbd | grep -i rbd_data.196f8574b0dc51.0000000000000d71
rbd_data.196f8574b0dc51.0000000000000d71
root@proxmox1:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-1/ --pgid 1.6 --op list rbd_data.196f8574b0dc51.0000000000000a32
["1.6",{"oid":"rbd_data.196f8574b0dc51.0000000000000a32","key":"","snapid":-2,"hash":2399151814,"max":0,"pool":1,"namespace":"","max":0}]
root@proxmox2:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-3/ --pgid 1.6 --op list rbd_data.196f8574b0dc51.0000000000000a32
["1.6",{"oid":"rbd_data.196f8574b0dc51.0000000000000a32","key":"","snapid":-2,"hash":2399151814,"max":0,"pool":1,"namespace":"","max":0}]
root@proxmox3:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-4/ --pgid 1.6 --op list rbd_data.196f8574b0dc51.0000000000000a32
["1.6",{"oid":"rbd_data.196f8574b0dc51.0000000000000a32","key":"","snapid":-2,"hash":2399151814,"max":0,"pool":1,"namespace":"","max":0}]
root@proxmox1:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-1/ --pgid 1.26 --op list rbd_data.196f8574b0dc51.0000000000000d71
["1.26",{"oid":"rbd_data.196f8574b0dc51.0000000000000d71","key":"","snapid":-2,"hash":3960199526,"max":0,"pool":1,"namespace":"","max":0}]
root@proxmox2:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-3/ --pgid 1.26 --op list rbd_data.196f8574b0dc51.0000000000000d71
["1.26",{"oid":"rbd_data.196f8574b0dc51.0000000000000d71","key":"","snapid":-2,"hash":3960199526,"max":0,"pool":1,"namespace":"","max":0}]
root@proxmox3:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-4/ --pgid 1.26 --op list rbd_data.196f8574b0dc51.0000000000000d71
["1.26",{"oid":"rbd_data.196f8574b0dc51.0000000000000d71","key":"","snapid":-2,"hash":3960199526,"max":0,"pool":1,"namespace":"","max":0}]
root@proxmox1:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-1/ --pgid 1.6 rbd_data.196f8574b0dc51.0000000000000a32 removeall
root@proxmox1:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-1/ --pgid 1.26 rbd_data.196f8574b0dc51.0000000000000d71 removeall
root@proxmox2:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-3/ --pgid 1.6 rbd_data.196f8574b0dc51.0000000000000a32 removeall
root@proxmox2:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-3/ --pgid 1.26 rbd_data.196f8574b0dc51.0000000000000d71 removeall
root@proxmox3:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-4/ --pgid 1.6 rbd_data.196f8574b0dc51.0000000000000a32 removeall
root@proxmox3:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-4/ --pgid 1.26 rbd_data.196f8574b0dc51.0000000000000d71 removeall
2017-08-18 07:10:07.277119 7f9b955df700 0 log_channel(cluster) log [DBG] : 1.6 repair starts
2017-08-18 07:11:32.119052 7f9b955df700 -1 log_channel(cluster) log [ERR] : 1.6 repair stat mismatch, got 1042/1043 objects, 16/16 clones, 1042/1043 dirty, 1/1 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 4266668678/4270862982 bytes, 0/0 hit_set_archive bytes.
2017-08-18 07:11:32.119138 7f9b955df700 -1 log_channel(cluster) log [ERR] : 1.6 repair 1 errors, 1 fixed
2017-08-18 07:11:32.321422 7ff8d90c9700 0 log_channel(cluster) log [DBG] : 1.26 repair starts
2017-08-18 07:13:01.834640 7ff8d90c9700 -1 log_channel(cluster) log [ERR] : 1.26 repair stat mismatch, got 1081/1082 objects, 24/24 clones, 1081/1082 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 4420490902/4424685206 bytes, 0/0 hit_set_archive bytes.
2017-08-18 07:13:01.834743 7ff8d90c9700 -1 log_channel(cluster) log [ERR] : 1.26 repair 1 errors, 1 fixed
root@ceph:~# ceph health detail
HEALTH_OK
Kind regards,
Charles Alva
History
#1 Updated by Edward Huyer over 6 years ago
I believe I'm also encountering this issue. Here's the root of the mailing list thread discussing my issue: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-August/020204.html
The output of running ceph-bluestore-tool fsck on the primary OSD: https://pastebin.com/nZ0H5ag3
#2 Updated by Brad Hubbard over 6 years ago
- Project changed from Ceph to RADOS
- Priority changed from Normal to High
- Source set to Community (user)
- Component(RADOS) BlueStore added
#4 Updated by Edward Huyer over 6 years ago
Got another inconsistent pg that may or may not be related. It just cropped up overnight. This time "rados list-inconsistent-obj" actually gives output (pasted below). There is nothing in the logs that I can find. A repair corrected it. I didn't think to run a fsck on one of the OSDs until I had already done the repair.
[root@hydra4 ~]# rados list-inconsistent-obj 9.37 --format=json-pretty
{
    "epoch": 68096,
    "inconsistents": [
        {
            "object": {
                "name": "rbd_data.33992ae8944a.0000000000002001",
                "nspace": "",
                "locator": "",
                "snap": 14,
                "version": 1461996
            },
            "errors": [],
            "union_shard_errors": [ "read_error" ],
            "selected_object_info": "9:ecce40ee:::rbd_data.33992ae8944a.0000000000002001:e(68025'2454038 osd.9.0:34400 dirty|data_digest|omap_digest s 4194304 uv 1461996 dd 43d61c5d od ffffffff alloc_hint [0 0 0])",
            "shards": [
                { "osd": 42, "errors": [ "read_error" ], "size": 4194304 },
                { "osd": 51, "errors": [], "size": 4194304, "omap_digest": "0xffffffff", "data_digest": "0x43d61c5d" },
                { "osd": 61, "errors": [ "read_error" ], "size": 4194304 }
            ]
        }
    ]
}
#5 Updated by Brad Hubbard over 6 years ago
I notice trim mentioned. Are both/any of you somehow manually running trim on these devices?
#6 Updated by Charles Alva over 6 years ago
Brad Hubbard wrote:
I notice trim mentioned. Are both/any of you somehow manually running trim on these devices?
Not in the dev environment (Luminous), but yes, we are using NVMe as the Ceph Jewel journal in production.
I use old HDDs (more than 2 years old) to test the Ceph Luminous RC, and almost all of them show raw read errors or bad sectors when Ceph Luminous performs a deep scrub.
That's why I'm asking how BlueStore handles disk bad sectors and SSD trim, and whether BlueStore supports a periodic fstrim at runtime.
#7 Updated by Edward Huyer over 6 years ago
No trim for me, unless RHEL7 is doing something dumb without my knowledge. All my OSDs are 100% spinning rust. The disks are also brand new.
#8 Updated by Edward Huyer over 6 years ago
Also, another inconsistent pg this morning.
[root@hydra4 ~]# rados list-inconsistent-obj 9.26 --format=json-pretty
{
    "epoch": 68115,
    "inconsistents": [
        {
            "object": {
                "name": "rbd_data.33992ae8944a.0000000000002007",
                "nspace": "",
                "locator": "",
                "snap": 14,
                "version": 0
            },
            "errors": [],
            "union_shard_errors": [ "read_error" ],
            "shards": [
                { "osd": 29, "errors": [ "read_error" ], "size": 4194304 },
                { "osd": 47, "errors": [ "read_error" ], "size": 4194304 },
                { "osd": 64, "errors": [ "read_error" ], "size": 4194304 }
            ]
        }
    ]
}
ceph-bluestore-tool fsck on OSD 47: https://pastebin.com/dSG6z3eh
Repair is unsuccessful, and after the fsck and repair the output for list-inconsistent-obj is as follows:
[root@hydra4 ~]# rados list-inconsistent-obj 9.26 --format=json-pretty
{
    "epoch": 68132,
    "inconsistents": []
}
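When many pgs report inconsistencies like this, scanning the JSON by hand gets tedious. As an illustrative aid (this helper is not part of Ceph's tooling), a few lines of Python can summarize which OSDs' shards carry errors; the sample data below is taken from the 9.26 output above:

```python
import json

def shards_with_errors(report):
    """Map each inconsistent object to the list of OSD ids whose shard
    reported errors, given parsed `rados list-inconsistent-obj` output."""
    result = {}
    for inc in report.get("inconsistents", []):
        name = inc["object"]["name"]
        result[name] = [s["osd"] for s in inc["shards"] if s.get("errors")]
    return result

# Sample taken from the pg 9.26 output above.
sample = json.loads("""
{"epoch": 68115,
 "inconsistents": [
   {"object": {"name": "rbd_data.33992ae8944a.0000000000002007",
               "nspace": "", "locator": "", "snap": 14, "version": 0},
    "errors": [],
    "union_shard_errors": ["read_error"],
    "shards": [
      {"osd": 29, "errors": ["read_error"], "size": 4194304},
      {"osd": 47, "errors": ["read_error"], "size": 4194304},
      {"osd": 64, "errors": ["read_error"], "size": 4194304}]}]}
""")
print(shards_with_errors(sample))
```

An object whose every replica appears in the error list (as here, OSDs 29, 47 and 64) is the "read errors on all replicas" case discussed in this ticket.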
#9 Updated by Brad Hubbard over 6 years ago
It looks like this could be a problem with the allocator. I'm looking into whether we can switch allocators as a test.
#10 Updated by Charles Alva over 6 years ago
Brad Hubbard wrote:
It looks like this could be a problem with the allocator. I'm looking into whether we can switch allocators as a test.
Thanks for looking into this, Brad.
Is the data still safe? Or is it really corrupted?
I have experienced some pg inconsistencies with read errors across all OSDs like Edward's as well. Bluestore fsck and ceph pg repair could not fix them.
I had to delete the rbd image which contained the problematic pgs and restore it from backup. Ceph has been running fine since then.
#11 Updated by Brad Hubbard over 6 years ago
Charles Alva wrote:
Is the data still safe? Or is it really corrupted?
Still looking into the answer to this.
#12 Updated by Brad Hubbard over 6 years ago
In order to test whether the bitmap allocator is the culprit here you can add the following to ceph.conf and restart the OSDs.
bluefs_allocator = stupid
This is the default allocator going forward anyway so it would be a good data point to see whether the issue recurs using this allocator.
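For reference, a sketch of how the setting might look in ceph.conf; placing it under [osd] (as opposed to [global]) is only a suggestion, and either should apply to the OSD daemons on restart:

```ini
# /etc/ceph/ceph.conf -- test setting suggested above; remove once the
# bitmap allocator has been ruled out or the default changes.
[osd]
bluefs_allocator = stupid
```

After editing, restart the OSD daemons (for example with systemctl restart ceph-osd.target on systemd-based installs) and confirm the running value with ceph daemon osd.N config show | grep allocator.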
#13 Updated by Charles Alva over 6 years ago
Brad Hubbard wrote:
In order to test whether the bitmap allocator is the culprit here you can add the following to ceph.conf and restart the OSDs.
bluefs_allocator = stupid
This is the default allocator going forward anyway so it would be a good data point to see whether the issue recurs using this allocator.
Noted. Will try this today. Thanks!
#14 Updated by Sage Weil over 6 years ago
- Subject changed from How to fix Ceph Luminous 12.1.4 pg active+clean+inconsistent on Bluestore? to bluestore: multiple objects (clones?) referencing same blocks (on all replicas)
- Status changed from New to Need More Info
The fact that the same range on the object had a bad csum on all 3 replicas suggests a (reproducible) bluestore bug.
If you are able to reproduce this, I would love to see a debug bluestore = 20 log of fsck output:
CEPH_ARGS="--log-file c --debug-bluestore 20" ceph-objectstore-tool --op fsck --data-path /var/lib/ceph/osd/ceph-NNN
Thank you!
#15 Updated by Charles Alva over 6 years ago
Sage Weil wrote:
The fact that the same range on the object had a bad csum on all 3 replicas suggests a (reproducible) bluestore bug.
If you are able to reproduce this, I would love to see a debug bluestore = 20 log of fsck output:
[...]
Thank you!
Thanks, Sage. I will do that when I hit this error again in the future.
Just upgraded to 12.2.0 flawlessly. Congrats!
#16 Updated by Edward Huyer over 6 years ago
I haven't seen any new inconsistencies since I switched to the "stupid" bluefs allocator Sunday night.
I'll see if I can get a debug bluestore = 20 output on one of my existing problem pgs some time soon.
#17 Updated by Brad Hubbard over 6 years ago
- Assignee set to Brad Hubbard
- ceph-qa-suite rados added
Excellent Edward, thanks for the update.
#18 Updated by Nicolas Drufin over 6 years ago
I have the same problem: my Ceph cluster has 3 pgs in active+clean+inconsistent. When I try ceph pg repair, the pg status gains scrub+deep+repair, but after about 5 seconds nothing more happens.
When I run tail -f /var/log/ceph/ceph-osd.4.log I see:
2017-09-04 14:16:38.709553 7ff71cd01700 0 log_channel(cluster) log [INF] : 6.c4 repair starts
2017-09-04 14:16:38.783564 7ff71cd01700 -1 bluestore(/var/lib/ceph/osd/ceph-4) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xe6d994d3, expected 0xca996b60, device location [0x75a160000~1000], logical extent 0x0~1000, object #6:2301a10f:::rbd_data.5bb7174b0dc51.000000000000da0a:17#
2017-09-04 14:16:39.879037 7ff71cd01700 -1 log_channel(cluster) log [ERR] : 6.c4 soid 6:2301a10f:::rbd_data.5bb7174b0dc51.000000000000da0a:17: failed to pick suitable object info
2017-09-04 14:16:58.636751 7ff71cd01700 -1 log_channel(cluster) log [ERR] : 6.c4 repair 2 errors, 0 fixed
When I run rados list-inconsistent-obj 6.c4 --format=json-pretty I get:
{ "epoch": 2295, "inconsistents": [] }
I tried adding bluefs_allocator = stupid to /etc/ceph/ceph.conf (in the common section) and restarting all OSDs on all nodes, but nothing changed. I can run specific fsck commands if that would help resolve this bug.
#19 Updated by Brad Hubbard over 6 years ago
Nicolas,
Could you run the command Sage posted in comment #14?
Changing the allocator will not repair an existing issue, as I understand it, but it should stop further issues from happening, as there is evidence to suggest the bug is in the bitmap allocator.
#20 Updated by Nicolas Drufin over 6 years ago
I have run the command, but the log file is 500 MB. How can I send it to you easily?
#21 Updated by Brad Hubbard over 6 years ago
http://docs.ceph.com/docs/master/man/8/ceph-post-file/
Let us know the identifier (tag) when it completes (and be sure to compress it of course).
#22 Updated by Nicolas Drufin over 6 years ago
You can find it here: ceph-post-file: 258cf93d-a195-4996-af73-dd88aaf858cf
Thanks for your attention.
#23 Updated by Nicolas Drufin over 6 years ago
Since then I have destroyed the OSD and recreated it with FileStore, but the active+clean+inconsistent state persists. In ceph-osd.<osd-num>.log I see a "clone missing" problem, so I used the following command with the <objectid> found in the log file:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<osd-num>/ <objectid> remove-clone-metadata <cloneid>
Note: I do not know why, but the actual <cloneid> differed from the one in the log file, so I first ran the same command with dump instead of remove-clone-metadata <cloneid> to look up the correct id. The OSD needs to be stopped during the operation, and the pg repaired after the OSD is restarted.
#24 Updated by Sage Weil over 6 years ago
Ed, Charles, Nicolas: can you please share which version bluestore was initially deployed with, and also describe your general workload? (RBD? replicated pool? with snapshots?)
Thanks!
#25 Updated by Charles Alva over 6 years ago
Sage Weil wrote:
Ed, Charles, Nicolas: can you please share which version bluestore was initially deployed with, and also describe your general workload? (RBD? replicated pool? with snapshots?)
Thanks!
Hi Sage,
BlueStore was initially deployed with the first 12.1.x RC and upgraded through every minor version up to the 12.1.4 RC, which produced the errors, then upgraded to the latest 12.2.0 stable.
The workload has been mostly RBD from the start, with RBD snapshots invoked occasionally by qemu. Starting with the 12.1.4 RC, I also deployed CephFS to store the VMs' image backups and ISO images.
I commented out "bluefs_allocator = stupid" in ceph.conf when upgrading to 12.2.0 stable, and I haven't encountered any errors since deleting the RBD image and restoring it from a VM image backup 9 days ago.
#26 Updated by Nicolas Drufin over 6 years ago
Sage Weil wrote:
Ed, Charles, Nicolas: can you please share which version bluestore was initially deployed with, and also describe your general workload? (RBD? replicated pool? with snapshots?)
Thanks!
My Ceph version is 12.1.2, and we use it for RBD to store LXC containers with Proxmox. The incidents occurred when we launched backup tasks on large LXC containers (up to 100 GB) with Proxmox, which generated side effects on the Ceph cluster.
Now that I have removed the clone metadata, the Ceph cluster works well.
#27 Updated by Edward Huyer over 6 years ago
Sage Weil wrote:
Ed, Charles, Nicolas: can you please share which version bluestore was initially deployed with, and also describe your general workload? (RBD? replicated pool? with snapshots?)
I started using Bluestore with Luminous 12.1.0. The workload is 100% RBD on replicated pools, with a mixture of kernel driver access and kvm/libvirt access via Proxmox 5. Proxmox is accessing one pool, while kernel access is to two others. Snapshots are an occasional thing, but not on a routine basis.
A few notes: I migrated a substantial amount of data (>60T) from old filestore OSDs to new Bluestore OSDs with no apparent issues. It was only later that the inconsistencies started cropping up. Further, it seems like the inconsistencies only appeared in the pool Proxmox accesses, though that may have simply been because that was the pool with the most traffic at the time.
#28 Updated by Brad Hubbard over 6 years ago
Ed, Charles, Nicolas,
So all three of you are using Proxmox? Could you let us know if you are currently using, or have used, Proxmox ceph packages (as opposed to packages released by the ceph project itself)?
#29 Updated by Nicolas Drufin over 6 years ago
I use ceph with proxmox 5 packages : ceph version 12.1.2 (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc) with pve-manager/5.0-30/5ab26bc (running kernel: 4.10.17-2-pve)
#30 Updated by Charles Alva over 6 years ago
Brad Hubbard wrote:
Ed, Charles, Nicolas,
So all three of you are using Proxmox? Could you let us know if you are currently using, or have used, Proxmox ceph packages (as opposed to packages released by the ceph project itself)?
Hi Brad,
Yes, I'm using Proxmox VE 5.0 (based on Debian 9.1), but with the Ceph upstream packages. I didn't use the Proxmox Ceph packages because we had to deploy CephFS manually and they do not support ceph-deploy.
#31 Updated by Sage Weil over 6 years ago
Ok, unless you've seen new inconsistencies appear since 12.2.0, I think this is from #20983, fixed by d5ba7061ee588c232138af1d880faf09be4adeed, which appeared (backported) in 12.1.4.
We need to build a repair mode for bluestore fsck that can clean up inconsistencies like this (I'm sure we'll get future bugs that do similar damage).
In the meantime, are you folks able to sit tight with the current inconsistencies?
#32 Updated by Charles Alva over 6 years ago
I'm fine with it. Thanks for looking into this, Sage.
#33 Updated by Edward Huyer over 6 years ago
I can sit tight for a while if necessary. That said, if there is a straightforward and reliable way to fix it now (e.g., by destroying and recreating the affected OSDs one at a time), I'd prefer that, just to clear the cluster ERR state and stop the accumulation of old monitor maps.
Either way, thank you all for the help.
#34 Updated by WANG Guoqin over 6 years ago
I get a lot of inconsistent pgs, spread across all 19 of my OSDs, every time I run a deep scrub. Without a deep scrub (or with only a regular scrub) they are not found. Most of them are read_error, but some others show nothing inconsistent in list-inconsistent-obj. (Right now all the errors I can find are read_error.)
There were few at the beginning, and they have become more and more numerous. Deep-scrubbing all OSDs and repairing all the inconsistent pgs seems to fix the problem, but another immediate deep scrub finds more, mostly not the same pgs as the last time. For example, the last deep scrub I did half an hour ago found 43 inconsistent objects, and this time there are 14.
ceph-bluestore-tool fsck doesn't fix the problems. What's worse, sometimes they bring down OSDs. An OSD may suddenly fail to start with:
bluestore(/var/lib/ceph/osd/ceph-13) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, ...
and
osd.13 0 OSD::init() : unable to read osd superblock
ceph-bluestore-tool fsck was not able to fix that either, and the only thing I could do was purge and recreate the OSD. Hopefully two OSDs never fail at the same time so this doesn't result in data loss, but I'm not sure about that.
The cluster was rebuilt after rescuing data from my previous cluster, which was created quite a long time ago. These errors also happened on the old cluster, so I decided to tar all the files from it, create a new cluster, and then extract the files. The new cluster was created a week ago with 12.2.1; by that time the stupid allocator should have been the default, I think.
I'm using CephFS on the cluster, with multiple MDS daemons, and I'm not using Proxmox.
#35 Updated by Brad Hubbard over 6 years ago
Guoqin,
Could you run the command Sage posted in comment #14 as well as the following?
ceph daemon osd.N config show|grep allocator
Where 'N' is the number of one of the osds local to the machine.
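When checking many daemons, the JSON that config show emits can also be filtered programmatically instead of with grep. An illustrative sketch (the sample input below is hypothetical and abbreviated; the real output is much larger):

```python
import json

def allocator_settings(config_show_json):
    """Return only the allocator-related keys from the JSON emitted by
    `ceph daemon osd.N config show`."""
    cfg = json.loads(config_show_json)
    return {k: v for k, v in cfg.items() if "allocator" in k}

# Hypothetical, abbreviated sample of the real output.
sample = '{"bluefs_allocator": "stupid", "bluestore_allocator": "stupid", "osd_max_backfills": "1"}'
print(allocator_settings(sample))
```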
#36 Updated by WANG Guoqin over 6 years ago
Brad Hubbard wrote:
Guoqin,
Could you run the command Sage posted in comment #14 as well as the following?
[...]
Where 'N' is the number of one of the osds local to the machine.
$ sudo ceph daemon osd.9 config show |grep allocator
"bluefs_allocator": "stupid",
"bluestore_allocator": "stupid",
"bluestore_bitmapallocator_blocks_per_zone": "1024",
"bluestore_bitmapallocator_span_size": "1024",
#37 Updated by WANG Guoqin over 6 years ago
And some other results,
$ sudo ceph pg dump |grep inconsistent
dumped all
1.3c6 423 0 0 0 0 19757162 1507 1507 active+clean+inconsistent 2017-11-22 14:12:27.889745 7289'1725 7776:19232 [2,10,16] 2 [2,10,16] 2 7289'1725 2017-11-22 14:12:27.889057 7289'1725 2017-11-22 14:12:27.889057
1.3a1 467 0 0 0 0 54700693 1559 1559 active+clean+inconsistent 2017-11-22 14:04:34.226580 7736'1659 7776:10068 [3,6,9] 3 [3,6,9] 3 7736'1659 2017-11-22 14:04:34.224386 7736'1659 2017-11-22 14:04:34.224386
1.34f 470 0 0 0 0 19345117 1532 1532 active+clean+inconsistent 2017-11-22 13:59:48.431118 7283'1732 7776:19062 [4,18,13] 4 [4,18,13] 4 7283'1732 2017-11-22 13:59:48.430295 7283'1732 2017-11-22 13:59:48.430295
1.310 492 0 0 0 0 43605779 1514 1514 active+clean+inconsistent 2017-11-22 14:11:42.216931 7279'1814 7776:11269 [16,7,4] 16 [16,7,4] 16 7279'1814 2017-11-22 14:11:42.215517 7279'1814 2017-11-22 14:11:42.215517
1.2f0 523 0 0 0 0 38966027 1510 1510 active+clean+inconsistent 2017-11-22 14:05:29.599383 7736'1860 7776:15443 [2,11,6] 2 [2,11,6] 2 7736'1860 2017-11-22 14:05:29.598908 7736'1860 2017-11-22 14:05:29.598908
1.124 464 0 0 0 0 51081711 1526 1526 active+clean+inconsistent 2017-11-22 14:04:23.364628 7776'1671 7776:16745 [11,8,18] 11 [11,8,18] 11 7776'1671 2017-11-22 14:04:23.363316 7776'1671 2017-11-22 14:04:23.363316
1.df 470 0 0 0 0 24753615 1538 1538 active+clean+inconsistent 2017-11-22 14:03:03.474101 7287'1738 7776:25925 [9,4,3] 9 [9,4,3] 9 7287'1738 2017-11-22 14:03:03.473501 7287'1738 2017-11-22 14:03:03.473501
1.dc 468 0 0 0 0 50180203 1557 1557 active+clean+inconsistent 2017-11-22 14:06:02.745203 7276'1857 7776:16573 [6,1,4] 6 [6,1,4] 6 7276'1857 2017-11-22 14:06:02.744851 7276'1857 2017-11-22 14:06:02.744851
1.bd 437 0 0 0 0 36415780 1501 1501 active+clean+inconsistent 2017-11-22 14:07:47.460952 7771'1701 7776:10279 [14,6,4] 14 [14,6,4] 14 7771'1701 2017-11-22 14:07:47.460633 7771'1701 2017-11-22 14:07:47.460633
1.80 513 0 0 0 0 38691491 1587 1587 active+clean+inconsistent 2017-11-22 13:59:24.362883 7287'16287 7776:44532 [15,10,6] 15 [15,10,6] 15 7287'16287 2017-11-22 13:59:24.361918 7287'16287 2017-11-22 13:59:24.361918
1.233 514 0 0 0 0 40154691 1512 1512 active+clean+inconsistent 2017-11-22 14:01:26.236240 7776'1883 7776:21383 [13,10,17] 13 [13,10,17] 13 7776'1883 2017-11-22 14:01:26.235788 7776'1883 2017-11-22 14:01:26.235788
1.254 491 0 0 0 0 59784845 1592 1592 active+clean+inconsistent 2017-11-22 14:05:07.920558 7736'1792 7776:16344 [0,18,9] 0 [0,18,9] 0 7736'1792 2017-11-22 14:05:07.918659 7736'1792 2017-11-22 14:05:07.918659
$ for item in $(sudo ceph pg dump |grep inconsistent |awk '{print $1}') ; do sudo rados list-inconsistent-obj $item ; echo -e ; done
dumped all
{"epoch":7702,"inconsistents":[{"object":{"name":"1000004478f.00000000","nspace":"","locator":"","snap":"head","version":545},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:63ea9188:::1000004478f.00000000:head(305'545 mds.0.218:14115 dirty|data_digest|omap_digest s 527649 uv 545 dd a703c9c1 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":2,"primary":true,"errors":[],"size":527649,"omap_digest":"0xffffffff","data_digest":"0xa703c9c1"},{"osd":10,"primary":false,"errors":["read_error"],"size":527649},{"osd":16,"primary":false,"errors":[],"size":527649,"omap_digest":"0xffffffff","data_digest":"0xa703c9c1"}]}]}
{"epoch":7726,"inconsistents":[{"object":{"name":"1000001e697.00000095","nspace":"","locator":"","snap":"head","version":150},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:85de8c0d:::1000001e697.00000095:head(101'150 client.24357.0:124811 dirty|data_digest|omap_digest s 4194304 uv 150 dd 1c823a35 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":3,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x1c823a35"},{"osd":6,"primary":false,"errors":["read_error"],"size":4194304},{"osd":9,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x1c823a35"}]}]}
{"epoch":7746,"inconsistents":[{"object":{"name":"60000004e8f.00000008","nspace":"","locator":"","snap":"head","version":965},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:f2d14bb1:::60000004e8f.00000008:head(346'965 client.104221.0:1798817 dirty|data_digest|omap_digest s 4194304 uv 965 dd 9dfad556 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":4,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x9dfad556"},{"osd":13,"primary":false,"errors":["read_error"],"size":4194304},{"osd":18,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x9dfad556"}]}]}
{"epoch":7758,"inconsistents":[{"object":{"name":"30000006eb5.0000000b","nspace":"","locator":"","snap":"head","version":1180},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:08df7d63:::30000006eb5.0000000b:head(352'1180 client.54211.0:165182 dirty|data_digest|omap_digest s 4194304 uv 1180 dd 5e336f88 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":4,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x5e336f88"},{"osd":7,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x5e336f88"},{"osd":16,"primary":true,"errors":["read_error"],"size":4194304}]}]}
{"epoch":7702,"inconsistents":[{"object":{"name":"10000059133.0000005d","nspace":"","locator":"","snap":"head","version":1860},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:0f763fe2:::10000059133.0000005d:head(7736'1860 client.24755.0:43639 dirty|data_digest|omap_digest s 4194304 uv 1860 dd dfa8c5c8 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":2,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xdfa8c5c8"},{"osd":6,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xdfa8c5c8"},{"osd":11,"primary":false,"errors":["read_error"],"size":4194304}]}]}
{"epoch":7758,"inconsistents":[{"object":{"name":"1000001e6b4.00000054","nspace":"","locator":"","snap":"head","version":180},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:24adc20a:::1000001e6b4.00000054:head(104'180 client.24357.0:127005 dirty|data_digest|omap_digest s 4194304 uv 180 dd 83c1b193 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":8,"primary":false,"errors":["read_error"],"size":4194304},{"osd":11,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x83c1b193"},{"osd":18,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x83c1b193"}]}]}
{"epoch":7726,"inconsistents":[{"object":{"name":"30000031b2a.000000df","nspace":"","locator":"","snap":"head","version":1727},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:fb3795f1:::30000031b2a.000000df:head(5264'1727 client.54247.0:846 dirty|data_digest|omap_digest s 4194304 uv 1727 dd f4b391b0 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":3,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xf4b391b0"},{"osd":4,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xf4b391b0"},{"osd":9,"primary":true,"errors":["read_error"],"size":4194304}]}]}
{"epoch":5746,"inconsistents":[{"object":{"name":"60000004e8f.0000019f","nspace":"","locator":"","snap":"head","version":1080},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:3b3aa48a:::60000004e8f.0000019f:head(346'1080 client.104221.0:1799645 dirty|data_digest|omap_digest s 4194304 uv 1080 dd 5a5d4280 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":1,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x5a5d4280"},{"osd":4,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x5a5d4280"},{"osd":6,"primary":true,"errors":["read_error"],"size":4194304}]}]}
{"epoch":5746,"inconsistents":[{"object":{"name":"10000059120.00000001","nspace":"","locator":"","snap":"head","version":1426},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:bd1fe198:::10000059120.00000001:head(673'1427 osd.14.0:5186 dirty|data_digest|omap_digest s 4194304 uv 1426 dd 95e2dff6 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":4,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x95e2dff6"},{"osd":6,"primary":false,"errors":["read_error"],"size":4194304},{"osd":14,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x95e2dff6"}]}]}
{"epoch":7377,"inconsistents":[{"object":{"name":"1000005911e.00000044","nspace":"","locator":"","snap":"head","version":705},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:01206230:::1000005911e.00000044:head(654'1522 osd.15.0:505 dirty|data_digest|omap_digest s 4194304 uv 705 dd 582b1f11 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":6,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x582b1f11"},{"osd":10,"primary":false,"errors":["read_error"],"size":4194304},{"osd":15,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0x582b1f11"}]}]}
{"epoch":7758,"inconsistents":[{"object":{"name":"3000000937a.00000000","nspace":"","locator":"","snap":"head","version":1224},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:cc785be8:::3000000937a.00000000:head(352'1224 mds.2.161:96962 dirty|data_digest|omap_digest s 4194304 uv 1224 dd a97350b1 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":10,"primary":false,"errors":["read_error"],"size":4194304},{"osd":13,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xa97350b1"},{"osd":17,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xa97350b1"}]}]}
{"epoch":7726,"inconsistents":[{"object":{"name":"1000001e6c6.0000005d","nspace":"","locator":"","snap":"head","version":195},"errors":[],"union_shard_errors":["read_error"],"selected_object_info":"1:2a514ce6:::1000001e6c6.0000005d:head(111'195 client.24357.0:128927 dirty|data_digest|omap_digest s 4194304 uv 195 dd af2b24d2 od ffffffff alloc_hint [0 0 0])","shards":[{"osd":0,"primary":true,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xaf2b24d2"},{"osd":9,"primary":false,"errors":["read_error"],"size":4194304},{"osd":18,"primary":false,"errors":[],"size":4194304,"omap_digest":"0xffffffff","data_digest":"0xaf2b24d2"}]}]}
#38 Updated by Brad Hubbard over 6 years ago
Guoqin,
Note: Your issue appears to be different in that you don't seem to have any pgs where all replicas are showing read errors.
Could you run the command Sage posted in comment #14 on one of the OSDs showing a read error as well as the following?
$ ceph report
Could you capture deep scrub logs with "debug_osd 20" and "debug-bluestore 20"?
Please upload the results, either by compressing them and attaching them here or by using ceph-post-file.
#39 Updated by Greg Farnum over 6 years ago
- Project changed from RADOS to bluestore
#40 Updated by Sage Weil about 6 years ago
- Status changed from Need More Info to Resolved
The original bug here is fixed. Meanwhile, Igor is working on a repair function for ceph-bluestore-tool that will correct damaged osds; see https://github.com/ceph/ceph/pull/19843