Bug #58707

open

rbd snapshot corruption, likely due to snaptrim

Added by Roel van Meer about 1 year ago. Updated about 2 months ago.

Status:
Need More Info
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
rbd snaptrim
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Dear maintainers,

We have one Ceph pool where rbd snapshots are being corrupted. This happens within hours of the snapshot creation. Corruption does not happen when snaptrim is disabled.

We're using Proxmox 7 with their Ceph packages. The problem existed with Ceph Octopus (15.2.17), and still exists with Pacific (16.2.9).

The corruption was first detected by VMs not being able to boot, or displaying severe filesystem corruption after reverting to a snapshot, and later confirmed by doing a sha256sum on a mapped snapshot periodically. Comparison of a copy of the original snapshot and a copy of a corrupted snapshot showed that several 4MB chunks that had data in the original snapshot had only zeroes in the corrupted snapshot.

This is reproducible for us, but only on the one pool of one Ceph cluster (we manage several).

Our reproduction is simple:

- Create a VM with a 10GB disk in Proxmox
- Do a standard Debian install
- Shutdown the VM
- Create a snapshot
- Start the VM

Now, run a script like this, every 10 minutes:

dev=$(rbd -p ssd map vm-9707-disk-0@snapshot)
sum=$(sha256sum $dev)
logger "Checksum of vm-9707-disk-0@snapshot is $sum"
rbd -p ssd unmap vm-9707-disk-0@snapshot
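
For example, the script can be scheduled every 10 minutes via cron (a sketch; the script path and cron.d placement are placeholders):

*/10 * * * * root /usr/local/bin/check-snapshot-checksum.sh   # e.g. in /etc/cron.d/snapshot-checksum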

Result: The logged checksum changes. Sometimes the change happens after one or two hours, sometimes it has already changed after 10 minutes.

While running Ceph with 'nosnaptrim' for about 8 hours, no changes occurred in the snapshot's checksum.

The cluster health is OK, and no messages that seem strange to me have shown up in the logs, although we might need to run with a higher logging level to get any useful information.

This is a production cluster, so I'm limited in what I can do here, but I'm happy and willing to provide any and all information necessary to help figure this out.


Files

ceph-df.txt (736 Bytes) ceph-df.txt ceph df output Roel van Meer, 02/13/2023 01:50 PM
ceph-status.txt (855 Bytes) ceph-status.txt ceph status output Roel van Meer, 02/13/2023 01:50 PM
rbd-du.txt (35.3 KB) rbd-du.txt rbd -p ssd du - output Roel van Meer, 02/14/2023 07:05 AM
snapshot-checksum.txt (78.8 KB) snapshot-checksum.txt Checksum of vm-9707-disk-0@snapshot over time Roel van Meer, 02/14/2023 07:05 AM
rbd-info-vm-9707-disk-0.txt (406 Bytes) rbd-info-vm-9707-disk-0.txt Output of: rbd -p ssd info vm-9707-disk-0 Roel van Meer, 02/14/2023 10:56 PM
rbd-info-vm-9707-disk-0_snapshot.txt (424 Bytes) rbd-info-vm-9707-disk-0_snapshot.txt Output of: rbd -p ssd info vm-9707-disk-0@snapshot Roel van Meer, 02/14/2023 10:56 PM
9718-info.txt (856 Bytes) 9718-info.txt Roel van Meer, 02/17/2023 07:21 AM
pve09-journal.txt (615 KB) pve09-journal.txt Kernel log from the hypervisor Roel van Meer, 02/23/2023 09:48 PM
9721-info.txt (91.5 KB) 9721-info.txt System info on reproducer 9721 Roel van Meer, 02/23/2023 10:05 PM
roel-disk-4-snapshot-create.log (103 KB) roel-disk-4-snapshot-create.log Debug log of rbd snap create of roel-disk-4@snapshot Roel van Meer, 03/13/2023 01:09 PM
test-NBD-roel-disk-6.txt (3.88 KB) test-NBD-roel-disk-6.txt Documented NBD test Roel van Meer, 03/14/2023 03:00 PM
roel-disk-8.tar.gz (5.11 KB) roel-disk-8.tar.gz Output and scripts of per-block test with listsnaps output Roel van Meer, 03/14/2023 04:23 PM
ceph-osd-pool-ls-detail.txt (2.06 KB) ceph-osd-pool-ls-detail.txt Roel van Meer, 03/17/2023 01:15 PM
missing-blocks-pgs.txt (97.3 KB) missing-blocks-pgs.txt Roel van Meer, 03/20/2023 04:02 PM
ceph-osd.56.log-20230404.txt.gz (130 KB) ceph-osd.56.log-20230404.txt.gz OSD 56 log Roel van Meer, 04/04/2023 06:38 AM
ceph-osd.56.log.txt.gz (159 KB) ceph-osd.56.log.txt.gz OSD 56 log with additional snapmapper debug Roel van Meer, 04/05/2023 07:00 AM
purged-snaps-omap.tar.gz (218 KB) purged-snaps-omap.tar.gz Roel van Meer, 04/25/2023 01:45 PM
Actions #1

Updated by Ilya Dryomov about 1 year ago

  • Status changed from New to Need More Info
  • Target version deleted (v16.2.12)

Hi Roel,

After starting the VM, what sort of workload is running there? What is the rate of change for the backing image?

Is the snapshot in question (the one that is being checksummed) the only snapshot for that image? If not, how many are there? Are these other snapshots being created and/or deleted in parallel?

What causes snaptrim? How many snapshots are being deleted in total? Is there some kind of schedule?

Actions #2

Updated by Roel van Meer about 1 year ago

Hi Ilya,

This is a plain, freshly installed Debian 11 that does nothing. So the workload is some logging, that's all.
It is not specific to the VM, because new VMs that we set up have the same issue.

Yes, the snapshot is the only snapshot on the image. We haven't checked if making multiple snapshots changes anything about the problem.

I'm sorry, but I don't know enough about snaptrim to answer your question. I just found out that running the cluster with noscrub and nodeep-scrub did not stop the corruption, but running the cluster with nosnaptrim did (or at least it did for so long that I assume that it does).

Are there any specific commands that I can run to give you the information you need, regarding snaptrim?

Actions #3

Updated by Ilya Dryomov about 1 year ago

The attached "ceph status" output doesn't show any PGs in snaptrim state. "snaptrim" is short for snapshot trimming. Some snapshots must be created for it to occur.

If this is a fresh image with a single snapshot created on it, most likely there are many other images with snapshots and I suspect that snapshots are being created and deleted in the background (since you said this was a production cluster). I'm trying to understand the scale of that -- how many snapshots there are in the system in total, how many might be getting created or deleted at roughly the same time, etc. The question is not about snaptrim specifics but about the workload and everything that is going on with the cluster on the RBD side from a user/admin POV.

Actions #4

Updated by Roel van Meer about 1 year ago

Good morning Ilya,

There's not much going on in this pool with regards to snapshots. There are a few snapshots that have been created a long while ago, and maybe 15 or so related to our testing. Apart from that, no snapshots are created or removed. In total there are 615 RBD images, and 28 snapshots. We're not using snapshots here in production.
I've attached the output of 'rbd du'.

Maybe my conclusion about snaptrim being involved was premature. I've also attached the checksum log that we're making of one snapshot of one VM, with the small script that I put in the first message. In that log you can see that the snapshot doesn't change between Feb 12 08:20 and Feb 12 17:00. On Feb 12, I set nosnaptrim on the cluster at around 09:00, and I unset it again at 17:00. Hence my conclusion. But of course it could be a coincidence or have an effect via a completely different route.

Actions #5

Updated by Roel van Meer about 1 year ago

Something that might or might not be relevant to mention is the history of this cluster. IIRC it was installed with Nautilus. During the upgrade to Octopus, we were bitten by two Bugs (#51619 and #51682). The helpful people of Croit.io assisted us with recovery, and at that time the pool that is now exhibiting these issues was created anew. One of the things that was important then was that we had high pool ID numbers. The affected pool still does, and the device_health_metrics pool does as well (ids 15213 and 15113).

Actions #6

Updated by Ilya Dryomov about 1 year ago

Roel van Meer wrote:

Good morning Ilya,

There's not much going on in this pool with regards to snapshots. There are a few snapshots that have been created a long while ago, and maybe 15 or so related to our testing. Apart from that, no snapshots are created or removed. In total there are 615 RBD images, and 28 snapshots. We're not using snapshots here in production.
I've attached the output of 'rbd du'.

Hi Roel,

Your reproduction steps mention starting the VM:

- What driver does the VM use to access the image -- librbd (the image is opened directly by QEMU) or krbd (/dev/rbdXYZ device is passed to QEMU)?

- Have you tried skipping starting the VM -- creating a snapshot of a powered off VM and then monitoring it the same way? If not, can you try that?

- Can you try amending the monitoring script so that the snapshot is mapped just once, immediately after it's taken and never unmapped? The iteration would consist of only computing the checksum.

- Either separately or as part of the previous amendment, can you try changing "sha256sum $dev" to something like "dd if=$dev iflag=direct bs=4M | sha256sum"?

For all of these suggestions, a single occurrence of a flipping checksum would be enough to rule out at least one layer of the stack.
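
Combining the last two suggestions, the amended monitoring could look roughly like this (a sketch only, reusing the names from the original script):

dev=$(rbd -p ssd map vm-9707-disk-0@snapshot)   # map once, right after the snapshot is taken
while true; do
    sum=$(dd if=$dev iflag=direct bs=4M status=none | sha256sum)
    logger "Checksum of vm-9707-disk-0@snapshot is $sum"
    sleep 600   # never unmap between iterations
done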

Maybe my conclusion about snaptrim being involved was premature. I've also attached the checksum log that we're making of one snapshot of one VM, with the small script that I put in the first message. In that log you can see that the snapshot doesn't change between Feb 12 08:20 and Feb 12 17:00. On Feb 12, I set nosnaptrim on the cluster at around 09:00, and I unset it again at 17:00. Hence my conclusion. But of course it could be a coincidence or have an effect via a completely different route.

Yeah, while that span is the longest, there are others, quite long as well. I did some awking on the attached log file, here is the top five:

   53   3376ec94dfecd32914194cfd6c254f786ebf9ce115dbd0b8eb97aa72bdacda08    Feb 12 08:20:57 - Feb 12 17:00:57
   37   2698191ab490f19fe3ea0daafc81196b9421e3645ff84d21f1b7ee2c44d19aaa    Feb 10 23:20:57 - Feb 11 05:20:57
   26   2d1eb5430a195deb4b23b680d5f77dd0224093b38de1623ecd8042c6a3484a6a    Feb 12 17:10:57 - Feb 12 21:20:58
   23   42653244d62294cb64ccdf8b9c57be971935bfe29320bd36d187db887713144d    Feb 11 13:40:57 - Feb 11 17:20:55
   18   862d9a247689d76c089793b4a32eb0796aec855b1962d2252ed9e4d12e7866f8    Feb 12 21:30:57 - Feb 13 00:20:58

The second longest lasted six hours, the third - four hours, so nosnaptrim could be a coincidence. Can you try setting it for a longer period of time and seeing if you get absolute "control" over these spans in the log that way?

Actions #7

Updated by Ilya Dryomov about 1 year ago

Roel van Meer wrote:

Something that might or might not be relevant to mention is the history of this cluster. IIRC it was installed with Nautilus. During the upgrade to Octopus, we were bitten by two Bugs (#51619 and #51682). The helpful people of Croit.io assisted us with recovery, and at that time the pool that is now exhibiting these issues was created anew. One of the things that was important then was that we had high pool ID numbers. The affected pool still does, and the device_health_metrics pool does as well (ids 15213 and 15113).

I doubt that the pool IDs matter but erroneously removed SharedBlob entries (https://tracker.ceph.com/issues/51619) certainly sounds like something that would break snapshots. Given that

This is reproducible for us, but only on the one pool of one Ceph cluster (we manage several).

the recovery is highly suspect. What did it entail?

What exactly do you mean by the pool being "created anew"? Was there some kind of import into this pool or all images listed in the attached "rbd du" output were created post-upgrade and recovery?

Can you attach the output of "rbd info vm-9707-disk-0" and "rbd info vm-9707-disk-0@snapshot"?

Actions #8

Updated by Roel van Meer about 1 year ago

Ilya Dryomov wrote:

- What driver does the VM use to access the image

That would be krbd.

As for the other questions, I'll do the tests that you specified. It might take me a day or two to get back to you though.

In any case, thanks a lot for your time and thinking with us!

Actions #9

Updated by Ilya Dryomov about 1 year ago

Roel van Meer wrote:

That would be krbd.

What is the kernel version on the node(s) where RBD devices are mapped?

Actions #10

Updated by Roel van Meer about 1 year ago

Ilya Dryomov wrote:

I doubt that the pool IDs matter but erroneously removed SharedBlob entries (https://tracker.ceph.com/issues/51619) certainly sounds like something that would break snapshots. Given that

This is reproducible for us, but only on the one pool of one Ceph cluster (we manage several).

the recovery is highly suspect. What did it entail?

Recovery was done by moving all VM images from Ceph storage to local storage via the Proxmox web interface (I am unsure what mechanism Proxmox uses to do that). Then the pool was destroyed, a new pool was created, and the VM images were moved back after having run some VMs off the new pool for a number of days without issues.
What I forgot to mention is that this was about one and a half years ago. All snapshots that we have on this pool have been made after that time. Also, the VMs that we're using as reproducers were created only last week.

What exactly do you mean by the pool being "created anew"? Was there some kind of import into this pool or all images listed in the attached "rbd du" output were created post-upgrade and recovery?

No, definitely not. I think part of the images (maybe half?) was created pre-upgrade (so while on Nautilus) and subjected to the temporary move to local storage as described above. Most of the other half will have been created on Octopus. The latest reproducers were created on Pacific.

Can you attach the output of "rbd info vm-9707-disk-0" and "rbd info vm-9707-disk-0@snapshot"?

Yes, done.

Actions #11

Updated by Roel van Meer about 1 year ago

What is the kernel version on the node(s) where RBD devices are mapped?

Linux pve08-dlf 5.15.83-1-pve #1 SMP PVE 5.15.83-1 (2022-12-15T00:00Z) x86_64 GNU/Linux

This is a Proxmox kernel. We've probably seen the issue with earlier kernels too, although I couldn't tell you which ones at this point. I can do some digging if that's useful.

Actions #12

Updated by Roel van Meer about 1 year ago

I did four new tests today, with four VMs, all started at approx the same time.

9708: Our standard reproduction: checksum changed after 1h20
9709: VM not started: No checksum change after 10h and counting
9710: Snapshot mapped once, not unmapped: checksum changed after 0h45
9711: rbd device read with dd, 4MB blocks and direct io (with map/unmap): checksum changed after 0h45

Times are accurate to within about 10 minutes, of course.

Tomorrow I'll run a set of tests while having nosnaptrim enabled.

Actions #13

Updated by Ilya Dryomov about 1 year ago

Roel van Meer wrote:

9711: rbd device read with dd, 4MB blocks and direct io (with map/unmap): checksum changed after 0h45

For completeness, can you repeat this one with snapshot being mapped just once?

Actions #14

Updated by Roel van Meer about 1 year ago

Ilya Dryomov wrote:

For completeness, can you repeat this one with snapshot being mapped just once?

Yes, will do. By the way, which is the preferred method of checksumming for future tests?

Actions #15

Updated by Roel van Meer about 1 year ago

Also, the VMs are configured with VirtIO, writeback cache, and discard on.
Would it be useful to see if any of these have influence on the problem? If so, which ones would be most useful to test first?

Actions #16

Updated by Roel van Meer about 1 year ago

For completeness, can you repeat this one with snapshot being mapped just once?

9712: rbd device read with dd, 4MB blocks and direct io, mapped once: checksum changed after 2h30

Actions #17

Updated by Ilya Dryomov about 1 year ago

Roel van Meer wrote:

Also, the VMs are configured with VirtIO, writeback cache, and discard on.
Would it be useful to see if any of these have influence on the problem? If so, which ones would be most useful to test first?

Possibly, but I'd like to try a few things on the RBD side first.

I'm guessing that the "VM not started: No checksum change after 10h and counting" test continues to run (i.e. hasn't failed)?

If so, can you amend the image creation step to disable the object-map image feature and repeat any of the reproducers, preferably the one labeled 9712?

$ rbd create --size 10G <img>
$ rbd feature disable <img> object-map  # or some Proxmox UI equivalent
<proceed with creating a VM, doing a standard Debian install, etc as per your steps>
Actions #18

Updated by Roel van Meer about 1 year ago

Ilya Dryomov wrote:

I'm guessing that the "VM not started: No checksum change after 10h and counting" test continues to run (i.e. hasn't failed)?

It ran for 22 hours without failure. After that I stopped the test.

If so, can you amend the image creation step to disable the object-map image feature and repeat any of the reproducers, preferably the one labeled 9712?

9716: Mapped once, read with dd directio, object-map disabled: checksum changed after 0h50, but INVALID (something has enabled the object-map again, as checked by rbd info)

I'll have to try this one again. Not sure how or where the object-map got enabled again.

Actions #19

Updated by Roel van Meer about 1 year ago

If so, can you amend the image creation step to disable the object-map image feature and repeat any of the reproducers, preferably the one labeled 9712?

This time we did it correctly:

9718: Mapped once, read with dd directio, object-map disabled: checksum changed after 2h50

Note to future self: starting a VM from the Proxmox UI enables the object-map feature on the rbd image.

Actions #20

Updated by Roel van Meer about 1 year ago

Hi Ilya,

I hope you had a good weekend!
Are there any other things I can test in order to help determine the root cause of this?

Best regards, Roel

Actions #21

Updated by Ilya Dryomov about 1 year ago

Hi Roel,

So far I see only one variable that contributes to the reproducer: whether the VM is running. Given the nature of the corruption ("several 4MB chunks that had data in the original snapshot had only zeroes in the corrupted snapshot"), I suspected the object-map image feature, but per the "9718" test it appears not to be involved.

Let's try to hone in on the VM. You mentioned that the workload is "some logging". Can you be more specific?

- Is the RBD-backed device used for the root filesystem?
- How is the filesystem on top of the RBD-backed device mounted? Is discard mount option present?
- What is the workload exactly?
- Did you attempt to correlate the time it takes for the corruption to occur (sometimes less than an hour, sometimes many hours) with what is going on in the VM? Perhaps the workload changes or something is invoking e.g. fstrim?

You also said you were going to run a set of tests with nosnaptrim to confirm or deny that hypothesis. Did that happen?

Can you attach the kernel log ("journalctl -k") from the hypervisor?

Actions #22

Updated by Roel van Meer about 1 year ago

Ilya Dryomov wrote:

You also said you were going to run a set of tests with nosnaptrim to confirm or deny that hypothesis. Did that happen?

I did so yesterday. I've started four VMs on two different hypervisors. Nosnaptrim was set. They ran all day (08:00-21:30) without corruption in their snapshot. I unset nosnaptrim at 21:30, and at 21:40 all four of the snapshots were corrupted.

That makes it very likely that snaptrim really is involved somehow.

Actions #23

Updated by Roel van Meer about 1 year ago

I also did two more tests with different settings on the disk in Proxmox:

9720: Disk configured without discard option: Corruption after 0h40
9721: Disk configured without cache option: Corruption after 5h00

It surprised me that the no-cache reproducer took so long to trigger. I haven't looked at that further, so I don't know if it's relevant or just a coincidence.

Actions #24

Updated by Roel van Meer about 1 year ago

Ilya Dryomov wrote:

Can you attach the kernel log ("journalctl -k") from the hypervisor?

Done

Actions #25

Updated by Roel van Meer about 1 year ago

Ilya Dryomov wrote:

Let's try to hone in on the VM. You mentioned that the workload is "some logging". Can you be more specific?

Not really, because that's really all that the VMs do. :)
These are minimally installed Debian VMs. They aren't used by anyone, just started.
Cron is running, but no jobs in cron.hourly, and the issue happens so quickly after starting that it isn't related to any tasks in cron.daily.

- Is the RBD-backed device used for the root filesystem?

Yes.

- How is the filesystem on top of the RBD-backed device mounted? Is discard mount option present?

This is ext4, mounted with rw,relatime,errors=remount-ro

- What is the workload exactly?

Well, nothing. It's just sitting being idle. It runs sshd, logger, a non-functional salt minion, and it's not being used.
I've attached a file with some information about the VM: mount info, running processes, journal.

- Did you attempt to correlate the time it takes for the corruption to occur (sometimes less than an hour, sometimes many hours) with what is going on in the VM? Perhaps the workload changes or something is invoking e.g. fstrim?

Yes, but I couldn't find anything.

What I can (and will) do, is try to reproduce the problem with an even simpler VM, e.g. boot it with /bin/bash as init. Doing that doesn't issue any writes to the disk. I can also try to dismantle the VM up to the point where the reproduction fails. Maybe that tells us something. (I suspect we need writes to the disk for the problem to occur though.)

Actions #26

Updated by Roel van Meer about 1 year ago

Coming back to the nosnaptrim result, is it possible to increase debug logging in such a way that we can know what is being done during snapshot trimming (or any of the processes that are disabled when running with nosnaptrim)? Also, is it possible to enable this logging without impacting performance in a major way? Some impact is expected and allowed of course, but the system will have to remain usable for our customers.

I'm thinking along the following lines:
- Enable nosnaptrim on the cluster
- Start up several reproducers
- Leave them running for a few hours
- Increase debug logging of the subsystems involved
- Unset nosnaptrim
- Wait for issue (which happened within 10 minutes last time)
- Disable debug logging
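
For the debug-logging steps above, runtime injection should avoid restarting the OSDs; a sketch (the subsystems and exact levels to use would need to be confirmed by the developers):

# raise the OSD debug level on all OSDs (level is illustrative)
ceph tell 'osd.*' injectargs '--debug-osd 10/10'

# ... unset nosnaptrim, wait for the issue to trigger ...

# drop back to the defaults afterwards
ceph tell 'osd.*' injectargs '--debug-osd 1/5'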

This way we'd only have to run with debug logging for a very short time.
I can also make copies of the snapshots before and after the issue triggers so we have clear information about which blocks were zeroed. This is probably more useful than just knowing that the snapshot changed.

Actions #27

Updated by Ilya Dryomov about 1 year ago

Roel van Meer wrote:

- Did you attempt to correlate the time it takes for the corruption to occur (sometimes less than an hour, sometimes many hours) with what is going on in the VM? Perhaps the workload changes or something is invoking e.g. fstrim?

Yes, but I couldn't find anything.

Just to close this thread, can you check if running "sudo fstrim -v /" inside the VM shortly after it boots triggers the issue?

What I can (and will) do, is try to reproduce the problem with an even simpler VM, e.g. boot it with /bin/bash as init. Doing that doesn't issue any writes to the disk. I can also try to dismantle the VM up to the point where the reproduction fails. Maybe that tells us something. (I suspect we need writes to the disk for the problem to occur though.)

This is a good idea. I would also run some simple tests that don't involve a VM, directly on the hypervisor node. Something along the lines of: create the VM image, take the snapshot as usual, map the VM image directly with "sudo rbd map <img>" and then:

- write 4K, 16K, 1M, 4M to random offsets, check the snapshot after each write
- write a big multi-object range, check the snapshot
- discard 64K, 1M, and 4M at random but aligned offsets, check the snapshot after each discard
- discard a big multi-object range, check the snapshot

Here is an example:

$ dd if=/dev/urandom of=$DEV oflag=direct bs=4K count=1 seek=1500001
$ dd if=/dev/urandom of=$DEV oflag=direct bs=16K count=1 seek=200002
$ dd if=/dev/urandom of=$DEV oflag=direct bs=1M count=1 seek=3939
$ dd if=/dev/urandom of=$DEV oflag=direct bs=4M count=1 seek=654
$ dd if=/dev/urandom of=$DEV oflag=direct bs=4M count=120 seek=1700
$ blkdiscard -o 6400064K -l 64K $DEV
$ blkdiscard -o 1234M -l 1M $DEV
$ blkdiscard -o 200M -l 4M $DEV
$ blkdiscard -o 400M -l 2G $DEV

I saw in 9721-info.txt that the image is partitioned -- you could repeat the above both against the main partition and against the whole device. If it doesn't reproduce immediately (i.e. all checks pass), create a new VM image and repeat the usual steps, wait until that snapshot goes bad and redo the check on this image. If still no luck, run the above commands against a fully allocated image (instead of creating a VM image, just map a fresh image, "dd if=/dev/urandom of=$DEV oflag=direct bs=4M count=2560", unmap, take a snapshot and go on).
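
A sketch of that last, fully allocated variant (the image name is a placeholder):

$ rbd -p ssd create --size 10G test-full
$ DEV=$(rbd -p ssd map test-full)
$ dd if=/dev/urandom of=$DEV oflag=direct bs=4M count=2560
$ rbd -p ssd unmap test-full
$ rbd -p ssd snap create test-full@snapshot
$ DEV=$(rbd -p ssd map test-full)   # remap and run the write/discard checks above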

When creating a snapshot one of these times, append "--debug-ms 1 --debug-rbd 20" to "rbd snap create" command and attach the output.

Roel van Meer wrote:

Coming back to the nosnaptrim result, is it possible to increase debug logging in such a way that we can know what is being done during snapshot trimming (or any of the processes that are disabled when running with nosnaptrim)? Also, is it possible to enable this logging without impacting performance in a major way? Some impact is expected and allowed of course, but the system will have to remain usable for our customers.

It shouldn't impact the performance too much but it's going to generate gigabytes of logs and the RADOS team would probably need to get involved. Let's see if you can nail a better reproducer in the background first. If the issue is really with snap trimming, it would be nice to completely exclude RBD to make further debugging easier.

Actions #28

Updated by Roel van Meer about 1 year ago

Hi Ilya,

I've managed to make a much simpler (and smaller) reproducer, without any VM involvement. If I create an image manually, fill it with data, make a snapshot, and then either overwrite the original data or discard it, the snapshot changes. This was with an image of 200M. I've also switched from checksumming the entire image at once, to checksumming the image in 4M increments (so a checksum per block) so that I can see which and how many blocks change. The changed blocks all have a new checksum that is the same as a 4MB block of zeroes.
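
The per-block checksumming can be done with something like the following sketch (assuming a 200M image, i.e. 50 4M blocks, with the snapshot mapped at $dev):

for i in $(seq 0 49); do
    echo -n "block $i: "
    dd if=$dev iflag=direct bs=4M count=1 skip=$i status=none | sha256sum
done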

So with this, I'm down to:

rbd -p ssd create -s 200M roel-disk-1
rbd -p ssd map roel-disk-1
dd if=/dev/urandom of=/dev/rbdX
rbd -p ssd snap create roel-disk-1@snapshot
dd if=/dev/urandom of=/dev/rbdX

Hopefully I have time to look at the rest of your advice today or tomorrow.

Actions #29

Updated by Roel van Meer about 1 year ago

Hi Ilya,

I've created an image similar to the latest reproducer, with debugging, as you asked.

rbd -p ssd create -s 200M roel-disk-4
rbd -p ssd map roel-disk-4
dd if=/dev/urandom of=/dev/rbd16
rbd -p ssd --debug-ms 1 --debug-rbd 20 snap create roel-disk-4@snapshot 2>&1 | tee -a /tmp/roel-disk-4-snapshot-create.log

I hope this is the information you needed. If not, please let me know.

Actions #30

Updated by Roel van Meer about 1 year ago

Also, some more information about the nature of the problem. These are the results of the test I ran as specified in comment #58707-28 (roel-disk-1).

With this test, I did not make a single checksum of the entire image, but I checksummed each 4M-block individually. This was done every 10 minutes. The output was saved so I could compare it later. Below, I've made a list of the approximate duration between overwriting the original image, and the number of 4M-blocks in the snapshot that changed to all-zeroes.
As can be seen, roughly every one to two hours some more blocks change to zeroes, although there are also times when a change is seen every 10 or 20 minutes.
So the corruption is not a single event happening on the image, but looks more like a recurring process that takes down a few blocks at a time. The results of the test with roel-disk-2, where I did not overwrite the original image but ran blkdiscard on it, look similar but are not identical.

00h00: overwrite of image
01h20: 2 blocks zeroed
02h20: 4 blocks zeroed
03h10: 5 blocks
05h00: 6 blocks
07h00: 8 blocks
07h40: 10 blocks
08h00: 12 blocks
08h20: 13 blocks
09h30: 14 blocks
10h40: 15 blocks
11h10: 16 blocks
12h00: 20 blocks
13h20: 21 blocks
14h20: 22 blocks
15h40: 23 blocks
16h30: 26 blocks
17h40: 28 blocks
19h30: 31 blocks
19h40: 32 blocks
19h50: 35 blocks
20h10: 36 blocks
20h50: 37 blocks
22h00: 39 blocks
22h40: 41 blocks
23h10: 42 blocks
25h30: 44 blocks
25h50: 45 blocks
26h10: 46 blocks
28h10: 47 blocks
31h50: 48 blocks
32h20: 49 blocks
32h50: 50 (all) blocks zeroed
Actions #31

Updated by Ilya Dryomov about 1 year ago

Hi Roel,

Great job on a simple reproducer! I don't see anything wrong in the "snap create" debugging output and your most recent comment just strengthens my suspicion that this issue has nothing to do with RBD.

Let's try to exclude kernel RBD first. Can you repeat the latest reproducer with NBD device type?

  • map with "rbd map -t nbd ..."
  • use /dev/nbdX device
  • unmap with "rbd unmap -t nbd ..."

Everything else should remain the same (and no need to do the extended per-block thing -- just looking for a yes or no here).

Actions #32

Updated by Ilya Dryomov about 1 year ago

Sorry, one more thing. I'm not sure if "dd if=/dev/urandom of=/dev/rbdX" above is the verbatim command you used or just a shorthand. I would suggest to always use "oflag=direct bs=4M" in these tests, especially with NBD device type.

Actions #33

Updated by Ilya Dryomov about 1 year ago

For the extended per-block test, it would be good to confirm that zeroes are returned because the respective clone somehow gets deleted. If you are planning to repeat it, please collect the following output every X minutes:

$ for obj in $(rados -p ssd ls | grep $(rbd info <img> | awk '/block_name_prefix/{print $2}') | sort); do rados -p ssd listsnaps $obj; done
Actions #34

Updated by Roel van Meer about 1 year ago

Hi Ilya,

quick results here: The same thing is happening when using NBD. Almost immediately after overwriting the original image, reading the snapshot returned two 4M blocks containing nulls.

Actions #35

Updated by Roel van Meer about 1 year ago

Ilya Dryomov wrote:

Sorry, one more thing. I'm not sure if "dd if=/dev/urandom of=/dev/rbdX" above is the verbatim command you used or just a shorthand. I would suggest to always use "oflag=direct bs=4M" in these tests, especially with NBD device type.

I may not have used the direct flag in all tests, but I will in the future. Now running a second test with NBD just to be sure that I didn't make any mistakes.

Actions #36

Updated by Roel van Meer about 1 year ago

I re-did the NBD test and the results came out similar. I documented the commands and results.

Currently running the extended per-block test again while saving the listsnaps output of each block.

Actions #37

Updated by Roel van Meer about 1 year ago

Here are the results of the per-block test with listsnaps output.

Note that I had to do a variation of the command that you specified in note #58707-33, because rados -p ssd ls takes more than 15 minutes (I did not let it run to completion). But the results should be similar.
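
One such variation is to construct the object names directly from the block_name_prefix and the block index instead of listing the whole pool; a sketch (assuming the test image is roel-disk-8 and 50 blocks as before):

prefix=$(rbd -p ssd info roel-disk-8 | awk '/block_name_prefix/{print $2}')
for i in $(seq 0 49); do
    rados -p ssd listsnaps $(printf '%s.%016x' $prefix $i)
done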

In the results, you'll find the output of the blocks check before and after the issue triggers, both of the image and the snapshot, and also the output of the listsnaps command before and after the issue triggers. They match up: the second listsnaps output doesn't list the clones of exactly the blocks that are returned as zeroes.

Actions #38

Updated by Ilya Dryomov about 1 year ago

Roel van Meer wrote:

They match up: the second listsnaps output doesn't list the clones of exactly the blocks that are returned as zeroes.

Yup! At 16:55:

8ffe6397d6a7ec92bf1c218bff081848b4f8898d5b71b62c1d28b28753b333c2 (expected data)

rbd_data.e3c9a9fee508d1.000000000000001a:
cloneid    snaps    size    overlap
186    186    4194304    []
head    -    4194304

fa106f78713640c540e146bfd836502fdb278e1abf237dec000d668c4fecb72b (expected data)

rbd_data.e3c9a9fee508d1.000000000000002c:
cloneid    snaps    size    overlap
186    186    4194304    []
head    -    4194304

At 17:00:

bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8 (all zeroes)

rbd_data.e3c9a9fee508d1.000000000000001a:
cloneid    snaps    size    overlap
head    -    4194304

bb9f8df61474d25e71fa00722318cd387396ca1736605e1248821cc0de3d3af8 (all zeroes)

rbd_data.e3c9a9fee508d1.000000000000002c:
cloneid    snaps    size    overlap
head    -    4194304
Actions #39

Updated by Ilya Dryomov about 1 year ago

  • Project changed from rbd to Ceph
  • Assignee set to Radoslaw Zarzynski

Hi Radek,

Kicking this over to you as this appears to be either a Bluestore or an OSD issue. To summarize what Roel is seeing: an RBD snapshot gets corrupted some time after a part (or all) of the RBD image is overwritten. The nature of the corruption is that the snapshotted data just disappears: zeroes are returned instead when reading from a snapshot. It seems that the object clones are simply deleted underneath the RBD image even though the snapshot is still live. As expected, setting nosnaptrim prevents the corruption.

This happens both with kernel RBD and librbd but only on one particular pool of one particular cluster. There is some concerning history there (#58707-5) but the pool in question was reportedly re-created from scratch (#58707-10).

Actions #40

Updated by Ilya Dryomov about 1 year ago

  • Project changed from Ceph to bluestore
Actions #41

Updated by Radoslaw Zarzynski about 1 year ago

Hello Roel and Ilya!

The core team is here. I was reading over the conversation, noting down the most interesting posts [1].

I guess the direction will be to simplify the reproducer even further by stripping out RBD entirely and moving to a pure RADOS-based one with very few RADOS objects underneath. Then, we could bump up debugging on very few OSDs and correlate with BlueStore / SnapTrimmer activity.

I also let Adam Kupczyk know and asked him to investigate from the BlueStore side.

[1]: https://gist.github.com/rzarzynski/f97c7cb63950da101994630513385464

Actions #42

Updated by Radoslaw Zarzynski about 1 year ago

I wonder whether we could simplify the reproducer around the following idea:

rados -p test-pool put obj /tmp/non_zero_4mb
rados -p test-pool mksnap snap1
rados -p test-pool put obj /tmp/other_non_zero_4mb
rados -p test-pool mksnap snap2
rados -p test-pool put obj /tmp/yet_another_non_zero_4mb
rados -p test-pool rmsnap snap1
rados -p test-pool get --snap snap2 obj - # for the data check
Actions #43

Updated by Roel van Meer about 1 year ago

Hi Radoslaw!

Thanks for taking the time to look at this problem!
If I try the above commands on my test cluster, I cannot run the mksnap command:

can't create snapshot: pool ssd is in selfmanaged snaps mode

Something that I didn't think of before: would it be useful information if we would create another pool on the same OSDs, to see if the problem shows up there too?

Actions #44

Updated by Ilya Dryomov about 1 year ago

Hi Radek,

Note that the issue occurs only on one particular (production!) pool and that Roel is trying these reproducers there. "rados mksnap" can't be used on RBD pools because it creates a pool-managed snapshot, while RBD uses self-managed snapshots. I would be very cautious there, as the two don't play together at all.

For a pure RADOS reproducer, I would suggest using rados_ioctx_selfmanaged_snap_create() API (available as create_self_managed_snap() in Python bindings) and setting a respective snap context on just a few test objects which can be removed later -- similar to what RBD does internally.

Actions #45

Updated by Ilya Dryomov about 1 year ago

Roel van Meer wrote:

Something that I didn't think of before: would it be useful information if we would create another pool on the same OSDs, to see if the problem shows up there too?

Hi Roel,

Definitely! I sort of thought you have already tried that since the point about that particular pool was stressed a couple of times.

Actions #46

Updated by Roel van Meer about 1 year ago

Ilya Dryomov wrote:

Definitely! I sort of thought you have already tried that since the point about that particular pool was stressed a couple of times.

Trying it now. Although of course the workload on this new pool is different, it currently has just the one image and one snapshot on it, so I'm not sure if it is comparable. At least it lives on the same OSDs as the affected pool.

Actions #47

Updated by Roel van Meer about 1 year ago

All right, that was quicker than I expected. The new pool also exhibits the problematic behaviour.

For the record, the pool was created with:

ceph osd pool create test1 128 128 replicated ssd --autoscale-mode=off
ceph osd pool application enable test1 rbd

I noticed that the pool didn't have the selfmanaged_snaps flag until the first rbd image was created.
Also, the output of ceph osd pool ls detail included

removed_snaps_queue [2~1]

for a short while after creating the pool. This was also the case on the test cluster, so I expect it is normal behaviour.

Actions #48

Updated by Radoslaw Zarzynski about 1 year ago

Hello Guys,

For a pure RADOS reproducer, I would suggest using rados_ioctx_selfmanaged_snap_create() API (available as create_self_managed_snap() in Python bindings) and setting a respective snap context on just a few test objects which can be removed later -- similar to what RBD does internally.

Yeah, this would make it more similar to what RBD does. The cost is Python or C coding as, AFAIK, none of the CLI tools exposes these calls (I guess due to the burden of providing a SnapContext).

$ grep -r create_self_managed_snap ./src
./src/pybind/rados/rados.pyx:    def create_self_managed_snap(self):
./src/test/pybind/test_rados.py:        snap_id_1 = self.ioctx.create_self_managed_snap()
./src/test/pybind/test_rados.py:        snap_id_2 = self.ioctx.create_self_managed_snap()
$ grep -r rados_ioctx_selfmanaged_snap_remove src/
src/include/rados/librados.h:CEPH_RADOS_API int rados_ioctx_selfmanaged_snap_remove(rados_ioctx_t io,
src/tracing/librados.tp:TRACEPOINT_EVENT(librados, rados_ioctx_selfmanaged_snap_remove_enter,
src/tracing/librados.tp:TRACEPOINT_EVENT(librados, rados_ioctx_selfmanaged_snap_remove_exit,
src/librados/librados_c.cc:extern "C" int LIBRADOS_C_API_DEFAULT_F(rados_ioctx_selfmanaged_snap_remove)(
src/librados/librados_c.cc:  tracepoint(librados, rados_ioctx_selfmanaged_snap_remove_enter, io, snapid);
src/librados/librados_c.cc:  tracepoint(librados, rados_ioctx_selfmanaged_snap_remove_exit, retval);
src/librados/librados_c.cc:LIBRADOS_C_API_BASE_DEFAULT(rados_ioctx_selfmanaged_snap_remove);
src/librados/librados_c.cc:  tracepoint(librados, rados_ioctx_selfmanaged_snap_remove_enter, io, snapid);
src/librados/librados_c.cc:  tracepoint(librados, rados_ioctx_selfmanaged_snap_remove_exit, 0);
src/pybind/rados/rados.pyx:            ret = rados_ioctx_selfmanaged_snap_remove(self.io, _snap_id)
src/pybind/rados/c_rados.pxd:    int rados_ioctx_selfmanaged_snap_remove(rados_ioctx_t io,
src/pybind/rados/mock_rados.pxi:    int rados_ioctx_selfmanaged_snap_remove(rados_ioctx_t io,
src/test/librados/snapshots_stats.cc:    ASSERT_EQ(0, rados_ioctx_selfmanaged_snap_remove(ioctx, my_snaps[snap]));
src/test/librados/snapshots_stats.cc:    ASSERT_EQ(0, rados_ioctx_selfmanaged_snap_remove(ioctx, my_snaps[snap]));
src/test/librados/snapshots.cc:  ASSERT_EQ(0, rados_ioctx_selfmanaged_snap_remove(ioctx, my_snaps.back()));
src/test/librados/snapshots.cc:  ASSERT_EQ(0, rados_ioctx_selfmanaged_snap_remove(ioctx, my_snaps.back()));
src/test/librados/snapshots.cc:  ASSERT_EQ(0, rados_ioctx_selfmanaged_snap_remove(ioctx, my_snaps.back()));
src/test/librados/snapshots.cc:  ASSERT_EQ(0, rados_ioctx_selfmanaged_snap_remove(ioctx, my_snaps.back()));
src/test/librados/snapshots.cc:  ASSERT_EQ(0, rados_ioctx_selfmanaged_snap_remove(ioctx, my_snaps.back()));
src/test/librados/snapshots.cc:  ASSERT_EQ(0, rados_ioctx_selfmanaged_snap_remove(ioctx, my_snaps.back()));

All right, that was quicker than I expected. The new pool also exhibits the problematic behaviour.

Have I understood correctly that the data corruption happens on a fresh pool too?

If so, maybe we could revive the dumb RADOS reproducer. From the perspective of just the SnapTrimmer and BlueStore, there is little to no difference in how the SnapContext is provided.

Actions #49

Updated by Roel van Meer about 1 year ago

Radoslaw Zarzynski wrote:

Have I understood correctly the data corruption happens on a fresh pool too?

Yes, that is correct.

Actions #50

Updated by Radoslaw Zarzynski about 1 year ago

Maybe it's not about a pool but about an OSD.

The intention behind the dumb reproducer is to isolate the problem: fewer objects, fewer PGs, fewer OSDs involved. Ideally there would be one of each of those, which would make tracking the life of snaps across client / trimmer / BlueStore much easier.

However, maybe we could start just by lowering the number of PGs in a pool and ensuring it gets assigned to a particular OSD / OSDs? What do you think, Ilya?

Actions #51

Updated by Ilya Dryomov about 1 year ago

I was under the impression that the issue was specific to the existing production "ssd" pool. Now that we know that it reproduces on a freshly-created pool as well, it's completely up to you. A pool with a single PG is a good start, if that doesn't work we can try a pool with 128 PGs that all map to the same set of OSDs (pg_num vs pgp_num). We can also try a smaller (down to a single object) RBD image, completely exclude RBD with a small Python script and even do that pool-managed snapshot thing ;) I think the reason Roel settled on a 50-object RBD image is that it took many hours for all object clones to disappear (~33h in #58707-30) but the first (any) usually goes fairly quickly.

Roel, can you repeat with "ceph osd pool create test1 1 1 replicated ssd --autoscale-mode=off"?

Actions #52

Updated by Ilya Dryomov about 1 year ago

Roel van Meer wrote:

I noticed that the pool didn't have the selfmanaged_snaps flag until the first rbd image was created.

This is expected.

Also, the output of ceph osd pool ls detail included

removed_snaps_queue [2~1]

for a short while after creating the pool. This was also the case on the test cluster, so I expect it is normal behaviour.

Yes, also expected -- this is a side-effect of "rbd pool init" command.

Actions #53

Updated by Radoslaw Zarzynski about 1 year ago

  • Priority changed from Normal to High
Actions #54

Updated by Roel van Meer about 1 year ago

Ilya Dryomov wrote:

Roel, can you repeat with "ceph osd pool create test1 1 1 replicated ssd --autoscale-mode=off"?

I've created 4 new pools with a single PG each, as specified in the above command. One reproducer has been running on each of those pools for two hours now and the problem hasn't triggered yet. I'll leave them running longer and report back.

Meanwhile, are there already other useful tests I can run, in case it's not reproducible on the single-PG pools? I can create pools with 8, 4 or 2 PGs and see if it is reproducible on those pools? You also said I could create a pool with 128 PGs that all map to the same set of OSDs, could you elaborate on how to achieve that?

Actions #55

Updated by Roel van Meer about 1 year ago

Also, maybe a wild thought: I know from previous reproducers which rados objects were affected. If I can locate the PGs that those live on, I can start making a list of OSDs that we know for sure to have had a PG somehow related to the problem. On the other hand, if only a select number of OSDs were exhibiting the problem, then I would have expected more variation in the results of the reproducers. As it looks now, if I leave them running long enough, in the end all of the snapshot blocks are gone. So, that would suggest to me that it's not a problem with a specific set of OSDs, but that all of them exhibit the same behavior.
Can't hurt to check though.

With that said, is there a quick way to determine the PG that a specific rados object lives on, other than doing rados --pgid <id> ls on all PGs?

Actions #56

Updated by Ilya Dryomov about 1 year ago

Roel van Meer wrote:

I can create pools with 8, 4 or 2 PGs and see if it is reproducible on those pools?

Please do -- this is Radek's territory now, but reducing the number of entities of any kind (e.g. PGs) that are involved in a reproducer is always a good idea.

You also said I could create a pool with 128 PGs that all map to the same set of OSDs, could you elaborate on how to achieve that?

"ceph osd pool create <pool name> 128 1 replicated ssd --autoscale-mode=off"

With that said, is there a quick way to determine the PG that a specific rados object lives on, other than doing rados --pgid <id> ls on all PGs?

"ceph osd map <pool name> <object name>", e.g. "ceph osd map ssd rbd_data.e3c9a9fee508d1.000000000000001a"

Actions #57

Updated by Radoslaw Zarzynski about 1 year ago

Looking for OSD / failure correlation is definitely a good idea. Do we have the data to check for the already recorded occurrences?

Actions #58

Updated by Ilya Dryomov about 1 year ago

Ilya Dryomov wrote:

"ceph osd pool create <pool name> 128 1 replicated ssd --autoscale-mode=off"

Forgot to mention yesterday: after running this command, monitor "ceph osd pool ls detail" output and wait until you see "... pg_num 128 pgp_num 1 ..." for that pool. pgp_num would start at 128 and be slowly decreasing -- it may take a couple of minutes.
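
For example (using the document's own pool-name placeholder):

$ watch -n 30 "ceph osd pool ls detail | grep '<pool name>'"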

Actions #59

Updated by Roel van Meer about 1 year ago

Hi Ilya, Radoslaw, here are some results.

After creating the new pool (test1) with 128 PGs I created 4 new pools with a single PG each (test2-test5), and I tried quite a few reproducers on them. The results are interesting!
On test1 (128 PGs), the problem reproduces as before, with one or two blocks at a time disappearing.
On test2, 3, 4 and 5 the problem reproduces differently: if the problem triggers, all blocks disappear at once (or at least within the same 5-minute interval between two tests). It takes a lot longer to trigger though. Imho both of these things make sense.

One thing is unexpected: I can no longer reproduce the problem on the production "ssd" pool! Currently running two reproducers, one has been running for 33 hours without triggering, one has been running for 14 hours without triggering. I don't know what to make of this exactly, but it makes me think it is less likely that the problem is something within the OSDs or PGs themselves, and more likely that the problem is caused by the high pool IDs or the history of this cluster. What do you think?

Actions #60

Updated by Roel van Meer about 1 year ago

Also, I'm still working on making an inventory of affected OSDs, but due to some other obligations I will have very little time to do any real work on this in the next week. Sorry about that. I'll try to answer any questions asap though.

Actions #61

Updated by Ilya Dryomov about 1 year ago

Hi Roel,

Yeah, the history of this cluster is highly suspect. Do the new test pools also have high pool IDs? Do you recall how you ended up with some pool IDs in 15xxx range?

Can you attach "ceph osd pool ls detail" output?

Actions #62

Updated by Roel van Meer about 1 year ago

For a while we ran a benchmark that created a new pool, ran some tests on it, and deleted the pool again. This practice was stopped after #51619 and #51682 surfaced on this cluster. Although originally the production pools had low pool IDs, during the recovery of the issues caused by the mentioned bugs the "ssd" pool was created anew, so then it had a high ID.

Actions #63

Updated by Radoslaw Zarzynski about 1 year ago

I tried quite a few reproducers on them. The results are interesting!

This looks interesting also from the point of analyzing what exactly an OSD (SnapTrimmer / BlueStore in particular) was doing. As the number of involved OSDs gets pretty low, maybe it's the moment to start gathering logs?

The crucial one in SnapTrimmer requires at least debug_osd=10.

#define dout_subsys ceph_subsys_osd

// ...

boost::statechart::result PrimaryLogPG::AwaitAsyncWork::react(const DoSnapWork&)
{
  // ...
  for (auto &&object: to_trim) {
    // Get next
    ldout(pg->cct, 10) << "AwaitAsyncWork react trimming " << object << dendl;

Though, I'm aware 10 could be too high for production, and we should likely try to isolate further. Some ideas to discuss:

1) designating some (few) OSDs we know the problem reproduced on to host only the reproduction pool (requires moving data to the others),
2) adding new ones (they would lack the history; extra resources needed),
3) finding a way to clone OSDs and add them to a testing cluster,
4) refine / isolate the problem further, maybe by restricting the pool's size or decreasing the image size (perhaps both).

Actions #64

Updated by Roel van Meer about 1 year ago

Hi!

I've added a list of missing blocks, mapped to their PGs. Not all, but most of the OSDs are in this list, so that makes it unlikely that the problem is associated with just a few of the OSDs. I expect that if I added more test cases, the remaining OSDs would be in this list as well.

After running more than 3 days, the reproducers currently running on the production "ssd" pool still haven't failed, so creating new pools did have an effect on the problem. This also makes it less likely that it is an OSD-specific problem, I would say.

As for logging, I'm currently rebuilding the Proxmox Ceph packages. I hope these builds are reproducible, so I can verify that my builds are identical to theirs. If that is the case, I could also rebuild after lowering the required log level for the crucial messages. That way, I think it would be feasible to get more logs from the cluster without impacting performance.

Actions #65

Updated by Roel van Meer about 1 year ago

My local ceph build succeeded, so I can start testing. (Unfortunately the build is not reproducible, so I'm not 100% sure that my build is identical to Proxmox' build, but I have faith).

Since the problem exists on a pool with 1 PG, only three OSDs are involved there. Do you need debug logs of all three or only of the primary?

Actions #66

Updated by Radoslaw Zarzynski about 1 year ago

The ability to rebuild brings an entire new set of possibilities! Cool.

When it comes to just the SnapTrimmer, the primary's log should be enough, as:

void PrimaryLogPG::kick_snap_trim()
{
  ceph_assert(is_active());
  ceph_assert(is_primary());
  if (is_clean() &&
      !state_test(PG_STATE_PREMERGE) &&
      !snap_trimq.empty()) {
    if (get_osdmap()->test_flag(CEPH_OSDMAP_NOSNAPTRIM)) {
      dout(10) << __func__ << ": nosnaptrim set, not kicking" << dendl;
    } else {
      dout(10) << __func__ << ": clean and snaps to trim, kicking" << dendl;
      reset_objects_trimmed();
      set_snaptrim_begin_stamp();
      snap_trimmer_machine.process_event(KickTrim());
    }
  }
}
Actions #67

Updated by Roel van Meer about 1 year ago

Hi Radoslaw!

Apologies for the delay, I had a case of appendicitis to take care of first. But I'm back, with new results!

I did a test with a 200M image on a pool that has only one PG. The OSD having the primary copy was running with my custom version of ceph-osd where the log level of several items was decreased from 10 to 3, and the debug_osd level of that process was set to 3.

According to the test, the blocks disappeared between 23:00 and 23:05 yesterday. And lo and behold, at that time the log shows:

2023-04-03T23:00:51.673+0200 7f92cfdf9700  0 log_channel(cluster) log [DBG] : purged_snaps scrub starts
2023-04-03T23:00:51.677+0200 7f92cfdf9700 0 log_channel(cluster) log [DBG] : purged_snaps scrub found 50 strays

Why I didn't see this before is beyond me, and I'd like to apologize, as I might have saved everyone some time by reporting this earlier.

After 23:00:51 there are lots of log entries of the snaptrimmer deleting the blocks, which makes sense if Ceph thinks they are strays, right? So I think the next question would be: why does that happen?

This test was done on the pool with id 15217, and the id of the rbd image is 8e0719387647cc. Yesterday's full log of OSD 56 is attached.

Could you please advise on how to continue? Thanks a lot!

Actions #68

Updated by Roel van Meer about 1 year ago

I'm now running a test with full debug logs from SnapMapper.cc.

Actions #69

Updated by Roel van Meer about 1 year ago

Hi Radoslaw!

I've got more debug logs. Again a test was done, this time with additional logging from SnapMapper.cc.
Timeline of the image/snapshot creation is as follows:

2023-04-04 12:45:40 rbd -p test4 create -s 200M test4-disk4
2023-04-04 12:45:52 rbd -p test4 map test4-disk4
2023-04-04 12:46:11 dd if=/dev/urandom of=/dev/rbd2 oflag=direct bs=4M count=50
2023-04-04 12:46:26 rbd -p test4 snap create test4-disk4@snapshot
2023-04-04 12:46:59 rbd -p test4 map test4-disk4@snapshot
2023-04-04 12:48:44 dd if=/dev/urandom of=/dev/rbd2 oflag=direct bs=4M count=50

On 2023-04-05, between 03:00 and 03:05 the snapshot disappears.
The attached log has lots of snap_mapper output around that time that is above my pay grade.

Pool id 15217, rbd id 98301b80c74525.

Actions #70

Updated by Roel van Meer about 1 year ago

I've been looking at the logs and trying to make sense of them. I think the relevant section is:

snap_mapper.run
snap_mapper._parse_p purged_snaps pool 15213 [1,c0)
snap_mapper._parse_m mapping pool 15213 snap d7 shard 255 15213:2b7d2064:::rbd_data.edab6e4267a35a.0000000000000003:d7
snap_mapper._parse_p purged_snaps pool 15214 [1,f7)
snap_mapper.run ok 15213:2b7d2064:::rbd_data.edab6e4267a35a.0000000000000003:d7 snap d7 precedes pool 15214 purged_snaps [1,f7)
snap_mapper._parse_m mapping pool 15217 snap 7 shard 255 15217:00a7322e:::rbd_data.98301b80c74525.000000000000000e:7
snap_mapper._parse_p purged_snaps pool 15215 [1,f7)
snap_mapper._parse_p purged_snaps pool 15216 [1,f7)
snap_mapper._parse_p purged_snaps pool 15217 [1,f7)
snap_mapper.run stray 15217:00a7322e:::rbd_data.98301b80c74525.000000000000000e:7 snap 7 in pool 15217 shard 255 purged_snaps [1,f7)
snap_mapper._parse_m mapping pool 15217 snap 7 shard 255 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:7

Trying to correlate that with this piece of code from SnapMapper::Scrubber::run():

if (mapping.hoid.pool < pool ||
    mapping.snap < begin) {

It seems this if() is false, because mapping.hoid.pool is equal to pool(15217), and mapping.snap(7) is not smaller than begin(1), so that's why these are marked as strays. Unfortunately that's about as far as I get.

Actions #71

Updated by Roel van Meer about 1 year ago

Hi! Could I politely ask for some input on this issue, please?

Thanks in advance, Roel

Actions #72

Updated by Radoslaw Zarzynski about 1 year ago

Hi Roel! Sorry, I missed that you were back. Looking into the new findings!

Actions #73

Updated by Radoslaw Zarzynski about 1 year ago

I went over the SnapMapper::Scrubber's code and the log with the additions.

All the scrubber does is compare the purged snap intervals with the mappings of SnapMapper (stored in the OMAP of make_purged_snaps_oid() and make_snapmapper_oid(), respectively). These are its inputs, and I think the Scrubber behaved according to them -- it requested retrimming of snap 7 because it's simply between 1 and 0xf7 for pool 15217.

The questions are:

1. Is the scrubber's input correct? I'm particularly interested in the purged snap intervals. They are created based on the data from the monitors and stored locally after processing by the pretty complex:

  void SnapMapper::record_purged_snaps(
    CephContext *cct,
    ObjectStore *store,
    ObjectStore::CollectionHandle& ch,
    ghobject_t hoid,
    ObjectStore::Transaction *t,
    map<epoch_t,mempool::osdmap::map<int64_t,snap_interval_set_t>> purged_snaps)
  { 

To answer this question, OSD logs around the moment of the snap deletion request would be useful.

2. Should the clone object be deleted? This decision is made by PrimaryLogPG::trim_object() (de facto part of the SnapTrimmer) based on whether the trimmed snap is the very last one served by the clone object. Here snap 7 was perceived to be so (snaps 7 -> ... deleting) and the underlying clone was removed.

=524009) [56,19,78] r=0 lpr=524009 luod=524017'1306 crt=524086'1308 lcod 524017'1304 mlcod 524017'1304 active+clean+snaptrim trimq=[7~1](7)] SnapTrimmer state<Trimming/AwaitAsyncWork>: AwaitAsyncWork react trimming 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:7
2023-04-05T03:02:44.618+0200 7f0333cfe700  3 osd.56 pg_epoch: 524086 pg[15217.0( v 524086'1308 (0'0,524086'1308] local-lis/les=524009/524010 n=317 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 luod=524017'1306 crt=524086'1308 lcod 524017'1304 mlcod 524017'1304 active+clean+snaptrim trimq=[7~1](7)] 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:7 old_snaps 7 old snapset 7=[]:{7=[7]}
2023-04-05T03:02:44.618+0200 7f0333cfe700  3 osd.56 pg_epoch: 524086 pg[15217.0( v 524086'1308 (0'0,524086'1308] local-lis/les=524009/524010 n=317 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 luod=524017'1306 crt=524086'1308 lcod 524017'1304 mlcod 524017'1304 active+clean+snaptrim trimq=[7~1](7)] 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:7 snaps 7 ->  ... deleting
2023-04-05T03:02:44.618+0200 7f0333cfe700  3 osd.56 pg_epoch: 524086 pg[15217.0( v 524086'1308 (0'0,524086'1308] local-lis/les=524009/524010 n=317 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 luod=524017'1306 crt=524086'1308 lcod 524017'1304 mlcod 524017'1304 active+clean+snaptrim trimq=[7~1](7)] 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:7 new snapset 7=[]:{} on 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:head(524017'1261 client.2103650080.0:45169 dirty|data_digest s 4194304 uv 1261 dd bc7892d3 alloc_hint [4194304 4194304 0])
2023-04-05T03:02:44.618+0200 7f0333cfe700  3 osd.56 pg_epoch: 524086 pg[15217.0( v 524086'1308 (0'0,524086'1308] local-lis/les=524009/524010 n=317 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 luod=524017'1306 crt=524086'1308 lcod 524017'1304 mlcod 524017'1304 active+clean+snaptrim trimq=[7~1](7)] 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:7 writing updated snapset on 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:head, snapset is 7=[]:{}
2023-04-05T03:02:44.618+0200 7f0333cfe700  3 snap_mapper.remove_oid 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:7
2023-04-05T03:02:44.618+0200 7f0333cfe700  3 snap_mapper._remove_oid 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:7
2023-04-05T03:02:44.618+0200 7f0333cfe700  3 snap_mapper.get_snaps 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:7 7
2023-04-05T03:02:44.618+0200 7f0333cfe700  3 snap_mapper.clear_snaps 15217:00fe1408:::rbd_data.98301b80c74525.0000000000000022:7
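
In other words, the decision visible here boils down to something like the following (a simplified sketch of the behaviour described in point 2, not the actual trim_object() code; Python purely for illustration):

def trim_clone(clone_snaps, snaps_to_trim):
    # The clone object is removed only when trimming leaves it with no snaps to serve.
    remaining = clone_snaps - snaps_to_trim
    if not remaining:
        return "delete clone"
    return "keep clone, remaining snaps: " + str(sorted(remaining))

print(trim_clone({0x7}, {0x7}))        # what happened above -> "delete clone"
print(trim_clone({0x7, 0x9}, {0x7}))   # hypothetical extra snap -> clone kept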

However, is this correct? Is there any other snap missed in the new_snaps? To be conclusive the life-time of the object would need to be tracked.

Actions #74

Updated by Roel van Meer about 1 year ago

Radoslaw Zarzynski wrote:

I went over the SnapMapper::Scrubber's code and the log with the additions.

Thanks again for looking at this!

I have a few follow-up questions regarding your last questions.

1. Is the scrubber's input correct? I'm particularly interested in the purged snap intervals. They are created based on data from the monitors and stored locally after processing by the pretty complex:

[...]

To answer this question, OSD logs from around the moment of the snap deletion request would be useful.

The logs of SnapMapper::record_purged_snaps() are already included in the last logfile. If this is not enough, then what do we need? Would a debug_osd=10 log be sufficient?

2. Should the clone object be deleted? This decision is made by PrimaryLogPG::trim_object() (de facto part of the SnapTrimmer) based on whether the trimmed snap is the very last one served by the clone object. Here snap 7 was perceived to be exactly that (snaps 7 -> ... deleting) and the underlying clone has been removed.

[...]
To be conclusive the life-time of the object would need to be tracked.

What would be required to track the life-time of the object, with regard to subsystems and log levels?

Lastly, is there a way to trigger this scrub/trim process for the OSD under investigation? If running with high debug levels, I'd rather do this shorter than longer.

Actions #75

Updated by Radoslaw Zarzynski about 1 year ago

The logs of SnapMapper::record_purged_snaps() are already included in the last logfile. If this is not enough, then what do we need? Would a debug_osd=10 log be sufficient?

They are, but for different pools, not for 15217, which I grepped for :-(. Hmm, do you recall by any chance at which exact time the rmsnap-7-on-15217 was requested?

What would be required to track the life-time of the object, with regard to subsystems and log levels?

debug_osd at 10 or maybe even 20, but increasing it enough might not be an option, I'm afraid. Likely we'll need to be more selective. PrimaryLogPG::make_writeable() would be the first interesting thing to take a look at.

Lastly, is there a way to trigger this scrub/trim process for the OSD under investigation? If running with high debug levels, I'd rather do this shorter than longer.

Yes! There are two ways the scrubber is triggered:
1. The main one is in the OSD's tick() method, depending on a couple of configurables:

void OSD::tick()
{
  // ...

  // scrub purged_snaps every deep scrub interval
  {
    const utime_t last = superblock.last_purged_snaps_scrub;
    utime_t next = last;
    next += cct->_conf->osd_scrub_min_interval;
    std::mt19937 rng;
    // use a seed that is stable for each scrub interval, but varies
    // by OSD to avoid any herds.
    rng.seed(whoami + superblock.last_purged_snaps_scrub.sec());
    double r = (rng() % 1024) / 1024.0;
    next +=
      cct->_conf->osd_scrub_min_interval *
      cct->_conf->osd_scrub_interval_randomize_ratio * r;
    if (next < ceph_clock_now()) {
      dout(20) << __func__ << " last_purged_snaps_scrub " << last
               << " next " << next << " ... now" << dendl;
      scrub_purged_snaps();
    } else {
      dout(20) << __func__ << " last_purged_snaps_scrub " << last
               << " next " << next << dendl;
    }
  }

2. A dedicated admin command :-).

void OSD::asok_command(
  std::string_view prefix, const cmdmap_t& cmdmap,
  Formatter *f,
  const bufferlist& inbl,
  std::function<void(int,const std::string&,bufferlist&)> on_finish)
{
  // ...
  else if (prefix == "scrub_purged_snaps") {
    lock_guard l(osd_lock);
    scrub_purged_snaps();
  }
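
For reference, on a running OSD this can be invoked either through the admin socket (ceph daemon osd.<id> scrub_purged_snaps) or via ceph tell osd.<id> scrub_purged_snaps.
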
Actions #76

Updated by Roel van Meer about 1 year ago

Ah, that last command was very helpful, running it triggers the problem immediately!

So, I have for you: a log file with debug_osd at 10, and one at 20, both including the entire reproduction from creating the image, filling the image, making the snapshot, and overwriting the image.

The level 10 debug was made on pool 15217, rbd id dcbf9929030c6e.
The level 20 debug was made on pool 15217, rbd id f649b1de68a46f.

Since the files are too big for the bug tracker, I've uploaded them here:
https://1afa.com/ceph/ceph-osd.56-20230414-debug10.log.gz
https://1afa.com/ceph/ceph-osd.56-20230414-debug20.log.gz

Please note that this was done with the customised ceph-osd binary, so several debug messages that originally were level 10 or 20 are now still level 3. They are included in these files, but they may have a different log level than you're used to.

Actions #77

Updated by Roel van Meer about 1 year ago

Hi Radoslaw!

You're undoubtedly very busy, but have you had any chance to look at the debug logs?
Since we can now reproduce the problem very quickly, creating a debug log with debug of other subsystems set to 20 is also possible. Please let me know if we have to provide additional logs or info.

Thanks a lot! Roel

Actions #78

Updated by Radoslaw Zarzynski about 1 year ago

Hi Roel!

Sorry for lagging on this. Yesterday I got back from Cephalocon 2023.

Life history of the rbd_data.f649b1de68a46f.0000000000000010 object
=====================================================================

Here it got created:

2023-04-14T16:00:12.675+0200 7f032fcf6700 20 osd.56 pg_epoch: 525166 pg[15217.0( v 525166'2403 (0'0,525166'2403] local-lis/les=524009/524010 n=395 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525166'2403 lcod 525166'2402 mlcod 525166'2402 active+clean] do_op: op osd_op(client.2103650080.0:470197 15217.0 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head [set-alloc-hint object_size 4194304 write_size 4194304,writefull 0~4194304 in=4194304b] snapc 0=[] ondisk+write+known_if_redirected e525166) v8
...
2023-04-14T16:00:12.675+0200 7f032fcf6700 10 osd.56 pg_epoch: 525166 pg[15217.0( v 525166'2403 (0'0,525166'2403] local-lis/les=524009/524010 n=395 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525166'2403 lcod 525166'2402 mlcod 525166'2402 active+clean] get_object_context: no obc for soid 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head but can_create
...
2023-04-14T16:00:12.675+0200 7f032fcf6700 10 osd.56 pg_epoch: 525166 pg[15217.0( v 525166'2403 (0'0,525166'2403] local-lis/les=524009/524010 n=395 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525166'2403 lcod 525166'2402 mlcod 525166'2402 active+clean]  final snapset 0=[]:{} in 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head

The response for client:

2023-04-14T16:00:12.683+0200 7f0333cfe700 10 osd.56 pg_epoch: 525166 pg[15217.0( v 525166'2404 (0'0,525166'2404] local-lis/les=524009/524010 n=396 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525166'2404 lcod 525166'2403 mlcod 525166'2403 active+clean]  sending reply on osd_op(client.2103650080.0:470197 15217.0 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head [set-alloc-hint object_size 4194304 write_size 4194304,writefull 0~4194304 in=4194304b] snapc 0=[] ondisk+write+known_if_redirected e525166) v8 0x5623552fd680

A read with snapcontext.seq=0:

2023-04-14T16:00:33.484+0200 7f0333cfe700 20 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2485 (0'0,525167'2485] local-lis/les=524009/524010 n=430 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2485 lcod 525167'2484 mlcod 525167'2484 active+clean] do_op: op osd_op(client.2103650080.0:470333 15217.0 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head [read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e525167) v8

Read against an explicitly set snapid:

2023-04-14T16:00:35.836+0200 7f032fcf6700 20 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2485 (0'0,525167'2485] local-lis/les=524009/524010 n=430 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2485 lcod 525167'2484 mlcod 525167'2484 active+clean] do_op: op osd_op(client.2103650080.0:470383 15217.0 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a [read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e525167) v8
...
2023-04-14T16:00:35.836+0200 7f032fcf6700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2485 (0'0,525167'2485] local-lis/les=524009/524010 n=430 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2485 lcod 525167'2484 mlcod 525167'2484 active+clean] find_object_context 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a @a snapset 0=[]:{}
2023-04-14T16:00:35.836+0200 7f032fcf6700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2485 (0'0,525167'2485] local-lis/les=524009/524010 n=430 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2485 lcod 525167'2484 mlcod 525167'2484 active+clean] get_object_context: found obc in cache: 0x5623a80438c0
2023-04-14T16:00:35.836+0200 7f032fcf6700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2485 (0'0,525167'2485] local-lis/les=524009/524010 n=430 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2485 lcod 525167'2484 mlcod 525167'2484 active+clean] get_object_context: 0x5623a80438c0 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head rwstate(none n=0 w=0) oi: 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head(525166'2404 client.2103650080.0:470197 dirty|data_digest s 4194304 uv 2404 dd 635d715e alloc_hint [4194304 4194304 0]) exists: 1 ssc: 0x5623da64f8c0 snapset: 0=[]:{}
2023-04-14T16:00:35.836+0200 7f032fcf6700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2485 (0'0,525167'2485] local-lis/les=524009/524010 n=430 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2485 lcod 525167'2484 mlcod 525167'2484 active+clean] find_object_context 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head want a > snapset seq 0 -- HIT 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head

Finally the writefull happens; the client sent a SnapContext with SnapContext.seq=a:

2023-04-14T16:00:50.136+0200 7f0333cfe700 20 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean] do_op: op osd_op(client.2103650080.0:470465 15217.0 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head [writefull 0~4194304 in=4194304b] snapc a=[a] ondisk+write+known_if_redirected e525167) v8
...
2023-04-14T16:00:50.136+0200 7f0333cfe700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean] get_object_context: obc NOT found in cache: 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head
...
2023-04-14T16:00:50.136+0200 7f0333cfe700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean] get_object_context: creating obc from disk: 0x56236ab53600
2023-04-14T16:00:50.136+0200 7f0333cfe700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean] get_object_context: 0x56236ab53600 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head rwstate(none n=0 w=0) oi: 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head(525166'2404 client.2103650080.0:470197 dirty|data_digest s 4194304 uv 2404 dd 635d715e alloc_hint [4194304 4194304 0]) exists: 1 ssc: 0x562369efbce0 snapset: 0=[]:{}
2023-04-14T16:00:50.136+0200 7f0333cfe700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean] find_object_context 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head @head oi=15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head(525166'2404 client.2103650080.0:470197 dirty|data_digest s 4194304 uv 2404 dd 635d715e alloc_hint [4194304 4194304 0])
...
2023-04-14T16:00:50.136+0200 7f0333cfe700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean] execute_ctx 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head [writefull 0~4194304 in=4194304b] ov 525166'2404 av 525167'2535 snapc a=[a] snapset 0=[]:{}
###
2023-04-14T16:00:50.136+0200 7f0333cfe700 20 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean] make_writeable 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head snapset=0=[]:{}  snapc=a=[a]
###
2023-04-14T16:00:50.136+0200 7f0333cfe700 20 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean]  got greedy write on clone_obc obc(15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a rwstate(write n=1 w=0))
2023-04-14T16:00:50.136+0200 7f0333cfe700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean]  cloning v 525166'2404 to 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a v 525167'2535 snaps=[a] snapset=0=[]:{a=[a]}
2023-04-14T16:00:50.136+0200 7f0333cfe700 20 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean] make_writeable 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head done, snapset=a=[]:{a=[a]}
...
2023-04-14T16:00:50.136+0200 7f0333cfe700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2534 (0'0,525167'2534] local-lis/les=524009/524010 n=446 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2534 lcod 525167'2533 mlcod 525167'2533 active+clean]  final snapset a=[]:{a=[a]} in 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head

As part of handling this op, SnapMapper gets a new mapping:

2023-04-14T16:00:50.136+0200 7f0333cfe700  3 snap_mapper.add_oid 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a a
2023-04-14T16:00:50.136+0200 7f0333cfe700  3 snap_mapper.get_snaps 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a got.empty()
2023-04-14T16:00:50.136+0200 7f0333cfe700  3 snap_mapper.set_snaps 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a a
2023-04-14T16:00:50.136+0200 7f0333cfe700  3 snap_mapper.set_snaps set OBJ_0000000000003B71.0CBEB35E.a.rbd%udata%ef649b1de68a46f%e0000000000000010..
2023-04-14T16:00:50.136+0200 7f0333cfe700  3 snap_mapper.add_oid set SNA_15217_000000000000000A_0000000000003B71.0CBEB35E.a.rbd%udata%ef649b1de68a46f%e0000000000000010..
2023-04-14T16:00:50.136+0200 7f0333cfe700 20 earliest_dup_version = 0
2023-04-14T16:00:50.136+0200 7f0333cfe700 20 trim 525167'2535 (525166'2404) clone    15217:3d7dca70:::rbd_data.f649b1de68a46f.0000000000000010:a by unknown.0.0:0 2023-04-14T16:00:12.676685+0200 0 snaps [a] ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0
2023-04-14T16:00:50.136+0200 7f0333cfe700 20 trim 525167'2536 (525166'2404) modify   15217:3d7dca70:::rbd_data.f649b1de68a46f.0000000000000010:head by client.2103650080.0:470465 2023-04-14T16:00:50.135954+0200 0 ObjectCleanRegions clean_offsets: [4194304~18446744073705357311], clean_omap: 1, new_object: 0

Another read happens on head:

2023-04-14T16:00:58.304+0200 7f0333cfe700 20 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean] do_op: op osd_op(client.2103650080.0:470558 15217.0 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head [read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e525167) v8
...
2023-04-14T16:00:58.304+0200 7f0333cfe700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean] get_object_context: 0x56234197c000 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head rwstate(none n=0 w=0) oi: 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head(525167'2536 client.2103650080.0:470465 dirty|data_digest s 4194304 uv 2536 dd 4b0e55fc alloc_hint [4194304 4194304 0]) exists: 1 ssc: 0x5623896d4c60 snapset: a=[]:{a=[a]}

... and yet another one:

2023-04-14T16:01:00.688+0200 7f0333cfe700 20 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean] do_op: op osd_op(client.2103650080.0:470608 15217.0 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a [read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e525167) v8
...
2023-04-14T16:01:00.688+0200 7f0333cfe700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean] get_object_context: 0x5623a80422c0 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a rwstate(none n=0 w=0) oi: 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a(525167'2535 client.2103650080.0:470197 dirty|data_digest s 4194304 uv 2404 dd 635d715e alloc_hint [0 0 0]) exists: 1 ssc: 0x56238d066000 snapset: a=[]:{a=[a]}
2023-04-14T16:01:00.688+0200 7f0333cfe700 20 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean] find_object_context 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a clone_snaps [a] contains a -- HIT 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a

Now the scrubber comes into play and sees the same (?!) purged snap range [1, f7):

2023-04-14T16:01:14.096+0200 7f034c7c9700  3 snap_mapper.run ok 15213:2b7d2064:::rbd_data.edab6e4267a35a.0000000000000003:d7 snap d7 precedes pool 15214 purged_snaps [1,f7)
2023-04-14T16:01:14.096+0200 7f034c7c9700  3 snap_mapper._parse_m mapping pool 15217 snap a shard 255 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a
2023-04-14T16:01:14.096+0200 7f03304f7700 20 osd.56 op_wq(7) _process empty q, waiting
2023-04-14T16:01:14.096+0200 7f034c7c9700  3 snap_mapper._parse_p purged_snaps pool 15215 [1,f7)
2023-04-14T16:01:14.096+0200 7f034c7c9700  3 snap_mapper._parse_p purged_snaps pool 15216 [1,f7)
2023-04-14T16:01:14.096+0200 7f034c7c9700  3 snap_mapper._parse_p purged_snaps pool 15217 [1,f7)
2023-04-14T16:01:14.096+0200 7f034c7c9700  3 snap_mapper.run stray 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a snap a in pool 15217 shard 255 purged_snaps [1,f7)
...
2023-04-14T16:01:14.096+0200 7f034c7c9700  3 snap_mapper.run stray 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a snap a in pool 15217 shard 255 purged_snaps [1,f7)

SnapTrimmer kills it:

2023-04-14T16:01:14.096+0200 7f0333cfe700 10 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean+snaptrim trimq=[a~1](a)] get_object_context: 0x5623a80422c0 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a rwstate(none n=0 w=0) oi: 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a(525167'2535 client.2103650080.0:470197 dirty|data_digest s 4194304 uv 2404 dd 635d715e alloc_hint [0 0 0]) exists: 1 ssc: 0x56238d066000 snapset: a=[]:{a=[a]}
...
2023-04-14T16:01:14.096+0200 7f0333cfe700  3 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean+snaptrim trimq=[a~1](a)] 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a old_snaps a old snapset a=[]:{a=[a]}
...
2023-04-14T16:01:14.096+0200 7f0333cfe700 20 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean+snaptrim trimq=[a~1](a)] simple_opc_create 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a
2023-04-14T16:01:14.096+0200 7f0333cfe700  3 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean+snaptrim trimq=[a~1](a)] 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a snaps a ->  ... deleting
2023-04-14T16:01:14.096+0200 7f0333cfe700  3 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean+snaptrim trimq=[a~1](a)] 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a new snapset a=[]:{} on 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head(525167'2536 client.2103650080.0:470465 dirty|data_digest s 4194304 uv 2536 dd 4b0e55fc alloc_hint [4194304 4194304 0])
2023-04-14T16:01:14.096+0200 7f0333cfe700  3 osd.56 pg_epoch: 525167 pg[15217.0( v 525167'2635 (0'0,525167'2635] local-lis/les=524009/524010 n=480 ec=521769/521769 lis/c=524009/524009 les/c/f=524010/524010/0 sis=524009) [56,19,78] r=0 lpr=524009 crt=525167'2635 lcod 525167'2633 mlcod 525167'2633 active+clean+snaptrim trimq=[a~1](a)] 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a writing updated snapset on 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:head, snapset is a=[]:{}

Now let's grep for record_purged_snaps:

$ grep record_purged_snaps ceph-osd.56-20230414-debug20.log 
2023-04-14T16:00:12.223+0200 7f03417ce700  3 snap_mapper.record_purged_snaps purged_snaps {525166={}}
2023-04-14T16:00:12.223+0200 7f03417ce700  3 snap_mapper.record_purged_snaps rm 0 keys, set 0 keys
2023-04-14T16:00:20.851+0200 7f03417ce700  3 snap_mapper.record_purged_snaps purged_snaps {525167={}}
2023-04-14T16:00:20.851+0200 7f03417ce700  3 snap_mapper.record_purged_snaps rm 0 keys, set 0 keys
2023-04-14T16:01:15.044+0200 7f03417ce700  3 snap_mapper.record_purged_snaps purged_snaps {525168={}}
2023-04-14T16:01:15.044+0200 7f03417ce700  3 snap_mapper.record_purged_snaps rm 0 keys, set 0 keys

Is the recorded entry for 15217 stale? Why is the range so wide?
(assuming everything happened on pool 15217 AND the clone 15217:03d7dca7:::rbd_data.f649b1de68a46f.0000000000000010:a should still be there)

A question: how exactly the snap removal has been requested?

Actions #79

Updated by Radoslaw Zarzynski about 1 year ago

The debug output from record_purged_snaps comes from here:

void SnapMapper::record_purged_snaps(
  CephContext *cct,
  ObjectStore *store,
  ObjectStore::CollectionHandle& ch,
  ghobject_t hoid,
  ObjectStore::Transaction *t,
  map<epoch_t,mempool::osdmap::map<int64_t,snap_interval_set_t>> purged_snaps)
{
  dout(10) << __func__ << " purged_snaps " << purged_snaps << dendl;
  map<string,bufferlist> m;
  set<string> rm;
  for (auto& [epoch, bypool] : purged_snaps) {
    // index by (pool, snap)
    for (auto& [pool, snaps] : bypool) {
      for (auto i = snaps.begin();
           i != snaps.end();
           ++i) {
...
  t->omap_rmkeys(ch->cid, hoid, rm);
  t->omap_setkeys(ch->cid, hoid, m);
  dout(10) << __func__ << " rm " << rm.size() << " keys, set " << m.size()
           << " keys" << dendl;
}

The input comes from OSDMaps, so we might be able to dig into them (hopefully they aren't trimmed yet) to reconstruct the stimulus to record_purged_snaps.

void OSD::handle_osd_map(MOSDMap *m)
{
...
  // store new maps: queue for disk and put in the osdmap cache
  epoch_t start = std::max(superblock.newest_map + 1, first);
  for (epoch_t e = start; e <= last; e++) {
    if (txn_size >= t.get_num_bytes()) {
      derr << __func__ << " transaction size overflowed" << dendl;
      ceph_assert(txn_size < t.get_num_bytes());
    }
    txn_size = t.get_num_bytes();
    map<epoch_t,bufferlist>::iterator p;
    p = m->maps.find(e);
    if (p != m->maps.end()) {
      dout(10) << "handle_osd_map  got full map for epoch " << e << dendl;
      OSDMap *o = new OSDMap;
      bufferlist& bl = p->second;

      o->decode(bl);

      purged_snaps[e] = o->get_new_purged_snaps();

Could you please create a snap on 15217, make some IO to trigger generating a clone and run the scrubber without removing the snap? If the hypothesis is correct (and the snap seq isn't too high), we should hit a replication.

Could you also describe the procedure once again? I'm pretty sure it's already in one of the comments; I just want to ensure we're on the same page when it comes to test scenario.

Actions #80

Updated by Roel van Meer about 1 year ago

Radoslaw Zarzynski wrote:

Could you also describe the procedure once again? I'm pretty sure it's already in one of the comments; I just want to ensure we're on the same page when it comes to test scenario.

Currently the procedure is as follows. I use one ceph node as a client to map the rbd images on.

2023-04-14 16:00:00 rbd -p test4 create -s 200M test4-disk7
2023-04-14 16:00:04 rbd -p test4 map test4-disk7
2023-04-14 16:00:12 dd if=/dev/urandom of=/dev/rbd8 oflag=direct bs=4M count=50
2023-04-14 16:00:20 rbd -p test4 snap create test4-disk7@snapshot
2023-04-14 16:00:27 rbd -p test4 map test4-disk7@snapshot
2023-04-14 16:00:49 dd if=/dev/urandom of=/dev/rbd8 oflag=direct bs=4M count=50

And on the ceph node having osd.56:

2023-04-14 15:59:51 ceph tell osd.56 config set debug_osd 20/20 
2023-04-14 16:01:13 ceph daemon osd.56 scrub_purged_snaps
2023-04-14 16:01:47 ceph tell osd.56 config set debug_osd 1/1

So I never give any command to remove the snap. rbd info also still reports snapshot_count as 1. Also, rbd du reports the snapshot taking up space:

~# rbd -p test4 du test4-disk7
NAME PROVISIONED USED
test4-disk7@snapshot 200 MiB 200 MiB
test4-disk7 200 MiB 200 MiB
TOTAL 200 MiB 400 MiB

Could you please create a snap on 15217, make some IO to trigger generating a clone and run the scrubber without removing the snap? If the hypothesis is correct (and the snap seq isn't too high), we should hit a replication.

Unless I'm missing something, the latest debug log is exactly that. Please let me know if I misunderstood, or if you need additional test runs.

Thanks again! I hope Cephalocon was nice, I would have liked to be there but unfortunately I didn't have time.

Actions #81

Updated by Radoslaw Zarzynski about 1 year ago

So I never give any command to remove the snap.

and

Unless I'm missing something, the latest debug log is exactly that. Please let me know if I misunderstood, or if you need additional test runs.

Thank you very much for the clarification! I wasn't sure how many snaps were supposed to be served by the killed clone.
This definitely redefines our question to: why are the purged_snaps entries corrupted?
I think that SnapMapper and SnapTrimmer are fine. The problem is likely in record_purged_snaps or maybe even in the monitors. I will read through this code.

Actions #82

Updated by Radoslaw Zarzynski about 1 year ago

I think we could start sniffing around OSDMaps.

[rzarzynski@o06 build]$ bin/ceph osd getmap -o osdmap_last
got osdmap epoch 15
[rzarzynski@o06 build]$ bin/osdmaptool --dump json-pretty osdmap_last | grep new_purged_snaps
bin/osdmaptool: osdmap file 'osdmap_last'
    "new_purged_snaps": [],
[rzarzynski@o06 build]$ bin/ceph osd getmap 14 -o osdmap_14
got osdmap epoch 14
[rzarzynski@o06 build]$ bin/osdmaptool --dump json-pretty osdmap_14 | grep new_purged_snaps
bin/osdmaptool: osdmap file 'osdmap_14'
    "new_purged_snaps": [],
[rzarzynski@o06 build]$ # ... and so on.

Do you recall how long ago (ideally: how many epochs ago) the pool has been created?

Actions #83

Updated by Roel van Meer about 1 year ago

Radoslaw Zarzynski wrote:

Do you recall how long ago (ideally: how many epochs ago) the pool has been created?

Yes, a bit over 4000. We make a copy of the output of ceph osd dump four times a day, which goes into our backups, so I can pretty accurately determine when the pool was created.

I'm getting the osdmaps now, will post the results tomorrow morning first thing!

Actions #84

Updated by Radoslaw Zarzynski about 1 year ago

Yes, a bit over 4000. We make a copy of the output of ceph osd dump four times a day, which goes into our backups, so I can pretty accurately determine when the pool was created.

Wonderful. Thanks, Roel!

Actions #85

Updated by Roel van Meer about 1 year ago

Good morning!

Well, after checking the osdmaps I can say that new_purged_snaps is empty in almost all of the osdmaps, and certainly in the osdmaps that were created during the test runs.

The osdmaps relevant to the latest test are 525166-525168. Between 525166 and 525167 I see snap_seq increment and updates to the last_change and snap_epoch fields. Between 525167 and 525168 the last_purged_snaps_scrub timestamp updates. That's it.

Because I don't really know what you're looking for I uploaded all of the decoded osdmaps to https://1afa.com/ceph/1afa-osdmaps-20230420.tar.xz

Please let me know if I need to look at something else!

Actions #86

Updated by Radoslaw Zarzynski about 1 year ago

Thanks for the data and for scripting it, Roel!

The story of the pool 15217 begins at epoch 521769.

$ less decoded/osdmap_521769.json
...
        {
            "pool": 15217,
            "pool_name": "test4",
            "create_time": "2023-03-15T19:28:35.068043+0100",
            "flags": 32769,
            "flags_names": "hashpspool,creating",

There is just a single OSDMap with a new_purged_snaps entry for 15217:

$ cat ./osdmap_521783.out
{
  "epoch": 521783,
  "modified": "2023-03-15T19:39:53.850227+0100",
  "new_purged_snaps": [
    {
      "pool": 15217,
      "snaps": [
        {
          "begin": 2,
          "length": 1
        }
      ]
    }
  ]
}

Let's see how the snap intervals are encoded:

    mempool::osdmap::map<int64_t, snap_interval_set_t> new_removed_snaps;
    mempool::osdmap::map<int64_t, snap_interval_set_t> new_purged_snaps;
typedef interval_set<
  snapid_t,
  mempool::osdmap::flat_map> snap_interval_set_t;

I'm struggling to quickly grep out the exact JSON dumper, but operator<< should be enough (same data, just a different format):

template<typename T, template<typename, typename, typename ...> class C>
inline std::ostream& operator<<(std::ostream& out, const interval_set<T,C> &s) {
  out << "[";
  bool first = true;
  for (const auto& [start, len] : s) {
    if (!first) out << ",";
    out << start << "~" << len;
    first = false;
  }
  out << "]";
  return out;
}

The contains method is:

  bool contains(T i, T *pstart=0, T *plen=0) const {
    auto p = find_inc(i);
    if (p == m.end()) return false;
    if (p->first > i) return false;
    if (p->first+p->second <= i) return false;
    ceph_assert(p->first <= i && p->first+p->second > i);
    if (pstart)
      *pstart = p->first;
    if (plen)
      *plen = p->second;
    return true;
  }

This means the mon-agreed interval of already trimmed snaps on the pool is [2, 3), not [1, f7). If so, the corruption is on the OSD side, which nicely cuts the monitors out of the investigation, at least for now :-).
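
As a quick sanity check of that reading (a small sketch modeling contains() for the decoded interval; Python purely for illustration):

# {"begin": 2, "length": 1} is the half-open interval [2, 3), printed as "2~1"
purged = [(2, 1)]  # (start, length) pairs

def contains(intervals, snap):
    return any(start <= snap < start + length for start, length in intervals)

print(contains(purged, 0x2))  # True  -- the only snap the monitors consider purged
print(contains(purged, 0xa))  # False -- so snap a should never be treated as purged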

Will be digging into record_purged_snaps.

It might be worth looking in the OSD's OMAP for other similar (ending at f7?) purged-snaps entries. I wouldn't be surprised if there is more junk. I will post details in the evening / over the weekend.

Actions #87

Updated by Radoslaw Zarzynski about 1 year ago

The purged snap info lives in the OMAP of a single, per-OSD object, but I doubt it's accessible via the standard CLI tooling (the rados CLI).

const char *SnapMapper::PURGED_SNAP_PREFIX = "PSN_";

// -- purged snaps --

string SnapMapper::make_purged_snap_key(int64_t pool, snapid_t last)
{
  char k[80];
  snprintf(k, sizeof(k), "%s_%llu_%016llx", PURGED_SNAP_PREFIX,
       (unsigned long long)pool, (unsigned long long)last);
  return k;
}

void SnapMapper::make_purged_snap_key_value(
  int64_t pool, snapid_t begin, snapid_t end, map<string,bufferlist> *m)
{
  string k = make_purged_snap_key(pool, end - 1);
  auto& v = (*m)[k];
  ceph::encode(pool, v);
  ceph::encode(begin, v);
  ceph::encode(end, v);
}

void SnapMapper::record_purged_snaps(
  CephContext *cct,
  ObjectStore *store,
  ObjectStore::CollectionHandle& ch,
  ghobject_t hoid,
  ObjectStore::Transaction *t,
  map<epoch_t,mempool::osdmap::map<int64_t,snap_interval_set_t>> purged_snaps)
{
  dout(10) << __func__ << " purged_snaps " << purged_snaps << dendl;
...
  t->omap_rmkeys(ch->cid, hoid, rm);
  t->omap_setkeys(ch->cid, hoid, m);
  dout(10) << __func__ << " rm " << rm.size() << " keys, set " << m.size()
       << " keys" << dendl;
}

void OSD::handle_osd_map(MOSDMap *m)
{
...
  // record new purged_snaps
  if (superblock.purged_snaps_last == start - 1) {
    SnapMapper::record_purged_snaps(cct, store, service.meta_ch,
                    make_purged_snaps_oid(), &t,
                    purged_snaps);
    superblock.purged_snaps_last = last;
  } else {
    dout(10) << __func__ << " superblock purged_snaps_last is " 
         << superblock.purged_snaps_last
         << ", not recording new purged_snaps" << dendl;
  }
...
}

class OSD : public Dispatcher,
        public md_config_obs_t {
...
  static ghobject_t make_purged_snaps_oid() {
    return ghobject_t(hobject_t(
      sobject_t(
    object_t("purged_snaps"),
    0)));
  }
Actions #88

Updated by Roel van Meer about 1 year ago

Radoslaw Zarzynski wrote:

The purged snap info lives in the OMAP of a single, per-OSD object, but I doubt it's accessible via the standard CLI tooling (the rados CLI).

What does this mean? Are we stuck?
I doubt I am capable of hacking together a tool that can read these values on the fly. We're getting above my pay grade here. :)

Actions #89

Updated by Radoslaw Zarzynski almost 1 year ago

Nope, not yet :-).
The problem is the testing could become more invasive – for instance, employing the ceph-objectstore-tool requires stopping the OSD.
COT is already capable of dumping an object's OMAP:

void usage(po::options_description &desc)
{
...
    cerr << "ceph-objectstore-tool ... <object> (get|rm)-(attr|omap) <key>" << std::endl;
...
    cerr << "ceph-objectstore-tool ... <object> list-omap" << std::endl;
...
    cerr << "ceph-objectstore-tool ... <object> dump" << std::endl;

The exact object with purged snaps entries in its OMAP is:

    return ghobject_t(hobject_t(
      sobject_t(
    object_t("purged_snaps"),
    0)));
Actions #90

Updated by Roel van Meer almost 1 year ago

Radoslaw Zarzynski wrote:

The exact object with purged snaps entries in its OMAP is:

I'm sorry to say that I haven't been able to figure it out yet. We're trying to get at the correct objects (on our test cluster), but no success so far.

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --op list

gives (abbreviated):

Error getting attr on : 14.63_head,#-16:c6000000:::scrub_14.63:head#, (61) No data available
["14.54",{"oid":"rbd_object_map.6dfb81894e9a69","key":"","snapid":-2,"hash":3124384212,"max":0,"pool":14,"namespace":"","max":0}]
["14.54",{"oid":"rbd_data.091ca57634970b.0000000000000792","key":"","snapid":-2,"hash":4240443860,"max":0,"pool":14,"namespace":"","max":0}]
["14.63",{"oid":"scrub_14.63","key":"","snapid":-2,"hash":99,"max":0,"pool":-16,"namespace":"","max":0}]
["14.63",{"oid":"rbd_data.091ca57634970b.000000000000025f","key":"","snapid":-2,"hash":1553388643,"max":0,"pool":14,"namespace":"","max":0}]
["14.65",{"oid":"rbd_id.vm-10401-disk-0","key":"","snapid":-2,"hash":343770853,"max":0,"pool":14,"namespace":"","max":0}]
["14.65",{"oid":"rbd_data.091ca57634970b.00000000000004ea","key":"","snapid":-2,"hash":727086821,"max":0,"pool":14,"namespace":"","max":0}]
...

So only objects named rbd_object_map.*, rbd_data.*, rbd_id.* and scrub_*.*, but nothing resembling purged_snaps.
The actual OSD under investigation returns similar results.

Could you please elaborate on how to get at the purged_snaps data?

Actions #91

Updated by Ilya Dryomov almost 1 year ago

Radek, just in case you are curious where that one actually removed snapshot is coming from even though Roel doesn't issue any snapshot removal commands: the first "rbd create" command on a pool allocates and immediately releases a dummy self-managed snapshot ID as part of validating that the pool is suitable for use as an RBD data pool. As an (intended) side effect, the pool switches itself into selfmanaged_snaps mode and the corresponding flag gets set.

Roel, could you please append "--rbd-validate-pool false" to the "rbd create" command and repeat the test (on a fresh pool, as usual). This should tell us whether the trigger is the dummy snapshot removal or whether the record of purged snaps is somehow corrupted to begin with.
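
(For example, something along the lines of "rbd -p test4 create -s 200M --rbd-validate-pool false test4-disk8", mirroring the earlier reproduction commands; the image name here is only a placeholder.)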

Actions #92

Updated by Roel van Meer almost 1 year ago

Hi Ilya!!

When I create the first rbd image with the --rbd-validate-pool false option, the problem doesn't occur, even after calling ceph daemon osd.x scrub_purged_snaps. However, if I then create and remove a snapshot on the pool by hand, the problem resurfaces. So it seems the problem only happens after the first snapshot is removed. Whether that happens as part of the pool validation or by hand doesn't seem to matter.

Actions #93

Updated by Radoslaw Zarzynski 12 months ago

I'm sorry to say that I haven't been able to figure it out yet. We're trying to get at the correct objects (on our test cluster), but no success so far.

Hi Roel!
I'm sorry for the missing description last time.

The problem with COT is that --op list skips the meta collection; --op meta-list should be used instead.

On my dev cluster:

$ bin/ceph-objectstore-tool --no-mon-config --data-path dev/osd0 --op meta-list | grep purg
["meta",{"oid":"purged_snaps","key":"","snapid":0,"hash":1983890819,"max":0,"pool":-1,"namespace":"","max":0}]

Further steps:

$ bin/ceph-objectstore-tool --no-mon-config --data-path dev/osd0  '["meta",{"oid":"purged_snaps","key":"","snapid":0,"hash":1983890819,"max":0,"pool":-1,"namespace":"","max":0}]' list-omap
PSN__5_0000000000000002
$ bin/ceph-objectstore-tool --no-mon-config --data-path dev/osd0  '["meta",{"oid":"purged_snaps","key":"","snapid":0,"hash":1983890819,"max":0,"pool":-1,"namespace":"","max":0}]' get-omap PSN__5_0000000000000002
Base64:BQAAAAAAAAACAAAAAAAAAAMAAAAAAAAA

Decoding of those 3 integers will need to be performed manually, I'm afraid, as we're lacking a dedicated type for ceph-dencoder.
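
For what it's worth, here is a minimal decoding sketch, assuming the value is just the three little-endian 64-bit integers written by make_purged_snap_key_value() above (Python purely for illustration; decode_psn_value() is a made-up name):

import base64, struct

def decode_psn_value(b64):
    # ceph::encode of (int64_t pool, snapid_t begin, snapid_t end) lays the three
    # values out as consecutive little-endian 64-bit integers.
    pool, begin, end = struct.unpack('<3Q', base64.b64decode(b64))
    return pool, begin, end

print(decode_psn_value('BQAAAAAAAAACAAAAAAAAAAMAAAAAAAAA'))
# -> (5, 2, 3), i.e. pool 5, purged interval [2, 3)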

Actions #94

Updated by Roel van Meer 12 months ago

Success!

Attached is a tarball (purged-snaps-omap.tar.gz) with some of the results. We found several things that stood out:

- The purged_snaps object had ~92000 omap keys (listed in purged_snaps.list-omap)
- For two pools there were many omap keys (pool 1 has 837, pool 2 has 90771 keys). The other pools just had 1 key.
- Interestingly, pool 2 (which is the OLD ssd pool that has been removed) no longer exists. I'm not sure if it is correct that the omap keys for this pool still exist.

Since extracting the omap data was a bit slow, we didn't extract all ~92000 keys, but just a subset. The extracted values are in the omap dir in the tarball.

Actions #95

Updated by Roel van Meer 12 months ago

Hi Ilya, Radoslaw,

there's something else that I need to share. At the end of this week my employment at 1afa comes to an end. My colleague Michiel will take over this case from our end (although I will remain involved as well as I can, since I would like to see this brought to the finish line).
I would like to express my gratitude to you both for your time and patience in helping to figure out this problem. It has been a pleasure working with you!

Best regards and good health, Roel

Actions #96

Updated by Radoslaw Zarzynski 12 months ago

Hi Roel!

Thank you for the priceless inputs and for your entire involvement!

void SnapMapper::record_purged_snaps(
  CephContext *cct,
  OSDriver& backend,
  OSDriver::OSTransaction&& txn,
  map<epoch_t,mempool::osdmap::map<int64_t,snap_interval_set_t>> purged_snaps)
{
  dout(10) << __func__ << " purged_snaps " << purged_snaps << dendl;
  map<string,ceph::buffer::list> m;
  set<string> rm;
  for (auto& [epoch, bypool] : purged_snaps) {
    // index by (pool, snap) 
    for (auto& [pool, snaps] : bypool) {
...
    }
  } 
  txn.remove_keys(rm);
  txn.set_keys(m);
  dout(10) << __func__ << " rm " << rm.size() << " keys, set " << m.size()
           << " keys" << dendl;
}

I will be digging through the data tomorrow, but I've started thinking about a hypothesis: in theory record_purged_snaps() is also responsible for removing the keys, but I'm not sure about its behavior on pool removal. What if this bit is inoperative or even missing AND we recycle, even accidentally, a pool ID?

Your IDs are pretty high. Do you recall by any chance how many pools were created (and then maybe deleted) during the disaster recoveries?

Regards,
Radek

Actions #97

Updated by Roel van Meer 12 months ago

Radoslaw Zarzynski wrote:

Your IDs are pretty high. Do you recall by any chance how many pools were created (and then maybe deleted) during the disaster recoveries?

As I recall, during the actual disaster recovery only one pool was deleted (the original "ssd" pool, with ID 2) and one pool was created (the new "ssd" pool with ID 15213). Although my recollection may be incorrect, as that doesn't explain why the device_health_metrics pool has ID 15113.

In any case, for a while after setting up this cluster we ran a performance test, which created a pool, ran some IO measurements, and deleted the pool, so that's the reason why the pool IDs increased so much. I am pretty sure that we stopped doing this immediately when we had to recreate the ssd pool, but I have no explanation for the gap between the IDs of the "device_health_metrics" and "ssd" pools. Sorry that I cannot be more clear here.

Actions #98

Updated by Radoslaw Zarzynski 12 months ago

I was able to reproduce a stale PSN mapping locally (1 mon / 1 mgr / 1 osd / 1 pool / 1 PG).

[rzarzynski@o06 build]$ bin/rados -p test-pool rmsnap snap1
2023-05-05T18:41:20.675+0000 7fa158789080 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
2023-05-05T18:41:20.679+0000 7fa158789080 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
2023-05-05T18:41:20.679+0000 7fa158789080 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
removed pool test-pool snap snap1
[rzarzynski@o06 build]$ bin/rados -p test-pool rmsnap snap2
2023-05-05T18:41:22.597+0000 7f2b00602080 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
2023-05-05T18:41:22.600+0000 7f2b00602080 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
2023-05-05T18:41:22.600+0000 7f2b00602080 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
removed pool test-pool snap snap2

Note: the path from new_removed_snaps over new_purged_snaps to record_purged_snaps() is pretty bumpy and involves mon -> osd -> mgr (PGMapDigest) -> mon -> osd communication (without the MGR it's impossible to generate the PSN entries).

out/osd.0.log:2023-05-05T18:41:29.533+0000 7f6525386700 10 snap_mapper.record_purged_snaps purged_snaps {14={1=[1~2]}}                                                                                            
out/osd.0.log-2023-05-05T18:41:29.533+0000 7f6525386700 20 snap_mapper._lookup_purged_snap pool 1 snap 0 key 'PSN__1_0000000000000000' lower_bound not found                                                      
out/osd.0.log-2023-05-05T18:41:29.533+0000 7f6525386700 20 snap_mapper._lookup_purged_snap pool 1 snap 3 key 'PSN__1_0000000000000003' lower_bound not found                                                      
[rzarzynski@o06 build]$ bin/ceph osd pool delete test-pool test-pool --yes-i-really-really-mean-it
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2023-05-05T18:42:41.606+0000 7f59dc947700 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
2023-05-05T18:42:41.616+0000 7f59dc947700 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
pool 'test-pool' removed
2023-05-05T18:42:42.069+0000 7f6525386700 10 snap_mapper.record_purged_snaps purged_snaps {15={}}                                                                                                   
2023-05-05T18:42:42.069+0000 7f6525386700 10 snap_mapper.record_purged_snaps rm 0 keys, set 0 keys                                                                                                  
2023-05-05T18:42:42.069+0000 7f6525386700 10 osd.0 14 write_superblock sb(0971350a-aa67-4836-8ab6-ad8b3d027092 osd.0 7b38f137-fb94-4c3f-adad-a2ae8f57205a e15 [1,15] lci=[5,15] tlb=1)
...
2023-05-05T18:42:42.082+0000 7f6517e9b700  5 osd.0 pg_epoch: 15 pg[1.0( v 13'9 (0'0,13'9] lb MIN local-lis/les=7/8 n=1 ec=7/7 lis/c=7/7 les/c/f=8/8/0 sis=15 pruub=14.037259102s) [] r=-1 lpr=15 DELETING pi=[7,15)/1 crt=13'9 lcod 12'7 mlcod 0'0 unknown NOTIFY pruub 162.010543823s@ mbc={}] exit Started/ToDelete/Deleting 0.010747 2 0.003075
[rzarzynski@o06 build]$ killall ceph-osd
2023-05-05T18:43:43.496+0000 7f6534320700 -1 received  signal: Terminated from killall ceph-osd  (PID: 1666068) UID: 1111
2023-05-05T18:43:43.496+0000 7f6534320700 -1 osd.0 15 *** Got signal Terminated ***
[rzarzynski@o06 build]$ bin/ceph-objectstore-tool --no-mon-config --data-path dev/osd0 --op meta-list | grep purg
["meta",{"oid":"purged_snaps","key":"","snapid":0,"hash":1983890819,"max":0,"pool":-1,"namespace":"","max":0}]
[rzarzynski@o06 build]$ bin/ceph-objectstore-tool --no-mon-config --data-path dev/osd0 '["meta",{"oid":"purged_snaps","key":"","snapid":0,"hash":1983890819,"max":0,"pool":-1,"namespace":"","max":0}]' list-omap
PSN__1_0000000000000002
[rzarzynski@o06 build]$ bin/ceph-objectstore-tool --no-mon-config --data-path dev/osd0 '["meta",{"oid":"purged_snaps","key":"","snapid":0,"hash":1983890819,"max":0,"pool":-1,"namespace":"","max":0}]' get-omap PSN__1_0000000000000002
Base64:AQAAAAAAAAABAAAAAAAAAAMAAAAAAAAA

The PG deletion procedure cleans up the regular (SNA) mappings, but I cannot find any code that would take care of the purged snaps (PSN) entries.

std::pair<ghobject_t, bool> PG::do_delete_work(
  ObjectStore::Transaction &t,
  ghobject_t _next)
{
...
  OSDriver::OSTransaction _t(osdriver.get_transaction(&t));
  int64_t num = 0;
  for (auto& oid : olist) {
    if (oid == pgmeta_oid) {
      continue;
    }
    if (oid.is_pgmeta()) {
      osd->clog->warn() << info.pgid << " found stray pgmeta-like " << oid
                        << " during PG removal";
    }
    int r = snap_mapper.remove_oid(oid.hobj, &_t);
    if (r != 0 && r != -ENOENT) {
      ceph_abort();
    }
    t.remove(coll, oid);
    ++num;
  }
...

After bringing the OSD back the mapping is still visible to SnapMapper::Scrubber.

[rzarzynski@o06 build]$ /home/rzarzynski/ceph1/build/bin/ceph-osd -i 0 -c /home/rzarzynski/ceph1/build/ceph.conf                                                                                                  
2023-05-05T19:39:03.763+0000 7fc0e7d17640 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
2023-05-05T19:39:03.766+0000 7fc0e7d17640 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
2023-05-05T19:39:03.768+0000 7fc0e7d17640 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
2023-05-05T19:39:04.792+0000 7fc0e7d17640 -1 Falling back to public interface
2023-05-05T19:39:06.316+0000 7fc0e7d17640 -1 osd.0 16 log_to_monitors true
[rzarzynski@o06 build]$ bin/ceph tell osd.\* scrub_purged_snaps                                                                                                                                                   
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2023-05-05T19:39:11.779+0000 7f46011d0700 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
2023-05-05T19:39:11.786+0000 7f46011d0700 -1 WARNING: all dangerous and experimental features are enabled.

2023-05-05T19:39:11.810+0000 7fc0de8ce700 20 snap_mapper._parse_p purged_snaps pool 1 [1,3)

Verified on reasonably fresh main.
TODO: check the pool ID reuse policy to judge how bad this really is.

Actions #99

Updated by Radoslaw Zarzynski 12 months ago

At first glance we don't reuse pool IDs. If this stays true, the impact should be pretty limited, basically to the overrun case. What makes me nervous is how Roel's cluster ran into this. Do we reclaim them in some corner cases?

Leaving junk behind is another story.

[rzarzynski@o06 build]$ bin/ceph osd lspools                                                                                                                                                                       
2023-05-08T11:49:45.190+0000 7f4c4a968700 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                         
2023-05-08T11:49:45.196+0000 7f4c4a968700 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                         
2 test-pool
[rzarzynski@o06 build]$ bin/ceph pg dump                                                                                                                                                                          
2023-05-08T11:51:39.652+0000 7fb16c6c5700 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
2023-05-08T11:51:39.658+0000 7fb16c6c5700 -1 WARNING: all dangerous and experimental features are enabled.                                                                                                        
version 117333
stamp 2023-05-08T11:51:39.359158+0000
last_osdmap_epoch 0
last_pg_scan 0                                                                                                                                                                                                    
PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  LOG_DUPS  DISK_LOG  STATE         STATE_STAMP                      VERSION  REPORTED  UP   UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP                      LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP                 SNAPTRIMQ_LEN  LAST_SCRUB_DURATION  SCRUB_SCHEDULING                                            OBJECTS_SCRUBBED  OBJECTS_TRIMMED
2.0            0                   0         0          0        0      0            0           0    0         0         0  active+clean  2023-05-08T11:49:38.710214+0000      0'0     24:10  [0]           0     [0]               0         0'0  2023-05-08T11:49:37.698878+0000              0'0  2023-05-08T11:49:37.698878+0000              0                    0  periodic scrub scheduled @ 2023-05-09T15:58:01.697344+0000                 0                0

2  0  0  0  0  0  0  0  0  0  0

sum  0  0  0  0  0  0  0  0  0  0
OSD_STAT  USED     AVAIL    USED_RAW  TOTAL    HB_PEERS  PG_SUM  PRIMARY_PG_SUM
0         1.0 GiB  100 GiB   1.0 GiB  101 GiB        []       1               1
sum       1.0 GiB  100 GiB   1.0 GiB  101 GiB

* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilization. See http://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics for further details.
dumped all
[rzarzynski@o06 build]$ bin/ceph osd pool delete test-pool test-pool --yes-i-really-really-mean-it
2023-05-08T11:52:38.342+0000 7f8530459700 -1 WARNING: all dangerous and experimental features are enabled.
2023-05-08T11:52:38.352+0000 7f8530459700 -1 WARNING: all dangerous and experimental features are enabled.
pool 'test-pool' removed
[rzarzynski@o06 build]$ bin/ceph -c ceph.conf osd pool create test-pool 1 1 && bin/ceph -c ceph.conf osd pool set test-pool size 1 --yes-i-really-mean-it && bin/ceph -c ceph.conf osd pool set test-pool min_size 1
2023-05-08T11:52:47.020+0000 7f8e20708700 -1 WARNING: all dangerous and experimental features are enabled.
2023-05-08T11:52:47.030+0000 7f8e20708700 -1 WARNING: all dangerous and experimental features are enabled.
pool 'test-pool' created
2023-05-08T11:52:48.172+0000 7f9204d82700 -1 WARNING: all dangerous and experimental features are enabled.
2023-05-08T11:52:48.183+0000 7f9204d82700 -1 WARNING: all dangerous and experimental features are enabled.
set pool 3 size to 1
2023-05-08T11:52:49.171+0000 7fa723a05700 -1 WARNING: all dangerous and experimental features are enabled.
2023-05-08T11:52:49.180+0000 7fa723a05700 -1 WARNING: all dangerous and experimental features are enabled.
set pool 3 min_size to 1
[rzarzynski@o06 build]$ bin/ceph osd lspools
2023-05-08T11:52:53.239+0000 7fbbefcce700 -1 WARNING: all dangerous and experimental features are enabled.
2023-05-08T11:52:53.245+0000 7fbbefcce700 -1 WARNING: all dangerous and experimental features are enabled.
3 test-pool
[rzarzynski@o06 build]$ bin/ceph pg dump
2023-05-08T11:52:56.550+0000 7f113097b700 -1 WARNING: all dangerous and experimental features are enabled.
2023-05-08T11:52:56.556+0000 7f113097b700 -1 WARNING: all dangerous and experimental features are enabled.
version 117375
stamp 2023-05-08T11:52:55.381503+0000
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  LOG_DUPS  DISK_LOG  STATE         STATE_STAMP                      VERSION  REPORTED  UP   UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP                      LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP                 SNAPTRIMQ_LEN  LAST_SCRUB_DURATION  SCRUB_SCHEDULING                                            OBJECTS_SCRUBBED  OBJECTS_TRIMMED
3.0            0                   0         0          0        0      0            0           0    0         0         0  active+clean  2023-05-08T11:52:49.053458+0000      0'0      27:9  [0]           0     [0]               0         0'0  2023-05-08T11:52:48.043648+0000              0'0  2023-05-08T11:52:48.043648+0000              0                    0  periodic scrub scheduled @ 2023-05-09T20:53:37.562042+0000                 0                0

3  0  0  0  0  0  0  0  0  0  0

sum  0  0  0  0  0  0  0  0  0  0
OSD_STAT  USED     AVAIL    USED_RAW  TOTAL    HB_PEERS  PG_SUM  PRIMARY_PG_SUM
0         1.0 GiB  100 GiB   1.0 GiB  101 GiB        []       1               1
sum       1.0 GiB  100 GiB   1.0 GiB  101 GiB                                  

* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilization. See http://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics for further details.
dumped all
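
The manual check above can also be scripted; a quick sketch (assumes a vstart cluster, the one-pool-per-line lspools output shown above, and that nothing else creates pools concurrently; the pool name is a placeholder):

last=0
for i in 1 2 3 4 5; do
    bin/ceph -c ceph.conf osd pool create reuse-check 1 1 >/dev/null
    id=$(bin/ceph osd lspools | awk '$2 == "reuse-check" {print $1}')
    echo "iteration $i: created pool id $id"
    [ "$id" -gt "$last" ] || echo "POOL ID REUSED: $id <= $last"
    last=$id
    bin/ceph osd pool delete reuse-check reuse-check --yes-i-really-really-mean-it >/dev/null
done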
Actions #100

Updated by Predrag Mandic 10 months ago

Hello,

We seem to have a customer hitting this same issue. What data can we provide to help get it resolved?

Kind regards,
Predrag Mandic

Actions #101

Updated by Michiel van den Berg 6 months ago

Hello good morning,

I'm very sorry for such a late reply to this thread.
The problem has been resolved at my site. The fix, I believe, was a full repair of our cluster; in more detail:

While doing routine maintenance on the environment I ran into an issue with a disk reporting the old omap structure; apparently some drives were never upgraded to the per-pg omap structure. I first ran fsck on a few OSDs to try to fix it, and eventually wrote a script, which ran for about two weeks, that brought down every OSD one at a time, ran a repair, and brought it up again. Some of the affected OSDs were the ones used in our previous tests.
After the repair was done on the full cluster I have not been able to reproduce the issue any more.
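
For reference, a rough sketch of the loop described above (assuming systemd-managed, non-containerized OSDs with data under /var/lib/ceph/osd/; the OSD list and the health wait are placeholders, and ceph-bluestore-tool repair is, as far as I know, what converts the legacy omap layout to the per-pg format):

for osd in 0 1 2; do                                   # placeholder OSD IDs
    systemctl stop ceph-osd@"$osd"
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-"$osd"
    systemctl start ceph-osd@"$osd"
    # let the cluster settle before touching the next OSD
    while ! ceph health | grep -q HEALTH_OK; do sleep 30; done
done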

Kind regards,

Actions #102

Updated by Adam Kupczyk about 2 months ago

  • Priority changed from High to Normal
Actions
