Bug #2178
rbd: corruption of first block
Associated revisions
filestore: fix collection_move guard
We had a sequence like:
1- write A block 1
2- write A block 2
3- write A block 3
4- write A block 4
5- move A -> B
- link B
- unlink A
- set guard on B <crash>
- replay 3, 4, 5
with the result being B with only half of its content. The problem is that
we destroyed the old link and didn't guard the new content. Instead,
set the guard before the link, and replay the unlink step here
unconditionally.
Fixes: #2178
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
History
#1 Updated by Oliver Francke over 11 years ago
- File vm-451-rb.0.0.000000000000.rbd.bz2 added
Here is one of many, where the header is missing:
--- 8-< ---
fcms@fcmsnode3:~$ rbd ls 1320396354
vm-451-disk-1.rbd
fcms@fcmsnode3:~$ rbd info 1320396354/vm-451-disk-1.rbd
2012-03-15 09:58:48.090190 7f58cc1f3760 librbd: Error reading header: (2) No such file or directory
error opening image vm-451-disk-1.rbd: (2) No such file or directory
--- 8-< ---
There is only one image, so the prefix is easily detectable as "rb.0.0."
From the "ceph osd dump | grep 1320396354" I get:
--- 8-< ---
pool 97 '1320396354' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 lpg_num 0 lpgp_num 0 last_change 349 owner 18446744073709551615
--- 8-< ---
The osd.2.log now shows a bunch of:
--- 8-< ---
2012-03-15 09:58:44.686768 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing 35a41435/rb.0.0.000000000078/head
2012-03-15 09:58:44.686773 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing d27de435/rb.0.0.00000000048e/head
2012-03-15 09:58:44.686779 7ff1cd48f700 log [ERR] : 97.5 osd.4: soid 75619735/rb.0.0.000000000221/head size 2052096 != known size 4194304
2012-03-15 09:58:44.686785 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing 3a874a45/rb.0.0.000000000222/head
2012-03-15 09:58:44.686791 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing 73768555/rb.0.0.00000000023b/head
.
.
.
2012-03-15 09:58:44.687147 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing ae19a0d5/rb.0.0.0000000005e5/head
2012-03-15 09:58:44.687152 7ff1cd48f700 log [ERR] : 97.5 osd.4: soid 387e23d5/rb.0.0.000000000231/head size 1355776 != known size 4194304
2012-03-15 09:58:44.687158 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing ea2a44d5/rb.0.0.00000000069e/head
.
.
.
2012-03-15 09:58:44.687612 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing 533fafd/rb.0.0.000000000526/head
2012-03-15 09:58:44.687618 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing cc8b7dfd/rb.0.0.0000000006ff/head
2012-03-15 09:58:44.687623 7ff1cd48f700 log [ERR] : 97.5 scrub 151 missing, 3 inconsistent objects
2012-03-15 09:58:44.688311 7ff1cd48f700 log [ERR] : 97.5 scrub stat mismatch, got 158/4 objects, 0/0 clones, 647046656/9236480 bytes.
2012-03-15 09:58:44.688324 7ff1cd48f700 log [ERR] : 97.5 scrub 1 errors
2012-03-15 09:58:45.288066 7ff1cdc90700 log [INF] : 1.65 scrub ok
--- 8-< ---
... and from other pools, too...
Additionally I got the first block from the image; it looks almost identical to
an intact one, but on an intact one I can run "fdisk -l", which might not be
evidence of anything... I'm not sure yet.
Healthy:
--- 8-< ---
root@fcmsnode3:/home/fcms# fdisk -l rb.0.c.000000000000
You must set cylinders.
You can do this from the extra functions menu.
Disk rb.0.c.000000000000: 0 MB, 0 bytes
255 heads, 63 sectors/track, 0 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0009851f
Device Boot Start End Blocks Id System
rb.0.c.000000000000p1 * 1 6375 51198975+ 83 Linux
Partition 1 has different physical/logical endings:
phys=(1023, 254, 63) logical=(6374, 26, 51)
--- 8-< ---
not so healthy:
--- 8-< ---
root@fcmsnode3:/home/fcms# fdisk -l vm-451-rb.0.0.000000000000.rbd
root@fcmsnode3:/home/fcms#
--- 8-< ---
Some information is already missing from this block... and the repeated
question: why is there a change in the very first block at all... sigh.
So, if I now create a header and put it back into the pool, I should be able to
export the image and then check via testdisk/fsck & friends how much data I can
save for the customer...
More ideas and clues are very welcome,
Oliver.
#2 Updated by Oliver Francke over 11 years ago
Hi *,
any update on this topic? We have been working for hours and days with three people to rescue as many images as possible. Most of them have - at least - corrupted first blocks, judging from "initramfs", "grub-rescue" and "fsck" errors.
Well, we cannot help them, but if one could identify this kind of error, this would raise acceptance again.
Thnx n regards,
Oliver.
#3 Updated by Josh Durgin over 11 years ago
I looked at the block you attached, and compared it to the first 4MiB of my desktop's hard drive. It looks like it contains an MBR, but not a partition table. The output from gdisk seems to indicate that the actual partition table is in another object:
$ gdisk -l vm-451-rb.0.0.000000000000.rbd
GPT fdisk (gdisk) version 0.6.14

Error seeking to or reading logical partition data from 39847934!
Aborting!

Partition table scan:
  MBR: MBR only
  BSD: not present
  APM: not present
  GPT: not present

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format.
***************************************************************

Warning! Secondary partition table overlaps the last partition by
39837736 blocks!
You will need to delete this partition or resize it in another utility.
Disk vm-451-rb.0.0.000000000000.rbd: 8185 sectors, 4.0 MiB
Logical sector size: 512 bytes
Disk identifier (GUID): A5909945-8E64-4170-881F-9A8045C60899
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 8151
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        39845887   19.0 GiB    0700  Linux/Windows data
This is similar to my sda, which contains the MBR, but stores the partition table on the first partition:
$ sudo gdisk -l /dev/sda
GPT fdisk (gdisk) version 0.6.14

Partition table scan:
  MBR: MBR only
  BSD: not present
  APM: not present
  GPT: not present

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format.
***************************************************************

Disk /dev/sda: 156250000 sectors, 74.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 1B9FC214-F825-4B14-8759-3C74F61F5114
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 156249966
Partitions will be aligned on 2048-sector boundaries
Total free space is 5965 sectors (2.9 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048       149995519   71.5 GiB    0700  Linux/Windows data
   5       149997568       156248063   3.0 GiB     8200  Linux swap

$ gdisk -l /dev/sda1
GPT fdisk (gdisk) version 0.6.14

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries.

Disk /dev/sda1: 149993472 sectors, 71.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 3ED0C07D-011D-4372-8F0C-AE2105E9FC9B
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 149993438
Partitions will be aligned on 2048-sector boundaries
Total free space is 149993405 sectors (71.5 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
I tried expanding the first block to 20G, and gdisk got further, but stopped when the signature in the partition table (at sector 39847934) was all zeroes:
$ truncate -s 20G vm-451-rb.0.0.000000000000.rbd
$ gdisk -l vm-451-rb.0.0.000000000000.rbd
GPT fdisk (gdisk) version 0.6.14

MBR signature in logical partition invalid; read 0x0000, but should be 0xAA55

Partition table scan:
  MBR: MBR only
  BSD: not present
  APM: not present
  GPT: not present

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format.
***************************************************************

Disk vm-451-rb.0.0.000000000000.rbd: 41943040 sectors, 20.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): BA326ACF-590A-49B0-890F-75FCA4E6344D
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 41943006
Partitions will be aligned on 2048-sector boundaries
Total free space is 2099133 sectors (1.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        39845887   19.0 GiB    0700  Linux/Windows data
This means the unbootable guests could be caused by missing objects later in the image, since a missing object is treated as zeroes by rbd.
Do other first objects have the same behavior as this one? That is, do they have MBRs pointing to other areas of the disk that might be missing objects?
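One quick way to check whether a particular object is present, for example (the pool name is taken from comment #1; the object name here is just for illustration), would be something like:
rados -p 1320396354 ls | grep rb.0.0.000000001300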
#4 Updated by Oliver Francke over 11 years ago
Hi Josh,
thanks for taking the time to investigate this... And yes, many others show the same behaviour. Is "the next block" the next one in the numeric sequence, i.e. ...0 -> ...1 (or whatever comes next)?
Thnx n regards,
Oliver.
#5 Updated by Josh Durgin over 11 years ago
The next object is whatever the MBR points to. You can find the object name from the sector offset that gdisk gives you with:
perl -e 'printf "BLOCK_NAME_PREFIX.%012x\n", ((SECTOR_OFFSET * 512) / (4 * 1024 * 1024))'
In this case, BLOCK_NAME_PREFIX was rb.0.0, and gdisk told us the sector offset: "Error seeking to or reading logical partition data from 39847934!", so the logical partition header would be in the object named rb.0.0.000000001300
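As a sanity check, plugging the numbers from above into that formula:
perl -e 'printf "rb.0.0.%012x\n", ((39847934 * 512) / (4 * 1024 * 1024))'
rb.0.0.000000001300
(39847934 * 512 = 20402142208 bytes, which lands in object number 4864 = 0x1300.)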
#6 Updated by Oliver Francke over 11 years ago
Good morning ;)
meanwhile I have not been lazy. I've managed - with the current setup in the test-cluster - to produce "inconsistencies" in the PGs, as follows:
1. write a file sequentially (a file in each VM, 4 VMs running in total)
2. stop/start the OSDs while the file is being written
3. after everything is synced, there are "holes" somewhere in the file, and not at the end.
@Sage: we have another 2 images from customers with the "no header" problem; it would be cool if you had time to have a look at them 8-)
Kind regards,
Oliver.
#7 Updated by Oliver Francke over 11 years ago
Well,
one more comment:
my guess would be that it has something to do with expansion of the "sparse file" while writing new files, which is what my test script currently does.
Best regards,
Oliver.
#8 Updated by Sage Weil over 11 years ago
The next step is to reproduce the corruption on the test cluster with logs:
debug osd = 20
debug ms = 1
debug filestore = 20
debug journal = 10
would be ideal.
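For example, roughly this in the [osd] section of ceph.conf on the test nodes (assuming that's where your debug settings live), followed by an osd restart so they take effect:
[osd]
    debug osd = 20
    debug ms = 1
    debug filestore = 20
    debug journal = 10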
My understanding of the process is:
- write a bunch of data inside the vm
- kill + restart ceph-osd out from underneath
- verify the written data inside the vm ---> discover zeros
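A minimal sketch of that verification step (the filename and the 8 KiB block size are just placeholders for whatever the test writes) would be a scan for all-zero blocks, e.g.:
perl -e 'open(F,"<","testfile") or die $!; while (read(F,$b,8192)) { printf "all-zero block at offset %d\n", tell(F)-length($b) if $b !~ /[^\0]/ }'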
One theory is that this is the read/write reordering problem with 'rbd writeback window' + recovery.
#9 Updated by Oliver Francke over 11 years ago
- File cephnode3.crash-with-debug.0.44.1.log.bz2 added
Hi,
I decided to upgrade to "latest-n-greatest" in the test-cluster, to make sure that, if I hit the error again, we are not dealing with known errors. But: after a "ceph osd repair 1" or "ceph pg repair 0.69" the osd tends to crash.
Logfile with debugging is attached.
Not sure if I should continue my testing at this point?!
Kind regards,
Oliver.
#10 Updated by Sage Weil over 11 years ago
Okay, I suspect this is actually bug #2164, which was causing the _ xattr to get lost when ceph-osd restarts on non-btrfs. Running the fix through some final testing before we push it.
This may explain the rbd-visible behavior, too. From reading the code, it looks like the OSD will reply with -ENODATA, and librbd will return that to the read_iterate() caller... not sure what qemu will do with it... Josh?
#11 Updated by Oliver Francke over 11 years ago
Hi *,
I needed a couple of runs, but have now managed to provide some 81 MiB/97 MiB osd.X.log files in which, somewhere in between, sh.t happened. That is: I had 4 VMs running, each writing 8 KiB pages into a growing file, and the verification now shows a couple of 4 MiB holes within the file. Some data loss even at the end.
I did my own logging of what I wrote and when, using gettimeofday, so microsecond granularity. All nodes/VMs are ntp-synchronized, so perhaps this is useful to narrow down the problem.
@Sage: I put the logfiles in root@fcmsnode3:osd.x-...
Here the relevant part from one logfile:
20120330 110203.287078 block= 10527 seekpos= 86237184 BLOCKSIZE= 8192 nread= 8192
and then ^@ = NULL = empty but readable up to:
20120330 110203.595194 block= 11039 seekpos= 90431488 BLOCKSIZE= 8192 nread= 8192
that is 512 * 8192 = 4MiB filled with NULLs.
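The seek positions say the same thing:
perl -e 'print 90431488 - 86237184, "\n"'
4194304
i.e. exactly one full 4 MiB rbd object.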
Blocks 10527 + 10528 were written successfully at:
20120330 105033.665610 filling block 10527/128000, seekpos=86237184
20120330 105033.665731 filling block 10528/128000, seekpos=86245376
so this should be the pointer into the corresponding osd-logfiles.
The outages were produced by "ceph pg repair 0.69"; I waited until the monitor became aware of the outage and marked the osd down, then restarted it... this 2-3 times...
Hope this helps,
Oliver.
#12 Updated by Oliver Francke over 11 years ago
Here the remaining timestamps from the other VM's with bad blocks:
VM-2:
20120330 105139.579830 filling block 17177/128000, seekpos=140713984
VM-3:
20120330 105032.316980 filling block 9314/128000, seekpos=76300288
VM-4:
20120330 105033.361560 filling block 18899/128000, seekpos=154820608
so three similar timestamps from VMs 1, 3 and 4.
Best regards,
Oliver.
#13 Updated by Sage Weil over 11 years ago
Please build the current git stable branch, which includes 41f84fac1ae4b4c72bf9bfe07614c4066c916fd1. The version should be v0.44.1-1-g41f84fa. And then try to reproduce the process of corrupting things.
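For example, from an existing checkout of the ceph repository that would be something like:
git fetch
git reset --hard origin/stable
git describe
with git describe reporting v0.44.1-1-g41f84fa.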
Note that this doesn't fix the already-broken cluster; it should prevent a new corruption. So either use a fresh pool, or wipe the test cluster first.
Please let us know if that behaves better!
#14 Updated by Sage Weil over 11 years ago
- Status changed from Need More Info to 7
#15 Updated by Oliver Francke over 11 years ago
Hi Sage,
here we go again, with ceph-0.44.1-1-g41f84fa
One bad file with following infos:
20120402 171642.122776 block= 76506 seekpos= 626737152 BLOCKSIZE= 8192 nread= 8192
mangled up to:
20120402 171642.176825 block= 76891 seekpos= 629891072 BLOCKSIZE= 8192 nread= 8192
so around 3080 KiB filled with NULLs.
Logfiles osd.[01].log-ceph-0.44.1-1-g41f84fa.bz2 are on their way to root@fcmsnode3:...
What else do you need?
Kind regards,
Oliver.
#16 Updated by Sage Weil over 11 years ago
Hi Oliver,
I have two things to try:
- 'rbd writeback window = 0'. I know it's not what you want to run, but there is a known problem with the window and osd restart in certain situations, and I want to know if that is what is happening here.
- 'filestore fiemap threshold = 8000000' in the [osd] sections (and restart the osds). This will rule out the possibility of a FIEMAP problem with xfs.
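For reference, assuming both are set via ceph.conf in your setup (the writeback window may instead be passed in the qemu drive string), that would look roughly like:
[client]
    rbd writeback window = 0

[osd]
    filestore fiemap threshold = 8000000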
Thanks!
#17 Updated by Oliver Francke over 11 years ago
Well Sage,
it's harder to reproduce these days, because I think the current version has made "something more stable"(tm).
After 10 runs or so I have two VMs with multiple "holes" in them...
So the logfiles provided on Monday do not show anything? That would be quite counterproductive... Should the next test runs be made with or without debug logs enabled?
The rbd_writeback_window... well, IIRC a Josh Pieper wrote to the ceph-devel list about a similar complaint from a Windows VM, where he loses files from some test scripts... without the caching the problem is gone, but everything is awfully slow. If you/Josh D. suspect it's a reordering problem or something like that, I would perhaps need another 20 runs to see that it's running fine... or do I need 30? And why are there no hints in the logfiles? ;)
You see where I'm coming from; I will start off the next runs with the "fiemap" settings...
Thnx and best regards,
Oliver.
#18 Updated by Oliver Francke over 11 years ago
Whew, that was fast,
after the second run I had some errors in one file with:
[osd]
filestore fiemap threshold = 8000000
setting applied to both OSDs (not in the osd.X section, right?)
Kind regards,
Oliver.
#19 Updated by Sage Weil over 11 years ago
Oliver Francke wrote:
Whew, that was fast,
after second run I had some errors in one file with:
[osd]
filestore fiemap threshold = 8000000
setting applied to both OSDs (not in the osd.X section, right?)
Is that with 'rbd writeback window = 0' as well? Can you put the logs somewhere?
#20 Updated by Oliver Francke over 11 years ago
Hi Sage,
I was talking about the verbose logfiles from Monday. TBH, I don't expect BadThings without "rbd_writeback_window"; it has now run 5 times without issues... The logfiles are on root@fcmsnode3...
The recent NULLs came after setting the FIEMAP, but without verbose logs.
Next steps now welcome ;)
Regards,
Oliver.
#21 Updated by Sage Weil over 11 years ago
Oliver Francke wrote:
Hi Sage,
I was talking about the verbose logfiles from Monday. TBH, I don't expect BadThings without "rbd_writeback_window"; it has now run 5 times without issues... The logfiles are on root@fcmsnode3...
If that's the case then we're in the clear... the writeback window issue is a known problem and will be obsolete in 0.46 with the new caching code.
The recent NULLs came after setting the FIEMAP, but without verbose logs.
Next steps now welcome ;)
That was with writeback window still enabled?
If you can reproduce the corruption with writeback window = 0, that is what we want to look at carefully.
#22 Updated by Oliver Francke over 11 years ago
Uhm...
... I thought, we were talking about the same issue since the very beginning... corruption of .rbd-blocks... And now it's all clear? sigh, what a waste of time I put into this issue.
Are you able to provide a patch already to be tested by my evil script? ( That's the only advantage I see. I'm able to reproduce things now in a test-bed...)
Regards,
Oliver.
#23 Updated by Sage Weil over 11 years ago
Oliver Francke wrote:
Uhm...
... I thought, we were talking about the same issue since the very beginning... corruption of .rbd-blocks... And now it's all clear? sigh, what a waste of time I put into this issue.
We are talking about the same issue; I just need to keep straight exactly which configuration settings were active when you have seen corruption. To summarize, you've seen it:
- with rbd writeback window != 0
- with and without the latest patch in stable
- with and without the rbd fiemap threshold adjustment.
And (can you please confirm that) you haven't yet seen it
- with rbd writeback window = 0
?
Are you able to provide a patch already to be tested by my evil script? ( That's the only advantage I see. I'm able to reproduce things now in a test-bed...)
That's my goal, as soon as I can confirm we're not hitting the writeback window issue. (Being able to reproduce this is huge, I'm extremely grateful for that!)
#24 Updated by Sage Weil over 11 years ago
Ok, my attempts to parse the log to find out-of-order replies are quickly snowballing (complexity of dropped replies due to peering, etc.). Can you please test against the wip-osd-reorder branch, v0.44.1-2-g23f95a9, with rbd writeback window still enabled, and see if you still see the problem? <fingers crossed that it goes away>
Thanks!
#25 Updated by Oliver Francke over 11 years ago
Hi Sage and Happy easter,
yesterday I had some "luck" after 10 tries....
Here is what I have for you:
first mangled block at:
20120408 164944.800905 block= 56123 seekpos= 459759616 BLOCKSIZE= 8192 nread= 8192 md5sum=
and then NULL's for exactly 2MiB.
The block was written at approximately:
20120407 174903.815980 filling block 56123/128000, seekpos=459759616
At this point an osd restart seems to have occurred:
2012-04-07 17:49:01.111170 7f0cd3f9f700 filestore(/data/osd1) sync_entry commit took 0.292351, interval was 4.828282
2012-04-07 17:49:01.111234 7f0cd3f9f700 filestore(/data/osd1) sync_entry committed to op_seq 12791086
2012-04-07 17:49:01.111612 7f0cd722a780 journal close /dev/sdb1
2012-04-07 17:49:01.113171 7f0cd722a780 filestore(/data/osd1) test_mount basedir /data/osd1 journal /dev/sdb1
2012-04-07 17:49:01.113222 7f0cd722a780 -- 0.0.0.0:6800/6674 messenger.start
2012-04-07 17:49:01.113232 7f0cd722a780 -- 0.0.0.0:6800/6674 accepter.start
2012-04-07 17:49:01.113270 7f0cd722a780 -- :/0 messenger.start
2012-04-07 17:49:01.113288 7f0cd722a780 -- 0.0.0.0:6804/6674 messenger.start
2012-04-07 17:49:01.113295 7f0cd722a780 -- 0.0.0.0:6804/6674 accepter.start
2012-04-07 17:49:01.113334 7f0cd722a780 -- 0.0.0.0:6802/6674 messenger.start
2012-04-07 17:49:01.113342 7f0cd722a780 -- 0.0.0.0:6802/6674 accepter.start
2012-04-07 17:49:01.113486 7f0cd722a780 osd.1 0 mounting /data/osd1 /dev/sdb1
2012-04-07 17:49:01.113496 7f0cd722a780 filestore(/data/osd1) basedir /data/osd1 journal /dev/sdb1
2012-04-07 17:49:01.113533 7f0cd722a780 filestore(/data/osd1) mount FIEMAP ioctl is NOT supported
and then it is almost quiet up to 17:49:05, so the replay, and with it the mangling, seems to have fallen into this period.
The corresponding logfiles you will find on root@fcmsnode3:2178 in a while...
Good luck and best regards,
Oliver.
#26 Updated by Sage Weil over 11 years ago
The missing piece of information is mapping the file offset to a block device offset. Can you, inside the VM,
wget http://ceph.com/qa/fiemap.c
wget http://ceph.com/qa/fiemap.h
wget http://ceph.com/qa/linux_fiemap.h
gcc fiemap.c -o fiemap
./fiemap /path/to/file > /tmp/foo.txt
and stick that info somewhere I can get it?
Thanks!
#27 Updated by Oliver Francke over 11 years ago
Hi Sage,
just in case yesterday's reply did not reach you:
--- 8-< ---
Good morning,
it's already there:
root@fcmsnode3:2178/fiemap.tmp.woZHE1V8nK
That's from inside the VM from the tmp-file.
Best regards,
Oliver.
--- 8-< ---
Oliver.
#28 Updated by Sage Weil over 11 years ago
The good news is I see the problem. The bad news is it's the exact bug we thought we had fixed. The other good news is we're almost done with a torture test designed to validate exactly this code.
I'll have a fix to test shortly.
#29 Updated by Oliver Francke over 11 years ago
Well Sage,
I have a torture-test already :-D
OK, so it's independent of yours, and that's good. It sounds like we are on the right track now, both heading in the same direction ;)
Thnx and best regards,
Oliver.
#30 Updated by Sage Weil over 11 years ago
The most recent occurrence has been confirmed to be a replay issue with non-btrfs filesystems. The wip-guard branch fixes all of those known bugs, based on the tester that Joao has been working on. Can you retest with the latest wip-osd-reorder branch?
git fetch
git reset --hard origin/wip-osd-reorder
git describe
The version should be v0.44.2-184-g46c1b86.
If we can confirm that this problem "goes away" with that commit, I am satisfied that the old bugs are fixed. At which point we should move on to torturing the new rbd caching code which replaces 'rbd writeback window'. That has been merged into the latest master and is slated for 0.46. (In fact, even if you can still break wip-osd-reorder, we'd be interested in hearing whether you can break the current master!)
#31 Updated by Oliver Francke over 11 years ago
Hi Sage,
here are my notes: after almost 40 tests no bad things happened, only one minor hiccup, where the rbd header remained in the pool...
So, what's next, rbd_cache_enable, yeah.
Just tried with v0.45-207-g3053e47 and an entry ala:
virtio0: rbd:data/vm-905-disk-1.rbd:rbd_cache_enabled=1
instead of:
virtio0: rbd:data/vm-905-disk-1.rbd:rbd_writeback_cache=....
but to no avail.
What do I miss here? Tried with:
v0.45-207-g3053e47
Thnx in @vance,
Oliver.
#32 Updated by Oliver Francke over 11 years ago
Meanwhile I have continued to test...:
I noticed some negative degradation numbers:
2012-04-18 14:43:37.282634 pg v128104: 352 pgs: 347 active+clean, 5 active+recovering; 58301 MB data, 113 GB used, 246 GB / 359 GB avail; -90/31296 degraded (-0.288%)
2012-04-18 14:43:38.309982 pg v128105: 352 pgs: 350 active+clean, 2 active+recovering; 58141 MB data, 113 GB used, 246 GB / 359 GB avail; -70/31216 degraded (-0.224%)
2012-04-18 14:43:42.285829 pg v128106: 352 pgs: 350 active+clean, 2 active+recovering; 57707 MB data, 111 GB used, 248 GB / 359 GB avail; -70/30994 degraded (-0.226%)
and, for the second time, an image did not get completely removed:
(before the next message an "rbd rm data/907-testdisk.rbd" was performed)
2012-04-18 14:43:42.001916 7f61ccee2760 librbd: rbd image header 907-testdisk.rbd.rbd already exists
create error: (17) File exists
Error creating data/907-testdisk.rbd
looks like the "rm" was delayed, so the "create" got onto this error?
Any ideas?
Kind regards,
Oliver.
#33 Updated by Oliver Francke over 11 years ago
- File osd.1-907-testdisk-exists.log.bz2 added
Here is a sequence copy-n-pasted:
rbd rm data/905-testdisk.rbd
Removing image: 100% complete...done.
rbd create --size 2048 data/905-testdisk.rbd
rbd rm data/906-testdisk.rbd
Removing image: 100% complete...done.
rbd create --size 2048 data/906-testdisk.rbd
2012-04-18 15:40:55.321093 7fbdcae3d760 librbd: rbd image header 906-testdisk.rbd.rbd already exists
create error: (17) File exists
Error 0 creating data/906-testdisk.rbd
So now I have stopped testing. I have attached two logfiles from the earlier case where the delete of 907-testdisk.rbd failed...
Hope it helps,
Oliver.
#34 Updated by Oliver Francke over 11 years ago
- File osd.0-907-testdisk-exists.log.bz2 added
second logfile here, sorry.
#35 Updated by Sage Weil over 11 years ago
Oliver Francke wrote:
Hi Sage,
here are my notes: after almost 40 tests no bad things happened, only one minor hiccup, where the rbd header remained in the pool...
Yay, great news! Looks like the mystery is solved.
So, what's next, rbd_cache_enable, yeah.
Just tried with v0.45-207-g3053e47 and an entry ala:
virtio0: rbd:data/vm-905-disk-1.rbd:rbd_cache_enabled=1
instead of:
virtio0: rbd:data/vm-905-disk-1.rbd:rbd_writeback_cache=....
but to no avail.
What do I miss here? Tried with:
v0.45-207-g3053e47
'rbd cache = true'
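In the qemu drive line that would presumably be spelled with underscores, e.g. something like:
virtio0: rbd:data/vm-905-disk-1.rbd:rbd_cache=true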
#36 Updated by Sage Weil over 11 years ago
Oliver Francke wrote:
Here is a sequence copy-n-pasted:
rbd rm data/905-testdisk.rbd
Removing image: 100% complete...done.
rbd create --size 2048 data/905-testdisk.rbd
rbd rm data/906-testdisk.rbd
Removing image: 100% complete...done.
rbd create --size 2048 data/906-testdisk.rbd
2012-04-18 15:40:55.321093 7fbdcae3d760 librbd: rbd image header 906-testdisk.rbd.rbd already exists
create error: (17) File exists
Error 0 creating data/906-testdisk.rbd
Is it possible there is some other user, or the logs are from the wrong cluster?
I see:
- client.13507 deletes 906-testdisk.rbd (successfully)
- client.13508 creates 906-testdisk.rbd (presumably successfully)
I went looking for ways that 13508 could have gotten that error, and it is only generated at the beginning of the create sequence, before it does some subsequent work... and I've verified that the subsequent work was done.
What commit are you running? Maybe I'm looking at the wrong version of the code.
#37 Updated by Sage Weil over 11 years ago
- Status changed from 7 to Resolved
Moved this new issue to #2311, and resolving this bug. Hooray!
#38 Updated by Oliver Francke over 11 years ago
Hi Sage,
sorry, I was not clear enough. The logfiles provide information for "907-testdisk.rbd...", not "906...".
The version was from the continued tests with the former 0.44-2 wip-reorder...
Hope it helps,
Oliver.