Bug #2178 (closed): rbd: corruption of first block

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: librbd
Target version: -
% Done: 0%
Source: Development

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #2164: osd: scrub missing _, snapset attrs (Resolved, 03/11/2012)

Actions #1

Updated by Oliver Francke about 12 years ago

Here is one of many, where the header is missing:

--- 8-< ---
fcms@fcmsnode3:~$ rbd ls 1320396354
vm-451-disk-1.rbd
fcms@fcmsnode3:~$ rbd info 1320396354/vm-451-disk-1.rbd
2012-03-15 09:58:48.090190 7f58cc1f3760 librbd: Error reading header: (2) No such file or directory
error opening image vm-451-disk-1.rbd: (2) No such file or directory
--- 8-< ---

There is only one image, so the prefix is easily detectable as "rb.0.0."
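(As a cross-check - assuming the rados CLI is reachable from that node - listing the pool's objects directly should confirm the prefix; something like "rados -p 1320396354 ls | sort | head" ought to show objects named rb.0.0.*.)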

From the "ceph osd dump | grep 1320396354" I get:

--- 8-< ---
pool 97 '1320396354' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 lpg_num 0 lpgp_num 0 last_change 349 owner 18446744073709551615
--- 8-< ---

The osd.2.log now shows a bunch of:

--- 8-< ---

2012-03-15 09:58:44.686768 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing 35a41435/rb.0.0.000000000078/head
2012-03-15 09:58:44.686773 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing d27de435/rb.0.0.00000000048e/head
2012-03-15 09:58:44.686779 7ff1cd48f700 log [ERR] : 97.5 osd.4: soid 75619735/rb.0.0.000000000221/head size 2052096 != known size 4194304
2012-03-15 09:58:44.686785 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing 3a874a45/rb.0.0.000000000222/head
2012-03-15 09:58:44.686791 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing 73768555/rb.0.0.00000000023b/head
.
.
.
2012-03-15 09:58:44.687147 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing ae19a0d5/rb.0.0.0000000005e5/head
2012-03-15 09:58:44.687152 7ff1cd48f700 log [ERR] : 97.5 osd.4: soid 387e23d5/rb.0.0.000000000231/head size 1355776 != known size 4194304
2012-03-15 09:58:44.687158 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing ea2a44d5/rb.0.0.00000000069e/head
.
.
.
2012-03-15 09:58:44.687612 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing 533fafd/rb.0.0.000000000526/head
2012-03-15 09:58:44.687618 7ff1cd48f700 log [ERR] : 97.5 osd.4 missing cc8b7dfd/rb.0.0.0000000006ff/head
2012-03-15 09:58:44.687623 7ff1cd48f700 log [ERR] : 97.5 scrub 151 missing, 3 inconsistent objects
2012-03-15 09:58:44.688311 7ff1cd48f700 log [ERR] : 97.5 scrub stat mismatch, got 158/4 objects, 0/0 clones, 647046656/9236480 bytes.
2012-03-15 09:58:44.688324 7ff1cd48f700 log [ERR] : 97.5 scrub 1 errors
2012-03-15 09:58:45.288066 7ff1cdc90700 log [INF] : 1.65 scrub ok

--- 8-< ---

... and from other pools, too...
Additionally I pulled the first block from the image; it looks almost identical to
an intact one, but on an intact one I can run "fdisk -l", which might not be
evidence of anything... I'm not sure yet.

Healthy:

--- 8-< ---
root@fcmsnode3:/home/fcms# fdisk -l rb.0.c.000000000000
You must set cylinders.
You can do this from the extra functions menu.

Disk rb.0.c.000000000000: 0 MB, 0 bytes
255 heads, 63 sectors/track, 0 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0009851f

Device Boot      Start         End      Blocks   Id  System
rb.0.c.000000000000p1 * 1 6375 51198975+ 83 Linux
Partition 1 has different physical/logical endings:
phys=(1023, 254, 63) logical=(6374, 26, 51)

--- 8-< ---

not so healthy:

--- 8-< ---

root@fcmsnode3:/home/fcms# fdisk -l vm-451-rb.0.0.000000000000.rbd
root@fcmsnode3:/home/fcms#

--- 8-< ---

Some information is already missing in this block... and the repeated
question: why is there a change in the very first block at all... sigh

So, if I now create a header and put it back into the pool, I should be able to
export the image and then see via testdisk/fsck & friends how much data I can
save for the customer...

Some more ideas and clues are very welcome,

Oliver.

Actions #2

Updated by Oliver Francke about 12 years ago

Hi *,

any update on this topic? Three of us have been working for hours and days to rescue as many images as possible. Most of them have - at least - corrupted first blocks, judging from "initramfs", "grub-rescue" and "fsck" errors.
Well, we cannot help those customers, but if we could identify this kind of error, it would raise acceptance again.

Thnx n regards,

Oliver.

Actions #3

Updated by Josh Durgin about 12 years ago

I looked at the block you attached, and compared it to the first 4MiB of my desktop's hard drive. It looks like it contains an MBR, but not a partition table. The output from gdisk seems to indicate that the actual partition table is in another object:

$ gdisk -l vm-451-rb.0.0.000000000000.rbd 
GPT fdisk (gdisk) version 0.6.14

Error seeking to or reading logical partition data from 39847934!
Aborting!
Partition table scan:
  MBR: MBR only
  BSD: not present
  APM: not present
  GPT: not present

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format.
***************************************************************

Warning! Secondary partition table overlaps the last partition by
39837736 blocks!
You will need to delete this partition or resize it in another utility.
Disk vm-451-rb.0.0.000000000000.rbd: 8185 sectors, 4.0 MiB
Logical sector size: 512 bytes
Disk identifier (GUID): A5909945-8E64-4170-881F-9A8045C60899
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 8151
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        39845887   19.0 GiB    0700  Linux/Windows data

This is similar to my sda, which contains the MBR, but stores the partition table on the first partition:

$ gdisk -l /dev/sda
GPT fdisk (gdisk) version 0.6.14

Partition table scan:
  MBR: MBR only
  BSD: not present
  APM: not present
  GPT: not present

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format.
***************************************************************

Disk /dev/sda: 156250000 sectors, 74.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 1B9FC214-F825-4B14-8759-3C74F61F5114
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 156249966
Partitions will be aligned on 2048-sector boundaries
Total free space is 5965 sectors (2.9 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048       149995519   71.5 GiB    0700  Linux/Windows data
   5       149997568       156248063   3.0 GiB     8200  Linux swap

$ gdisk -l /dev/sda1
GPT fdisk (gdisk) version 0.6.14

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries.
Disk /dev/sda1: 149993472 sectors, 71.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 3ED0C07D-011D-4372-8F0C-AE2105E9FC9B
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 149993438
Partitions will be aligned on 2048-sector boundaries
Total free space is 149993405 sectors (71.5 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name

I tried expanding the first block to 20G, and gdisk got further, but stopped when the signature in the partition table (at sector 39847934) was all zeroes:

$ truncate -s 20G vm-451-rb.0.0.000000000000.rbd
$ gdisk -l vm-451-rb.0.0.000000000000.rbd
GPT fdisk (gdisk) version 0.6.14

MBR signature in logical partition invalid; read 0x0000, but should be 0xAA55
Partition table scan:
  MBR: MBR only
  BSD: not present
  APM: not present
  GPT: not present

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format.
***************************************************************

Disk vm-451-rb.0.0.000000000000.rbd: 41943040 sectors, 20.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): BA326ACF-590A-49B0-890F-75FCA4E6344D
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 41943006
Partitions will be aligned on 2048-sector boundaries
Total free space is 2099133 sectors (1.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        39845887   19.0 GiB    0700  Linux/Windows data

This means the unbootable guests could be caused by missing objects later in the image, since a missing object is treated as zeroes by rbd.

Do other first objects have the same behavior as this one? That is, do they have MBRs pointing to other areas of the disk that might be missing objects?

Actions #4

Updated by Oliver Francke about 12 years ago

Hi Josh,

thanks for taking the time to investigate this... And yes, many others show the same behaviour. Is "the next block" the next in the numeric sequence? That is ...0 -> ...1 ( or whatever comes next)?

Thnx n regards,

Oliver.

Actions #5

Updated by Josh Durgin about 12 years ago

The next object is whatever the MBR points to. You can find the object name from the sector offset that gdisk gives you with:

perl -e 'printf "BLOCK_NAME_PREFIX.%012x\n", ((SECTOR_OFFSET * 512) / (4 * 1024 * 1024))'

In this case, BLOCK_NAME_PREFIX was rb.0.0, and gdisk told us the sector offset ("Error seeking to or reading logical partition data from 39847934!"), so the logical partition header would be in the object named rb.0.0.000000001300.
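A quick sanity check of that mapping - a minimal Python sketch of the same arithmetic, not part of any rbd tooling, assuming the default 4 MiB object size:

# map a sector offset (as reported by gdisk) to the rbd object that holds it
def rbd_object_for_sector(prefix, sector, object_size=4 * 1024 * 1024):
    return "%s.%012x" % (prefix, (sector * 512) // object_size)

print(rbd_object_for_sector("rb.0.0", 39847934))   # -> rb.0.0.000000001300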

Actions #6

Updated by Oliver Francke about 12 years ago

Good morning ;)

meanwhile I have not been lazy. I've managed - with the current setup in the test cluster - to produce "inconsistencies" in the PGs, as follows:

1. write a file sequentially (one file in each VM, 4 VMs running in total)
2. stop/start OSDs while the file is being written
3. after everything is synced, there are "holes" somewhere in the file, and not at the end (a minimal sketch of such a write/verify script is shown below).
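A hypothetical minimal sketch of such a sequential write/verify tester (this is not Oliver's actual script; the block size and block count match the numbers in the logs further down, while the file path is made up):

--- 8-< ---
import os, sys, hashlib

BLOCKSIZE = 8192            # 8 KiB blocks, as in the verification logs below
NBLOCKS = 128000            # ~1000 MiB file, as in "block 10527/128000" below
PATH = "/root/testfile"     # hypothetical path inside the VM

def pattern(i):
    # deterministic, non-zero 8 KiB payload for block i
    return hashlib.md5(("block-%d" % i).encode()).digest() * (BLOCKSIZE // 16)

def write_phase():
    with open(PATH, "wb") as f:
        for i in range(NBLOCKS):
            f.write(pattern(i))
            f.flush()
            os.fsync(f.fileno())   # push each block out so OSD restarts hit in-flight writes

def verify_phase():
    with open(PATH, "rb") as f:
        for i in range(NBLOCKS):
            data = f.read(BLOCKSIZE)
            if data != pattern(i):
                print("bad block=%d seekpos=%d all_zero=%s"
                      % (i, i * BLOCKSIZE, data == b"\0" * BLOCKSIZE))

if __name__ == "__main__":
    write_phase() if sys.argv[1:] == ["write"] else verify_phase()
--- 8-< ---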

@Sage Weil: we have another 2 images from customers with "no header" problem, would be cool if you'd have time to have a look at them 8-)

Kind regards,

Oliver.

Actions #7

Updated by Oliver Francke about 12 years ago

Well,

one more comment:
my guess would be that it has something to do with expansion of the "sparse file" while new files are being written, which is what my test script currently does.

Best regards,

Oliver.

Actions #8

Updated by Sage Weil about 12 years ago

The next step is to reproduce the corruption on the test cluster with logs:

debug osd = 20
debug ms = 1
debug filestore = 20
debug journal = 10

would be ideal.

My understanding of the process is:
- write a bunch of data inside the vm
- kill + restart ceph-osd out from underneath
- verify the written data inside the vm ---> discover zeros

One theory is that this is the read/write reordering problem with 'rbd writeback window' + recovery.

Actions #9

Updated by Oliver Francke about 12 years ago

Hi,

I decided to upgrade the test cluster to "latest-n-greatest", to make sure that if I hit the error again we are not dealing with known errors. But: after a "ceph osd repair 1" or "ceph pg repair 0.69" the osd tends to crash.
Logfile with debugging is attached.
Not sure if I should continue my testing at this point?!

Kind regards,

Oliver.

Actions #10

Updated by Sage Weil about 12 years ago

Okay, I suspect this is actually bug #2164, which was causing the _ xattr to get lost when ceph-osd restarts on non-btrfs. Running the fix through some final testing before we push it.

This may explain the rbd-visible behavior, too. From reading the code, it looks like the OSD will reply with -ENODATA, and librbd will return that to the read_iterate() caller... not sure what qemu will do with it... Josh?

Actions #11

Updated by Oliver Francke about 12 years ago

Hi *,

I needed a couple of runs, but have now managed to provide some 81 MiB/97 MiB osd.X.log files covering the window in which things went wrong. That is: I had 4 VMs running, each writing 8 KiB pages into a growing file, and the verification now shows a couple of 4 MiB holes within the file. There is some data loss even at the end.
I did my own logging of what was written and when, using gettimeofday, so microsecond granularity. All nodes/VMs are NTP-synchronized, so perhaps this is useful for narrowing down the problem.

@Sage Weil: I put the logfiles in root@fcmsnode3:osd.x-...

Here the relevant part from one logfile:

20120330 110203.287078 block= 10527 seekpos=    86237184 BLOCKSIZE=    8192 nread=    8192
and then ^@ = NULL = empty but readable up to:
20120330 110203.595194 block= 11039 seekpos= 90431488 BLOCKSIZE= 8192 nread= 8192
that is 512 * 8192 = 4MiB filled with NULLs.

Blocks 10527 and 10528 were written successfully at:
20120330 105033.665610 filling block 10527/128000, seekpos=86237184
20120330 105033.665731 filling block 10528/128000, seekpos=86245376

so this should be the pointer into the corresponding osd logfiles.
The outages were produced by "ceph pg repair 0.69": I waited until the monitor became aware of the outage and marked the osd down, then restarted it... this 2-3 times...

Hope this helps,

Oliver.

Actions #12

Updated by Oliver Francke about 12 years ago

Here are the remaining timestamps from the other VMs with bad blocks:

VM-2:
20120330 105139.579830 filling block 17177/128000, seekpos=140713984

VM-3:
20120330 105032.316980 filling block 9314/128000, seekpos=76300288

VM-4:
20120330 105033.361560 filling block 18899/128000, seekpos=154820608

so there are 3 similar timestamps from VMs 1, 3 and 4.

Best regards,

Oliver.

Actions #13

Updated by Sage Weil about 12 years ago

Please build the current git stable branch, which includes 41f84fac1ae4b4c72bf9bfe07614c4066c916fd1. The version should be v0.44.1-1-g41f84fa. And then try to reproduce the process of corrupting things.

Note that this doesn't fix the already-broken cluster; it should prevent a new corruption. So either use a fresh pool, or wipe the test cluster first.

Please let us know if that behaves better!

Actions #14

Updated by Sage Weil about 12 years ago

  • Status changed from Need More Info to 7
Actions #15

Updated by Oliver Francke about 12 years ago

Hi Sage,

here we go again, with ceph-0.44.1-1-g41f84fa

One bad file, with the following info:

20120402 171642.122776 block= 76506 seekpos= 626737152 BLOCKSIZE= 8192 nread= 8192
mangled up to:
20120402 171642.176825 block= 76891 seekpos= 629891072 BLOCKSIZE= 8192 nread= 8192

so around 3080 KiB filled with NULLs.
Logfiles osd.[01].log-ceph-0.44.1-1-g41f84fa.bz2 are on their way to root@fcmsnode3:...

What else do you need?

Kind regards,

Oliver.

Actions #16

Updated by Sage Weil about 12 years ago

Hi Oliver,

I have two things to try:

- 'rbd writeback window = 0'. I know it's not what you want to run, but there is a known problem with the window and osd restart in certain situations, and I want to know if that is what is happening here.

- 'filestore fiemap threshold = 8000000' in the [osd] sections (and restart the osds). This will rule out the possibility of a FIEMAP problem with xfs.

Thanks!
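For clarity, a sketch of where these two settings might live in ceph.conf (the section placement is an assumption, not confirmed in this thread: 'rbd writeback window' is a client-side librbd option, while the fiemap threshold is an osd option):

[client]
    rbd writeback window = 0

[osd]
    filestore fiemap threshold = 8000000

(The osds need a restart for the [osd] part, and the qemu processes for the client-side part.)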

Actions #17

Updated by Oliver Francke about 12 years ago

Well Sage,

it's harder to reproduce these days, because I think the current version has made "something more stable"(tm).
After 10 runs or so I have two VMs with multiple "holes" in them...
So the logfiles provided on Monday do not show anything? That would be quite counterproductive... Should the next test runs be made with or without debug logs enabled?
The rbd_writeback_window.... well, IIRC a Josh Pieper wrote to the ceph-devel list about a similar complaint from a Windows VM, where he loses files from some test scripts... without the caching the problem is gone, but everything is awfully slow. If you/Josh D. suspect it's a reordering problem or something like that, I would perhaps need another 20 runs to see that it's running fine... or do I need 30? And why are there no hints in the logfiles? ;)

You see where I'm coming from; I will start off the next runs with the "fiemap" settings...

Thnx and best regards,

Oliver.

Actions #18

Updated by Oliver Francke about 12 years ago

Whew, that was fast,

after the second run I had some errors in one file with:
[osd]
filestore fiemap threshold = 8000000

The setting was applied to both OSDs (in the [osd] section, not in a per-osd.X section - right?)

Kind regards,

Oliver.

Actions #19

Updated by Sage Weil about 12 years ago

Oliver Francke wrote:

Whew, that was fast,

after the second run I had some errors in one file with:
[osd]
filestore fiemap threshold = 8000000

The setting was applied to both OSDs (in the [osd] section, not in a per-osd.X section - right?)

Is that with 'rbd writeback window = 0' as well? Can you put the logs somewhere?

Actions #20

Updated by Oliver Francke about 12 years ago

Hi Sage,

I was talking about the verbose logfiles from Monday. TBH, I don't expect BadThings without "rbd_writeback_window"; it has now run 5 times without issues... The logfiles are on root@fcmsnode3...
The recent NULLs came after setting the FIEMAP threshold, but without verbose logs.
Next steps now welcome ;)

Regards,

Oliver.

Actions #21

Updated by Sage Weil about 12 years ago

Oliver Francke wrote:

Hi Sage,

I was talking about the verbose logfiles from Monday. TBH, I don't expect BadThings without "rbd_writeback_window"; it has now run 5 times without issues... The logfiles are on root@fcmsnode3...

If that's the case then we're in the clear... the writeback window issue is a known problem and will be obsolete in 0.46 with the new caching code.

The recent NULLs came after setting the FIEMAP threshold, but without verbose logs.
Next steps now welcome ;)

Was that with the writeback window still enabled?

If you can reproduce the corruption with writeback window = 0, that is what we want to look at carefully.

Actions #22

Updated by Oliver Francke about 12 years ago

Uhm...

... I thought we had been talking about the same issue since the very beginning... corruption of .rbd blocks... And now it's all clear? Sigh, what a waste of the time I put into this issue.
Are you already able to provide a patch to be tested by my evil script? (That's the only advantage I see: I'm able to reproduce things now in a test bed...)

Regards,

Oliver.

Actions #23

Updated by Sage Weil about 12 years ago

Oliver Francke wrote:

Uhm...

... I thought we had been talking about the same issue since the very beginning... corruption of .rbd blocks... And now it's all clear? Sigh, what a waste of the time I put into this issue.

We are talking about the same issue; I just need to keep straight exactly which configuration settings were active when you saw corruption. To summarize, you've seen it:

- with rbd writeback window != 0
- with and without the latest patch in stable
- with and without the filestore fiemap threshold adjustment.

And (can you please confirm that) you haven't yet seen it
- with rbd writeback window = 0

?

Are you already able to provide a patch to be tested by my evil script? (That's the only advantage I see: I'm able to reproduce things now in a test bed...)

That's my goal, as soon as I can confirm we're not hitting the writeback window issue. (Being able to reproduce this is huge, I'm extremely grateful for that!)

Actions #24

Updated by Sage Weil about 12 years ago

Ok, my attempt to parse the log to find out-of-order replies is quickly snowballing (complexity of dropped replies due to peering, etc.). Can you please test against the wip-osd-reorder branch, v0.44.1-2-g23f95a9, with rbd writeback window still enabled, and see if you still see the problem? <fingers crossed that it goes away>

Thanks!

Actions #25

Updated by Oliver Francke about 12 years ago

Hi Sage and Happy easter,

yesterday I had some "luck" after 10 tries....

Here is what I have for you:
first mangled block at:

20120408 164944.800905 block= 56123 seekpos=   459759616 BLOCKSIZE=    8192 nread=    8192 md5sum=

and then NULLs for exactly 2 MiB.
The block was written at approximately:
20120407 174903.815980 filling block 56123/128000, seekpos=459759616
at this point an osd restart seems to have occurred:

2012-04-07 17:49:01.111170 7f0cd3f9f700 filestore(/data/osd1) sync_entry commit took 0.292351, interval was 4.828282
2012-04-07 17:49:01.111234 7f0cd3f9f700 filestore(/data/osd1) sync_entry committed to op_seq 12791086
2012-04-07 17:49:01.111612 7f0cd722a780 journal close /dev/sdb1
2012-04-07 17:49:01.113171 7f0cd722a780 filestore(/data/osd1) test_mount basedir /data/osd1 journal /dev/sdb1
2012-04-07 17:49:01.113222 7f0cd722a780 -- 0.0.0.0:6800/6674 messenger.start
2012-04-07 17:49:01.113232 7f0cd722a780 -- 0.0.0.0:6800/6674 accepter.start
2012-04-07 17:49:01.113270 7f0cd722a780 -- :/0 messenger.start
2012-04-07 17:49:01.113288 7f0cd722a780 -- 0.0.0.0:6804/6674 messenger.start
2012-04-07 17:49:01.113295 7f0cd722a780 -- 0.0.0.0:6804/6674 accepter.start
2012-04-07 17:49:01.113334 7f0cd722a780 -- 0.0.0.0:6802/6674 messenger.start
2012-04-07 17:49:01.113342 7f0cd722a780 -- 0.0.0.0:6802/6674 accepter.start
2012-04-07 17:49:01.113486 7f0cd722a780 osd.1 0 mounting /data/osd1 /dev/sdb1
2012-04-07 17:49:01.113496 7f0cd722a780 filestore(/data/osd1) basedir /data/osd1 journal /dev/sdb1
2012-04-07 17:49:01.113533 7f0cd722a780 filestore(/data/osd1) mount FIEMAP ioctl is NOT supported

and then it was almost quiet up to 17:49:05, so the replay - and with it the mangling - seems to have fallen into this period.

The corresponding logfiles you will find on root@fcmsnode3:2178 in a while...

Good luck and best regards,

Oliver.

Actions #26

Updated by Sage Weil about 12 years ago

The missing piece of information is mapping the file offset to a block device offset. Can you, inside the VM,

wget http://ceph.com/qa/fiemap.c
wget http://ceph.com/qa/fiemap.h
wget http://ceph.com/qa/linux_fiemap.h
gcc fiemap.c -o fiemap
./fiemap /path/to/file > /tmp/foo.txt

and stick that info somewhere I can get it?
Thanks!
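Once the fiemap output is available, the physical (block device) byte offsets it reports can be translated to rbd object names with the same arithmetic as in #5 above. A minimal Python sketch (hypothetical helper; the extent tuple layout and the example numbers are assumptions, 4 MiB is the default rbd object size):

# list the rbd objects backing one extent reported by the fiemap tool,
# given its physical device byte offset and its length
def objects_for_extent(prefix, physical_offset, length, objsize=4 * 1024 * 1024):
    first = physical_offset // objsize
    last = (physical_offset + length - 1) // objsize
    return ["%s.%012x" % (prefix, i) for i in range(first, last + 1)]

# made-up example: a 4 MiB extent starting at device byte offset 1073745920
print(objects_for_extent("rb.0.0", 1073745920, 4 * 1024 * 1024))
# -> ['rb.0.0.000000000100', 'rb.0.0.000000000101']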

Actions #27

Updated by Oliver Francke about 12 years ago

Hi Sage,

just in case the reply from yesterday did not reach you:

--- 8-< ---

Good morning,

it's already there:

root@fcmsnode3:2178/fiemap.tmp.woZHE1V8nK

That's from inside the VM from the tmp-file.

Best regards,

Oliver.

--- 8-< ---

Oliver.

Actions #28

Updated by Sage Weil about 12 years ago

The good news is I see the problem. The bad news is it's the exact bug we thought we fixed. The other good news is we're almost done with a torture test designed to validate exactly this code.

I'll have a fix to test shortly.

Actions #29

Updated by Oliver Francke about 12 years ago

Well Sage,

I have a torture test already :-D
OK, so it's independent of yours, and that's good. It sounds like we are on the right track now, both heading in the same direction ;)

Thnx and best regards,

Oliver.

Actions #30

Updated by Sage Weil about 12 years ago

The most recent occurrence has been confirmed to be a replay issue with non-btrfs filesystems. The wip-guard branch fixes all of those known bugs, based on the tester that Joao has been working on. Can you retest with the latest wip-osd-reorder branch?

git fetch
git reset --hard origin/wip-osd-reorder
git describe

The version should be v0.44.2-184-g46c1b86.

If we can confirm that this problem "goes away" with that commit, I am satisfied that the old bugs are fixed. At that point we should move on to torturing the new rbd caching code, which replaces 'rbd writeback window'. That has been merged into the latest master and is slated for 0.46. (In fact, even if you can still break wip-osd-reorder, we'd be interested in hearing whether you can break the current master!)

Actions #31

Updated by Oliver Francke about 12 years ago

Hi Sage,

here are my notes: after almost 40 tests no bad things happened, only one minor hiccup, where the rbd header remained in the pool...
So, what's next, rbd_cache_enable, yeah.
Just tried with v0.45-207-g3053e47 and an entry à la:
virtio0: rbd:data/vm-905-disk-1.rbd:rbd_cache_enabled=1
instead of:
virtio0: rbd:data/vm-905-disk-1.rbd:rbd_writeback_cache=....
but to no avail.
What am I missing here? Tried with:
v0.45-207-g3053e47

Thnx in @vance,

Oliver.

Actions #32

Updated by Oliver Francke about 12 years ago

Meanwhile I have continued testing...:

I noticed some negative degradation numbers:

2012-04-18 14:43:37.282634 pg v128104: 352 pgs: 347 active+clean, 5 active+recovering; 58301 MB data, 113 GB used, 246 GB / 359 GB avail; -90/31296 degraded (-0.288%)
2012-04-18 14:43:38.309982 pg v128105: 352 pgs: 350 active+clean, 2 active+recovering; 58141 MB data, 113 GB used, 246 GB / 359 GB avail; -70/31216 degraded (-0.224%)
2012-04-18 14:43:42.285829 pg v128106: 352 pgs: 350 active+clean, 2 active+recovering; 57707 MB data, 111 GB used, 248 GB / 359 GB avail; -70/30994 degraded (-0.226%)

and, for the second time, an image did not get completely removed:
(before the next message an "rbd rm data/907-testdisk.rbd" was performed)

2012-04-18 14:43:42.001916 7f61ccee2760 librbd: rbd image header 907-testdisk.rbd.rbd already exists
create error: (17) File exists
Error creating data/907-testdisk.rbd

looks like the "rm" was delayed, so the "create" got onto this error?
Any ideas?

Kind regards,

Oliver.

Actions #33

Updated by Oliver Francke about 12 years ago

Here is a sequence copy-n-pasted:

rbd rm data/905-testdisk.rbd
Removing image: 100% complete...done.
rbd create --size 2048 data/905-testdisk.rbd
rbd rm data/906-testdisk.rbd
Removing image: 100% complete...done.
rbd create --size 2048 data/906-testdisk.rbd
2012-04-18 15:40:55.321093 7fbdcae3d760 librbd: rbd image header 906-testdisk.rbd.rbd already exists
create error: (17) File exists
Error 0 creating data/906-testdisk.rbd

So now I have stopped testing. I should have attached two logfiles where the earlier delete of 907-testdisk.rbd failed...

Hope it helps,

Oliver.

Actions #35

Updated by Sage Weil about 12 years ago

Oliver Francke wrote:

Hi Sage,

here are my notes: after almost 40 tests no bad things happened, only one minor hiccup, where the rbd header remained in the pool...

Yay, great news! Looks like the mystery is solved.

So, what's next, rbd_cache_enable, yeah.
Just tried with v0.45-207-g3053e47 and an entry à la:
virtio0: rbd:data/vm-905-disk-1.rbd:rbd_cache_enabled=1
instead of:
virtio0: rbd:data/vm-905-disk-1.rbd:rbd_writeback_cache=....
but to no avail.
What am I missing here? Tried with:
v0.45-207-g3053e47

'rbd cache = true'
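Presumably that maps onto the qemu/Proxmox drive string by appending the option there instead, e.g. something along the lines of (the exact spelling in the drive string is an assumption here, not confirmed in this thread):

virtio0: rbd:data/vm-905-disk-1.rbd:rbd_cache=true

Setting 'rbd cache = true' in the [client] section of ceph.conf should have the same effect.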

Actions #36

Updated by Sage Weil about 12 years ago

Oliver Francke wrote:

Here is a sequence copy-n-pasted:

rbd rm data/905-testdisk.rbd
Removing image: 100% complete...done.
rbd create --size 2048 data/905-testdisk.rbd
rbd rm data/906-testdisk.rbd
Removing image: 100% complete...done.
rbd create --size 2048 data/906-testdisk.rbd
2012-04-18 15:40:55.321093 7fbdcae3d760 librbd: rbd image header 906-testdisk.rbd.rbd already exists
create error: (17) File exists
Error 0 creating data/906-testdisk.rbd

Is it possible there is some other user, or the logs are from the wrong cluster?

I see:
- client.13507 deletes 906-testdisk.rbd (successfully)
- client.13508 creates 906-testdisk.rbd (presumably successfully)

I went looking for ways that 13508 could have gotten that error, and it is only generated at the beginning of the create sequence, before it does some subsequent work... and I've verified that the subsequent work was done.

What commit are you running? Maybe I'm looking at the wrong version of the code.

Actions #37

Updated by Sage Weil about 12 years ago

  • Status changed from 7 to Resolved

Moved this new issue to #2311, and resolving this bug. Hooray!

Actions #38

Updated by Oliver Francke about 12 years ago

Hi Sage,

sorry, I was not clear enough. The logfiles provide information for "907-testdisk.rbd...", not "906...".
The version was from the continued tests with the earlier 0.44-2 wip-reorder...

Hope it helps,

Oliver.
