Project

General

Profile

Bug #19863

bluestore deferred write crc changes before write

Added by Sage Weil 7 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Immediate
Target version:
-
Start date:
05/04/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
kraken,jewel,hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

write is set up...


2017-05-04 14:43:09.742671 7f1b62817700 30 bluestore(/var/lib/ceph/osd/ceph-4) _dump_extent_map  0x0~2a2: 0x0~2a2 Blob(0x7f1b8fedb5e0 blob([0x15cfc4000~4000] mutable+csum+has_unused crc32c/0x1000 unused=0xfff0) use_tracker(0x4000 0x2a2) SharedBlob(0x7f1b8fedb6c0 sbid 0x0))
2017-05-04 14:43:09.742674 7f1b62817700 30 bluestore(/var/lib/ceph/osd/ceph-4) _dump_extent_map      csum: [717b54fd,0,0,0]
...
2017-05-04 14:43:09.742698 7f1b62817700 20 bluestore(/var/lib/ceph/osd/ceph-4) _do_write_small  deferred write 0x0~1000 of mutable Blob(0x7f1b8fedb5e0 blob([0x15cfc4000~4000] mutable+csum+has_unused crc32c/0x1000 unused=0xfff0) use_tracker(0x4000 0x2a2) SharedBlob(0x7f1b8fedb6c0 sbid 0x0)) at [0x15cfc4000~1000]
...
2017-05-04 14:43:09.755870 7f1b6a026700 10 bluestore(/var/lib/ceph/osd/ceph-4) _txc_state_proc txc 0x7f1b9f3262c0 kv_submitted
2017-05-04 14:43:09.755872 7f1b6a026700 20 bluestore(/var/lib/ceph/osd/ceph-4) _txc_committed_kv txc 0x7f1b9f3262c0
2017-05-04 14:43:09.755878 7f1b6a026700 20 bluestore(/var/lib/ceph/osd/ceph-4) _deferred_queue txc 0x7f1b9f3262c0 osr 0x7f1b89435dc0
2017-05-04 14:43:09.755882 7f1b6a026700 20 bluestore.DeferredBatch(0x7f1b8fce17a0) _discard  drop 10929 0x15cfc4000~1000
2017-05-04 14:43:09.755887 7f1b6a026700 20 bluestore.DeferredBatch(0x7f1b8fce17a0) prepare_write seq 10930 0x15cfc4000~1000 crc dbea3c2c

note hte crc is dbea...
later when we queue it,
2017-05-04 14:43:09.767407 7f1b6a026700 20 bluestore(/var/lib/ceph/osd/ceph-4) _deferred_try_submit 3 osrs, 16 txcs
2017-05-04 14:43:09.767409 7f1b6a026700 10 bluestore(/var/lib/ceph/osd/ceph-4) _deferred_submit osr 0x7f1b89435dc0 1 ios pending 
2017-05-04 14:43:09.767414 7f1b6a026700 20 bluestore(/var/lib/ceph/osd/ceph-4) _deferred_submit   seq 10930 0x15cfc4000~1000
2017-05-04 14:43:09.767417 7f1b6a026700 20 bluestore(/var/lib/ceph/osd/ceph-4) _deferred_submit write 0x15cfc4000~1000 crc dcca195e
2017-05-04 14:43:09.767423 7f1b6a026700 20 bdev(0x7f1b869a2e00 /var/lib/ceph/osd/ceph-4/block) aio_write 0x15cfc4000~1000 (direct)
2017-05-04 14:43:09.767427 7f1b6a026700 20 bdev(0x7f1b869a2e00 /var/lib/ceph/osd/ceph-4/block) aio_write rebuilding buffer to be aligned
2017-05-04 14:43:09.767428 7f1b6a026700 20 bdev(0x7f1b869a2e00 /var/lib/ceph/osd/ceph-4/block) _aio_log_start 0x15cfc4000~1000
2017-05-04 14:43:09.767430 7f1b6a026700  5 bdev(0x7f1b869a2e00 /var/lib/ceph/osd/ceph-4/block) aio_write 0x15cfc4000~1000 aio 0x7f1b95e20100

the crc has changed to dcca...

and then when we print the blob we have the original crc,

2017-05-04 14:43:09.757043 7f1b62817700 30 bluestore(/var/lib/ceph/osd/ceph-4) _dump_extent_map  0x0~2a2: 0x0~2a2 Blob(0x7f1b8fedb5e0 blob([0x15cfc4000~4000] mutable+csum+has_unused crc32c/0x1000 unused=0xfff0) use_tracker(0x4000 0x2a2) SharedBlob(0x7f1b8fedb6c0 sbid 0x0))
2017-05-04 14:43:09.757048 7f1b62817700 30 bluestore(/var/lib/ceph/osd/ceph-4) _dump_extent_map      csum: [dbea3c2c,0,0,0]

later when we read it,

2017-05-04 14:43:09.774723 7f1b62817700 -1 bluestore(/var/lib/ceph/osd/ceph-4) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xdcca195e, expected 0xdbea3c2c, device location [0x15cfc4000~1000], logical extent 0x0~1000, object #1:459186ea:::rbd_object_map.10252b0d8dbe:head#

/a/sage-2017-05-04_02:24:30-rbd-wip-snapset---basic-smithi/1098816

History

#1 Updated by Sage Weil 7 months ago

/a/sage-2017-05-05_22:43:52-rbd-wip-sage-testing---basic-smithi/1105005

#2 Updated by Sage Weil 7 months ago

  • Status changed from New to In Progress

both instances were rbd. trying to reproduce with some additional logging that hexdumps the buffers.

#3 Updated by Sage Weil 7 months ago

it's an objectmap object. the block in question looks like this:


00000000  0e 00 00 00 01 01 08 00  00 00 00 0a 00 00 00 00  |................|
00000010  00 00 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |..UUUUUUUUUUUUUU|
00000020  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000030  55 55 50 00 00 00 00 00  00 00 00 00 00 00 00 00  |UUP.............|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000b0  00 00 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |..UUUUUUUUUUUUUU|
000000c0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
000000d0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
000000e0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
000000f0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000100  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000110  55 55 40 00 00 00 00 00  00 00 00 00 00 00 00 00  |UU@.............|
00000120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000150  00 00 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |..UUUUUUUUUUUUUU|
00000160  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000170  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000180  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000190  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
000001a0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
000001b0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
000001c0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
000001d0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
000001e0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
000001f0  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000200  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000210  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000220  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000230  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000240  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000250  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000260  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000270  55 55 55 55 55 55 55 55  55 55 00 00 00 00 00 00  |UUUUUUUUUU......|
00000280  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000290  00 01 0c 00 00 00 91 1e  9c 74 01 00 00 00 68 74  |.........t....ht|
000002a0  e6 37 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |.7..............|
000002b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000ff0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001000

the diff before/after the corruption is
< 00000270  55 55 55 55 55 55 55 55  55 54 00 00 00 00 00 00  |UUUUUUUUUT......|
---
> 00000270  55 55 55 55 55 55 55 55  55 55 00 00 00 00 00 00  |UUUUUUUUUU......|

My guess is cls_rbd has two ops in flight and both are writing out bufferlist that point to the same buffer in memory. The earlier write changes "in flight" after the second is prepared and we only notice because of the crc mismatch.

Assuming that's correct, the fix is just to copy the changed block(s) to a new buffer before writing them. (The buffer will stick around indefinitely in the bluestore buffer cache.)

Alternatively cls_rbd could try to get clever by doing cow internally based on the raw buffers ref count, but I'm not sure it's worth the complexity/risk!

#4 Updated by Jason Dillaman 7 months ago

  • Project changed from Ceph to rbd
  • Assignee changed from Sage Weil to Jason Dillaman

#5 Updated by Jason Dillaman 7 months ago

  • Status changed from In Progress to Need Review
  • Backport set to kraken,jewel,hammer

#6 Updated by Sage Weil 7 months ago

  • Status changed from Need Review to Resolved

Also available in: Atom PDF