Project

General

Profile

Fix #5388

osd: localized reads (from replicas) reordered wrt writes

Added by Mike Bryant over 10 years ago. Updated about 4 years ago.

Status:
New
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm using hbase, with the hadoop-cephfs bindings, on top of a ceph 0.61 cluster.
I'm seeing instances where reading part of a file is returning all nulls, instead of the valid data.
This is showing up at the application level where it's trying to validate a magic number is there, failing, and crashing.

I've been running debugging the the hadoop bindings, and it appears to be doing the right thing, and just getting back nulls from the libcephfs layer.
The file it's reading incorrect data from is: ceph://null/hbase/tsdb/e50e991f7082c47e9172c5421af6e5f3/t/3847239684462394768

This is the interesting bit from the regionserver log:
2013-06-17 19:01:46,984 INFO org.apache.hadoop.fs.ceph.CephFileSystem: selectDataPool path=ceph://null/hbase/tsdb/e50e991f7082c47e9172c5421af6e5f3/.tmp/3566091379609591353 pool:repl=data:4 wanted=3
2013-06-17 19:01:47,172 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Bloom added to HFile (ceph://null/hbase/tsdb/e50e991f7082c47e9172c5421af6e5f3/.tmp/3566091379609591353): 4.5k, 3687/3841 (96%)
2013-06-17 19:01:47,233 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at ceph://null/hbase/tsdb/e50e991f7082c47e9172c5421af6e5f3/.tmp/3566091379609591353 to ceph://null/hbase/tsdb/e50e991f7082c47e9172c5421af6e5f3/t/3847239684462394768
2013-06-17 19:01:47,841 DEBUG org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream constructor: initializing stream with fh 2651 and file length 2597375
2013-06-17 19:01:47,841 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.seek: Seeking to position 2597315 on fd 2651
2013-06-17 19:01:47,841 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.read: Reading 8 bytes from fd 2651
2013-06-17 19:01:47,841 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.read: Reading 8 bytes from fd 2651: succeeded in reading 8 bytes
2013-06-17 19:01:47,841 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.read: 8 byte output: 0000000000000000
(This is the bit where it's reading the trailer, and it should be some bytes corresponding to TRABLK)
2013-06-17 19:01:47,842 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=ec02sv14.ocado.com,60020,1371491732490, load=(requests=264597, regions=132, usedHeap=2709, maxHeap=4062): Replay of HLog required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,\x00\x00fQ\xB1\xE7`\x00\x00\x01\x00\x0E\xC2\x00\x00\x03\x00\x00\xC1,1371473213321.e50e991f7082c47e9172c5421af6e5f3.
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1070)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:967)
at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:915)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:394)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:202)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:222)
Caused by: java.io.IOException: Trailer 'header' is wrong; does the trailer size match content?
at org.apache.hadoop.hbase.io.hfile.HFile$FixedFileTrailer.deserialize(HFile.java:1527)
at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readTrailer(HFile.java:885)
at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:819)
at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.loadFileInfo(StoreFile.java:1003)
at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:382)
at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:438)
at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:557)
at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:496)
at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:83)
at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1576)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1046)
... 5 more

For comparison this is an instance where it was fine:
2013-06-17 19:01:45,872 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.read: Reading 8 bytes from fd 2644: succeeded in rea
ding 8 bytes
2013-06-17 19:01:45,872 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.read: 8 byte output: 545241424c4b2224

When I look at the file it was reading from a fuse mountpoint, I can see the correct sequence of bytes.

History

#1 Updated by Sage Weil over 10 years ago

  • Subject changed from Reading data returns nulls to libcephfs: reads return null instead of recently written data
  • Status changed from New to Need More Info

Hi Mike,

I gather that the same data was just written by a different node in the cluster? And this is right near/at the end of file? The logs that we need to debug this are

debug ms = 1
debug client = 20

on the client side. Adding the above to [client] in the ceph.conf on the clients and ensuring they are logging (by default it's /var/log/ceph/ceph-ceph.client.$id.$pid.log but you may have to futz with permissions or paths or something in your environment). We'll then need to see the logs from both the writer(s) and the failing reader.

Thanks!

#2 Updated by Mike Bryant over 10 years ago

I've just reproduced it with those log levels.
There was 1 master, 1 regionserver.
So I think the writing and reading both happened on the regionserver.

The data is right near the end of the file (the trailer is 60 bytes long, the 8 it's choking on is at the start of that region)

I've uploaded the data to http://xelix.net/ceph/5388/, since there seems to be a filesize limit on the tracker.

#3 Updated by Sage Weil over 10 years ago

  • Assignee set to Sage Weil

#4 Updated by Sage Weil over 10 years ago

  • Status changed from Need More Info to 12

#5 Updated by Sage Weil over 10 years ago

flab:5388 08:57 PM $ grep -- -- 1000002d435 |grep osd_op
2013-06-18 17:01:16.972976 7f2df85f6700  1 -- 10.248.32.14:0/1956904746 --> 10.249.32.13:6800/61227 -- osd_op(client.43498.0:33052 1000002d435.00000000 [write 0~2087967] 0.c993c8ec snapc 1=[] localize_reads e186) v4 -- ?+0 0x7f2e402eb8b0 con 0x7f2e5c108920
2013-06-18 17:01:18.357697 7f2df8ff9700  1 -- 10.248.32.14:0/1956904746 <== osd.0 10.249.32.13:6800/61227 95 ==== osd_op_reply(33052 1000002d435.00000000 [write 0~2087967] ondisk = 0) v4 ==== 119+0+0 (2441306796 0 0) 0x7f2e74323d50 con 0x7f2e5c108920
2013-06-18 17:01:18.382586 7f2df85f6700  1 -- 10.248.32.14:0/1956904746 --> 10.248.32.14:6800/3856 -- osd_op(client.43498.0:33087 1000002d435.00000000 [read 2087907~60] 0.c993c8ec localize_reads e186) v4 -- ?+0 0x7f2e4158f480 con 0x7f2e3c078340
2013-06-18 17:01:18.389670 7f2df8ff9700  1 -- 10.248.32.14:0/1956904746 <== osd.2 10.248.32.14:6800/3856 32742 ==== osd_op_reply(33087 1000002d435.00000000 [read 2087907~60] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (956439537 0 0) 0x7f2de3f05dc0 con 0x7f2e3c078340

It looks to me like the problem is the localized reads. I think the replica OSD has the writes committed to the journal but isn't ordering reads vs writes properly, allowing the client to see the stale fs state (before the object was written).

The quick workaround here is to disable localized reads. I suspect the fix on the OSD is going to be a bit more involved.

#6 Updated by Sage Weil over 10 years ago

  • Project changed from CephFS to Ceph
  • Subject changed from libcephfs: reads return null instead of recently written data to osd: localized reads (from replicas) reordered wrt writes
  • Category changed from 44 to OSD
  • Assignee deleted (Sage Weil)
  • Priority changed from Urgent to High

#7 Updated by Mike Bryant over 10 years ago

Since disabling localized reads I've not seen the problem occur, so thanks :)

#8 Updated by Sage Weil over 10 years ago

  • Target version set to v0.73

#9 Updated by Sage Weil over 10 years ago

  • Tracker changed from Bug to Fix

#10 Updated by Samuel Just over 10 years ago

  • Target version deleted (v0.73)

#11 Updated by Patrick Donnelly about 4 years ago

  • Status changed from 12 to New

Also available in: Atom PDF