Fix #5388: osd: localized reads (from replicas) reordered wrt writes - Ceph - Ceph

Actions

Copy link

Fix #5388

open

osd: localized reads (from replicas) reordered wrt writes

Added by Mike Bryant almost 11 years ago. Updated over 4 years ago.

Status:

New

Priority:

High

Assignee:

Category:

OSD

Target version:

% Done:

Source:

Community (user)

Tags:

Backport:

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

I'm using hbase, with the hadoop-cephfs bindings, on top of a ceph 0.61 cluster.
I'm seeing instances where reading part of a file is returning all nulls, instead of the valid data.
This is showing up at the application level where it's trying to validate a magic number is there, failing, and crashing.

I've been running debugging the the hadoop bindings, and it appears to be doing the right thing, and just getting back nulls from the libcephfs layer.
The file it's reading incorrect data from is: ceph://null/hbase/tsdb/e50e991f7082c47e9172c5421af6e5f3/t/3847239684462394768

This is the interesting bit from the regionserver log:
2013-06-17 19:01:46,984 INFO org.apache.hadoop.fs.ceph.CephFileSystem: selectDataPool path=ceph://null/hbase/tsdb/e50e991f7082c47e9172c5421af6e5f3/.tmp/3566091379609591353 pool:repl=data:4 wanted=3
2013-06-17 19:01:47,172 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Bloom added to HFile (ceph://null/hbase/tsdb/e50e991f7082c47e9172c5421af6e5f3/.tmp/3566091379609591353): 4.5k, 3687/3841 (96%)
2013-06-17 19:01:47,233 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at ceph://null/hbase/tsdb/e50e991f7082c47e9172c5421af6e5f3/.tmp/3566091379609591353 to ceph://null/hbase/tsdb/e50e991f7082c47e9172c5421af6e5f3/t/3847239684462394768
2013-06-17 19:01:47,841 DEBUG org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream constructor: initializing stream with fh 2651 and file length 2597375
2013-06-17 19:01:47,841 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.seek: Seeking to position 2597315 on fd 2651
2013-06-17 19:01:47,841 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.read: Reading 8 bytes from fd 2651
2013-06-17 19:01:47,841 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.read: Reading 8 bytes from fd 2651: succeeded in reading 8 bytes
2013-06-17 19:01:47,841 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.read: 8 byte output: 0000000000000000
(This is the bit where it's reading the trailer, and it should be some bytes corresponding to TRABLK)
2013-06-17 19:01:47,842 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=ec02sv14.ocado.com,60020,1371491732490, load=(requests=264597, regions=132, usedHeap=2709, maxHeap=4062): Replay of HLog required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,\x00\x00fQ\xB1\xE7`\x00\x00\x01\x00\x0E\xC2\x00\x00\x03\x00\x00\xC1,1371473213321.e50e991f7082c47e9172c5421af6e5f3.
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1070)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:967)
at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:915)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:394)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:202)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:222)
Caused by: java.io.IOException: Trailer 'header' is wrong; does the trailer size match content?
at org.apache.hadoop.hbase.io.hfile.HFile$FixedFileTrailer.deserialize(HFile.java:1527)
at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readTrailer(HFile.java:885)
at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:819)
at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.loadFileInfo(StoreFile.java:1003)
at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:382)
at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:438)
at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:557)
at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:496)
at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:83)
at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1576)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1046)
... 5 more

For comparison this is an instance where it was fine:
2013-06-17 19:01:45,872 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.read: Reading 8 bytes from fd 2644: succeeded in rea
ding 8 bytes
2013-06-17 19:01:45,872 TRACE org.apache.hadoop.fs.ceph.CephInputStream: CephInputStream.read: 8 byte output: 545241424c4b2224

When I look at the file it was reading from a fuse mountpoint, I can see the correct sequence of bytes.

Actions

Copy link

Updated by Sage Weil almost 11 years ago

Subject changed from Reading data returns nulls to libcephfs: reads return null instead of recently written data
Status changed from New to Need More Info

Hi Mike,

I gather that the same data was just written by a different node in the cluster? And this is right near/at the end of file? The logs that we need to debug this are

debug ms = 1
 debug client = 20

on the client side. Adding the above to [client] in the ceph.conf on the clients and ensuring they are logging (by default it's /var/log/ceph/ceph-ceph.client.$id.$pid.log but you may have to futz with permissions or paths or something in your environment). We'll then need to see the logs from both the writer(s) and the failing reader.

Thanks!

Actions

Copy link

Updated by Mike Bryant almost 11 years ago

I've just reproduced it with those log levels.
There was 1 master, 1 regionserver.
So I think the writing and reading both happened on the regionserver.

The data is right near the end of the file (the trailer is 60 bytes long, the 8 it's choking on is at the start of that region)

I've uploaded the data to http://xelix.net/ceph/5388/, since there seems to be a filesize limit on the tracker.

Actions

Copy link

Updated by Sage Weil almost 11 years ago

Assignee set to Sage Weil

Actions

Copy link

Updated by Sage Weil almost 11 years ago

Status changed from Need More Info to 12

Actions

Copy link

Updated by Sage Weil almost 11 years ago

flab:5388 08:57 PM $ grep -- -- 1000002d435 |grep osd_op
2013-06-18 17:01:16.972976 7f2df85f6700  1 -- 10.248.32.14:0/1956904746 --> 10.249.32.13:6800/61227 -- osd_op(client.43498.0:33052 1000002d435.00000000 [write 0~2087967] 0.c993c8ec snapc 1=[] localize_reads e186) v4 -- ?+0 0x7f2e402eb8b0 con 0x7f2e5c108920
2013-06-18 17:01:18.357697 7f2df8ff9700  1 -- 10.248.32.14:0/1956904746 <== osd.0 10.249.32.13:6800/61227 95 ==== osd_op_reply(33052 1000002d435.00000000 [write 0~2087967] ondisk = 0) v4 ==== 119+0+0 (2441306796 0 0) 0x7f2e74323d50 con 0x7f2e5c108920
2013-06-18 17:01:18.382586 7f2df85f6700  1 -- 10.248.32.14:0/1956904746 --> 10.248.32.14:6800/3856 -- osd_op(client.43498.0:33087 1000002d435.00000000 [read 2087907~60] 0.c993c8ec localize_reads e186) v4 -- ?+0 0x7f2e4158f480 con 0x7f2e3c078340
2013-06-18 17:01:18.389670 7f2df8ff9700  1 -- 10.248.32.14:0/1956904746 <== osd.2 10.248.32.14:6800/3856 32742 ==== osd_op_reply(33087 1000002d435.00000000 [read 2087907~60] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (956439537 0 0) 0x7f2de3f05dc0 con 0x7f2e3c078340

It looks to me like the problem is the localized reads. I think the replica OSD has the writes committed to the journal but isn't ordering reads vs writes properly, allowing the client to see the stale fs state (before the object was written).

The quick workaround here is to disable localized reads. I suspect the fix on the OSD is going to be a bit more involved.

Actions

Copy link

Updated by Sage Weil almost 11 years ago

Project changed from CephFS to Ceph
Subject changed from libcephfs: reads return null instead of recently written data to osd: localized reads (from replicas) reordered wrt writes
Category changed from 44 to OSD
Assignee deleted (~~Sage Weil~~)
Priority changed from Urgent to High