Bug #9679

Ceph hadoop terasort job failure

Added by Huamin Chen over 9 years ago. Updated about 5 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Hadoop/Java
Labels (FS):
Java/Hadoop
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hadoop version: 2.4.1

Ceph version:
ceph --version
ceph version 0.85-986-g031ef05 (031ef0551ebc98d824075558e884faf23ad92f2d)

How to reproduce:
Run 10GB terasort on plana85
su -l hadoop
./hadoop/bin/hadoop jar ./hadoop-2.4.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar terasort -Dmapreduce.job.maps=32 /in-dir /out-dir

Error:
14/10/07 08:02:36 INFO mapreduce.Job: Task Id : attempt_1412690505174_0006_m_000034_0, Status : FAILED
Error: java.io.EOFException: read past eof
at org.apache.hadoop.examples.terasort.TeraInputFormat$TeraRecordReader.nextKeyValue(TeraInputFormat.java:261)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)


Files

client.admin.11179.log (722 KB) client.admin.11179.log bad client Noah Watkins, 10/10/2014 09:13 AM
client.admin.339.log (746 KB) client.admin.339.log good client Noah Watkins, 10/10/2014 09:13 AM
Actions #1

Updated by Noah Watkins over 9 years ago

Thanks for adding this. What command did you use to generate the input?

Actions #2

Updated by Huamin Chen over 9 years ago

Teragen command:
./hadoop/bin/hadoop jar ./hadoop-2.4.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar teragen -Dmapreduce.job.maps=32 100000000 /in-dir

Actions #5

Updated by Noah Watkins over 9 years ago

My bet at this point is on the generation of the input data set. Teragen creates a file of X 100-byte entries; when multiple mappers are used, they divide up the work. Judging by their sizes, some of the input files clearly do not contain a whole multiple of 100 bytes.
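A quick way to test this hypothesis, sketched here as a hypothetical helper (not part of this ticket), is to stat each part file under a kernel CephFS mount and flag any size that is not a whole number of 100-byte records:

#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical checker: report any file whose size is not a whole
 * number of terasort's 100-byte records. */
int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        struct stat st;
        if (stat(argv[i], &st) != 0) {
            perror(argv[i]);
            continue;
        }
        if (st.st_size % 100 != 0)
            printf("%s: %lld bytes, not a multiple of 100\n",
                   argv[i], (long long)st.st_size);
    }
    return 0;
}

Run as, e.g., ./check100 /mnt/cephfs/in-dir-10m/part-m-* (mount path assumed).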

Here is what 1, 10, and 16 mappers created for the same requested input data set size:

1 mapper

hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-1m
14/10/08 22:43:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   2 hadoop           0 2014-10-08 21:45 /in-dir-1m/_SUCCESS
-rw-r--r--   2 hadoop 10000000000 2014-10-08 21:45 /in-dir-1m/part-m-00000

10 mappers

hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-10m
14/10/08 22:41:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 11 items
-rw-r--r--   2 hadoop          0 2014-10-08 22:40 /in-dir-10m/_SUCCESS
-rw-r--r--   2 hadoop  995098624 2014-10-08 22:39 /in-dir-10m/part-m-00000
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00001
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00002
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00003
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00004
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00005
-rw-r--r--   2 hadoop  958398464 2014-10-08 22:39 /in-dir-10m/part-m-00006
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00007
-rw-r--r--   2 hadoop  967835648 2014-10-08 22:39 /in-dir-10m/part-m-00008
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00009

16 mappers

hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-16m
14/10/08 22:32:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 17 items
-rw-r--r--   2 hadoop          0 2014-10-08 22:08 /in-dir-16m/_SUCCESS
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00000
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:06 /in-dir-16m/part-m-00001
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:06 /in-dir-16m/part-m-00002
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00003
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00004
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00005
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:06 /in-dir-16m/part-m-00006
-rw-r--r--   2 hadoop  612376576 2014-10-08 22:07 /in-dir-16m/part-m-00007
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00008
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00009
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00010
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00011
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00012
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00013
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00014
-rw-r--r--   2 hadoop  612372480 2014-10-08 22:07 /in-dir-16m/part-m-00015
Actions #6

Updated by Noah Watkins over 9 years ago

Missing a call to one of these?

/* flush all dirty data and metadata for the whole mount */
int ceph_sync_fs(struct ceph_mount_info *cmount);
/* flush a single open file (data only when syncdataonly is non-zero) */
int ceph_fsync(struct ceph_mount_info *cmount, int fd, int syncdataonly);
Actions #7

Updated by Huamin Chen over 9 years ago

Ran the same tests on HDFS 2.4.1, though on a different setup. Terasort finished without any problem.

Cmd:
./hadoop/bin/hadoop jar ./hadoop-2.4.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar teragen -Dmapreduce.job.maps=32 100000000 /in-dir

Teragen files:
./hadoop/bin/hadoop fs -ls /in-dir
14/10/09 10:01:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 33 items
-rw-r--r--   1 hadoop supergroup         0 2014-10-09 10:00 /in-dir/_SUCCESS
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00000
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00001
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00002
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00003
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00004
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00005
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00006
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00007
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00008
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00009
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00010
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00011
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00012
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00013
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00014
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00015
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00016
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00017
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00018
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00019
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00020
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00021
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00022
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00023
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00024
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00025
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00026
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00027
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00028
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00029
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00030
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00031

Actions #8

Updated by Huamin Chen over 9 years ago

For comparison, teragen files on CephFS

./hadoop/bin/hadoop fs -ls /in-dir-3
14/10/09 08:05:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Found 33 items
-rw-r--r--   2 hadoop         0 2014-10-09 08:04 /in-dir-3/_SUCCESS
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00000
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00001
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00002
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00003
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00004
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00005
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00006
-rw-r--r--   2 hadoop 306184192 2014-10-09 08:02 /in-dir-3/part-m-00007
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00008
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00009
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00010
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00011
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00012
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00013
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00014
-rw-r--r--   2 hadoop 287309824 2014-10-09 08:03 /in-dir-3/part-m-00015
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00016
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00017
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00018
-rw-r--r--   2 hadoop 293601280 2014-10-09 08:03 /in-dir-3/part-m-00019
-rw-r--r--   2 hadoop 287309824 2014-10-09 08:03 /in-dir-3/part-m-00020
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00021
-rw-r--r--   2 hadoop 268435456 2014-10-09 08:03 /in-dir-3/part-m-00022
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00023
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00024
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00025
-rw-r--r--   2 hadoop 283115520 2014-10-09 08:03 /in-dir-3/part-m-00026
-rw-r--r--   2 hadoop 287309824 2014-10-09 08:03 /in-dir-3/part-m-00027
-rw-r--r--   2 hadoop 306184192 2014-10-09 08:03 /in-dir-3/part-m-00028
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00029
-rw-r--r--   2 hadoop 306184192 2014-10-09 08:03 /in-dir-3/part-m-00030
-rw-r--r--   2 hadoop 306184192 2014-10-09 08:03 /in-dir-3/part-m-00031

Actions #9

Updated by Noah Watkins over 9 years ago

Thanks, Huamin. Yeah, it looks like some writes are being lost, probably due to an unclean shutdown. I'll get some traces and look into this.

Actions #10

Updated by Noah Watkins over 9 years ago

empty fs:

nwatkins@martini ~$ cat log 
ubuntu@plana85:~$ sudo rados -p cephfs_data ls
ubuntu@plana85:~$

generate data with teragen:

hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-10m
14/10/09 11:47:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 11 items
-rw-r--r--   2 hadoop          0 2014-10-09 11:39 /in-dir-10m/_SUCCESS
-rw-r--r--   2 hadoop  958398464 2014-10-09 11:38 /in-dir-10m/part-m-00000
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00001
-rw-r--r--   2 hadoop  939524096 2014-10-09 11:38 /in-dir-10m/part-m-00002
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00003
-rw-r--r--   2 hadoop  996147200 2014-10-09 11:38 /in-dir-10m/part-m-00004
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00005
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00006
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00007
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00008
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00009

Summing up the sizes of the objects in RADOS, the totals match what CephFS reports. The extra 3 inodes are probably directories or something similar.
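sum.py itself is not attached to this ticket; a hypothetical C equivalent, assuming out holds "object-name size" pairs and that CephFS data objects are named <inode-hex>.<stripe-index>, would group by the inode prefix and total the bytes:

#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for sum.py: read "object-name size" pairs from
 * stdin, strip the stripe index after the '.', and sum bytes per inode. */
struct entry { char ino[32]; long long bytes; };

int main(void)
{
    static struct entry tab[256];
    int n = 0;
    char name[128];
    long long size;

    while (scanf("%127s %lld", name, &size) == 2) {
        char *dot = strchr(name, '.');
        if (dot)
            *dot = '\0';                 /* keep only the inode prefix */
        int i;
        for (i = 0; i < n; i++)
            if (strcmp(tab[i].ino, name) == 0)
                break;
        if (i == n) {
            if (n == 256)
                continue;                /* table full; skip the rest */
            snprintf(tab[n].ino, sizeof(tab[n].ino), "%s", name);
            tab[n].bytes = 0;
            n++;
        }
        tab[i].bytes += size;
    }

    printf("%d inodes\n", n);
    for (int i = 0; i < n; i++)
        printf("%s %lld\n", tab[i].ino, tab[i].bytes);
    return 0;
}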

ubuntu@plana85:~$ cat out | python sum.py
13 inodes
10000001c8d 958398464
10000002c65 1000000000
100000004a7 334
100000008b6 1000000000
100000014a1 1000000000
100000004a9 92039
100000004a8 83772
10000002477 1000000000
100000010aa 1000000000
10000001898 1000000000
10000002082 1000000000
1000000286e 996147200
10000000cb3 939524096

Actions #11

Updated by Noah Watkins over 9 years ago

Here is the directory listing. All of the files should be the same size.

ubuntu@plana55:~$ ls -l ceph-mount/in-dir-1g-noah/
total 880475
-rw-r--r-- 1 hadoop hadoop 100000000 Oct 10 07:54 part-m-00000
-rw-r--r-- 1 hadoop hadoop  76546048 Oct 10 07:54 part-m-00001
-rw-r--r-- 1 hadoop hadoop  67108864 Oct 10 07:54 part-m-00002
-rw-r--r-- 1 hadoop hadoop  95420416 Oct 10 07:54 part-m-00003
-rw-r--r-- 1 hadoop hadoop 100000000 Oct 10 07:54 part-m-00004
-rw-r--r-- 1 hadoop hadoop 100000000 Oct 10 07:54 part-m-00005
-rw-r--r-- 1 hadoop hadoop  76546048 Oct 10 07:54 part-m-00006
-rw-r--r-- 1 hadoop hadoop  85983232 Oct 10 07:54 part-m-00007
-rw-r--r-- 1 hadoop hadoop 100000000 Oct 10 07:54 part-m-00008
-rw-r--r-- 1 hadoop hadoop 100000000 Oct 10 07:54 part-m-00009
-rw-r--r-- 1 hadoop hadoop         0 Oct 10 07:55 _SUCCESS

I have traces for each client that wrote these. Let's look at part-m-00002.

nwatkins@martini 96-writes$ grep -l "part-m-00002" *
client.admin.11179.log

There are 96 write calls:

nwatkins@martini 96-writes$ cat client.admin.11179.log | grep "write(" | wc -l
96

And here is what the calls look like:

nwatkins@martini 96-writes$ cat client.admin.11179.log | grep "write(" 
2014-10-10 07:54:02.904735 7fd0175d9700  3 client.5521 write(12, "...", 1048576, -1) = 1048576
2014-10-10 07:54:02.968177 7fd0175d9700  3 client.5521 write(12, "...", 1048576, -1) = 1048576
... clip ...
2014-10-10 07:54:04.889349 7fd0175d9700  3 client.5521 write(12, "...", 1048576, -1) = 1048576
2014-10-10 07:54:04.898805 7fd0175d9700  3 client.5521 write(12, "...", 385280, -1) = 385280

So: 96 writes, all but the last of 1048576 bytes, and 1048576 * 95 + 385280 = 100000000. All the writes at least reach the client. The file is also successfully closed, right after the last write in the previous trace:

3456 2014-10-10 07:54:04.898805 7fd0175d9700  3 client.5521 write(12, "...", 385280, -1) = 385280
3457 2014-10-10 07:54:04.898808 7fd0175d9700 10 jni: write: exit ret 385280
3458 2014-10-10 07:54:04.898829 7fd0175d9700 10 jni: close: fd 12
3459 2014-10-10 07:54:04.898832 7fd0175d9700  3 client.5521 close enter(12)

What's interesting is that there appears to be a race between unmount and the Hadoop client being killed. After the close, the trace shows the client enter ceph_unmount but never exit. A trace from a client that wrote a complete file also doesn't exit unmount, but at the very end of that trace I do see something about caps, and that is not present in the trace from the client that generated a bad file.

I attached both traces. Bad-file client: client.admin.11179.log; good-file client: client.admin.339.log.

The only other piece of information I know about regarding these files is that after the writing client is done, another process in the system renames the file. But that should be serialized with the writing client's close of the file.

nwatkins@martini teragen-10m-1g$ grep "in-dir-1g-noah\/part-" *.log | grep rename
client.admin.10009.log:2014-10-10 07:55:08.139031 7fd17903d700 10 jni: rename: from /in-dir-1g-noah/_temporary/1/task_1412832704798_0006_m_000002/part-m-00002 to /in-dir-1g-noah/part-m-00002
client.admin.10009.log:2014-10-10 07:55:08.197841 7fd17903d700 10 jni: rename: from /in-dir-1g-noah/_temporary/1/task_1412832704798_0006_m_000009/part-m-00009 to /in-dir-1g-noah/part-m-00009
Actions #12

Updated by Greg Farnum over 9 years ago

Looking at the bad client (11179), the first thing I notice is that the messaging is way backed up. What's the network pipe here, and is it being shared?

2014-10-10 07:54:04.102583 7fcff7439700 20 -- 10.214.132.23:0/1819159467 submit_message osd_op(client.5521.0:6 10000000cd0.00000000 [write 37748736~9437184] 1.4b78efdc snapc 1=[] ondisk+write+localize_reads+known_if_redirected e33) v4 remote, 10.214.132.27:6800/7357, have pipe.
...
2014-10-10 07:54:04.937606 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer encoding 6 features 35184372088831 0x7fd0001c08b0 osd_op(client.5521.0:6 10000000cd0.00000000 [write 37748736~9437184] 1.4b78efdc snapc 1=[] ondisk+write+localize_reads+known_if_redirected e33) v4
2014-10-10 07:54:04.939968 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer signed seq # 6): sig = 6511400259401246671
2014-10-10 07:54:04.940000 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer sending 6 0x7fd0001c08b0

So here we have the sixth message (the fifth write) being encoded onto the wire some 0.84 seconds after it was queued by the upper layers. That's fairly unusual. If I do some sampling on the Pipe writer thread:

2014-10-10 07:54:03.199084 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer encoding 2 features 35184372088831 0x7fd0001c08b0 osd_op(client.5521.0:2 10000000cd0.00000000 [write 0~9437184] 1.4b78efdc snapc 1=[] ondisk+write+localize_reads+known_if_redirected e33) v4
2014-10-10 07:54:03.201066 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer signed seq # 2): sig = 7993448059363581721
2014-10-10 07:54:03.201095 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer sending 2 0x7fd0001c08b0
2014-10-10 07:54:03.869905 7fcff3f2e700 10 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer: state = open policy.server=0
2014-10-10 07:54:03.869950 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer encoding 3 features 35184372088831 0x7fd0001c10b0 osd_op(client.5521.0:3 10000000cd0.00000000 [write 9437184~9437184] 1.4b78efdc snapc 1=[] ondisk+write+localize_reads+known_if_redirected e33) v4
2014-10-10 07:54:03.872856 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer signed seq # 3): sig = 17390466052185533091
2014-10-10 07:54:03.872902 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer sending 3 0x7fd0001c10b0
2014-10-10 07:54:04.247741 7fcff3f2e700 10 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer: state = open policy.server=0

It looks like simply writing the second message out to (not necessarily even over) the socket took some 0.66 seconds; the third took ~0.37. Those messages carry 9 MB of data each, which on a 1 Gb/s connection should take only ~0.075 seconds apiece.

Then the end of the log...doesn't look like an end. I can see the unmount starting and waiting on an MDS request, but there should also be lots of flushing of data and things.

Is it possible that the Java or Hadoop runtime is terminating the process when it blocks for a little while?

The good client (339) is not suffering from the strange messenger blocked-up-ness.

Actions #13

Updated by Noah Watkins over 9 years ago

I do believe that Hadoop kills the clients once the runtime believes everything has been flushed. I wrongly assumed that close flushed everything out. Adding an fsync to the close path seems to make things work as expected.
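For reference, a minimal sketch of that ordering against the libcephfs C API (the actual fix lives in the cephfs-hadoop close path; write_and_flush is just an illustrative name, and error handling is reduced to early returns):

#include <fcntl.h>
#include <sys/types.h>
#include <cephfs/libcephfs.h>

/* Write a buffer to a new file and force the data to the cluster before
 * close, so an abrupt kill of the process cannot lose buffered writes. */
int write_and_flush(struct ceph_mount_info *cmount,
                    const char *path, const char *buf, size_t len)
{
    int fd = ceph_open(cmount, path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return fd;

    int ret = ceph_write(cmount, fd, buf, len, -1);   /* -1: current offset */
    if (ret < 0) {
        ceph_close(cmount, fd);
        return ret;
    }

    /* Flush dirty data before close; without this, killing the process
     * right after close() returns can drop buffered writes. */
    ret = ceph_fsync(cmount, fd, 1 /* syncdataonly */);
    if (ret < 0) {
        ceph_close(cmount, fd);
        return ret;
    }

    return ceph_close(cmount, fd);
}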

hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-1g-noah
Found 11 items
-rw-r--r--   2 hadoop          0 2014-10-10 16:06 /in-dir-1g-noah/_SUCCESS
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00000
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00001
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00002
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00003
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00004
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00005
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00006
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00007
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00008
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00009
hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-10g-noah
14/10/10 16:10:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 32 items
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00000
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:07 /in-dir-10g-noah/part-m-00001
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00002
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00003
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00004
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:07 /in-dir-10g-noah/part-m-00005
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00006
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00007
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00008
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:07 /in-dir-10g-noah/part-m-00009
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00010
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00011
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00012
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00013
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00014
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00015
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00016
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00017
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00018
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00019
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00020
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00021
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00022
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00023
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00024
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00025
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00026
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00027
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00028
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00029
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00030
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00031
Actions #14

Updated by Noah Watkins over 9 years ago

  • Status changed from New to Closed

Fixed in cephfs-hadoop repo.

Actions #15

Updated by Dmitry Buzz about 8 years ago

I'm testing ceph version 10.0.3-1325-g98fba62 on Hadoop 3.0.0, and I see the same error in the Hadoop terasort.
I'm not sure whether this is a Hadoop or a Ceph issue, but I can upload logs if the Ceph developers want to pursue it further.

Actions #16

Updated by Greg Farnum about 8 years ago

Dmitry, can you give more detail on your environment? Which Hadoop, and which bindings did you use to connect Ceph and Hadoop?

(Also, ping Noah.)

Actions #17

Updated by Dmitry Buzz about 8 years ago

Greg,

My setup was simple: the Ceph JNI bindings were built from the Ceph repo.
The Ceph hadoop2 integration was built from here: https://github.com/GregBowyer/cephfs-hadoop.git using Maven. All .so and .jar files were exposed accordingly, and a simple file put/get worked OK, but the terasort benchmark showed this error.

For what it's worth, I think Ceph shouldn't chase Hadoop HDFS compatibility, and should instead design a proper interface to Spark RDDs and DataFrames that exposes data locality. That is a better approach than impersonating HDFS, since it decouples Ceph from the HDFS dependency; I'm not sure whether it is possible, though. I've also found in terasort tests that Ceph (x2 replication) is 25%-40% slower than plain HDFS (x2 replication), which came as a bit of a surprise to me. On the other hand, if the Ceph community wants to maintain proper integration with Hadoop, it should really take over cephfs-hadoop and keep it in sync with Hadoop versions. It takes an unimaginable amount of effort to integrate the two, since the Ceph site documentation and user list archives are outdated on this subject, and googling every possible hiccup along the way is not for the fainthearted.

The other route is to develop a data-locality standard for parallel data stores like Ceph and expose it to host processes in a universal manner, independent of the underlying data format (CSV, object storage, parallel HDF5, etc.). This is a daunting task, but without it Ceph is not an attractive option as underlying storage for parallel computation frameworks in the ML and data analytics domain.

Actions #18

Updated by Dmitry Buzz about 8 years ago

I forgot to mention: I tested both Hadoop 2.7.1 and 2.7.2, with the same outcome.
64-bit Arch Linux circa Jan 2015, on a 2-10 node cluster.

Actions #19

Updated by Greg Farnum almost 8 years ago

  • Component(FS) Hadoop/Java added
Actions #20

Updated by Patrick Donnelly about 5 years ago

  • Category deleted (48)
  • Labels (FS) Java/Hadoop added