Bug #9679

Ceph hadoop terasort job failure

Added by Huamin Chen over 9 years ago. Updated about 5 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Hadoop/Java
Labels (FS):
Java/Hadoop
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hadoop version: 2.4.1

Ceph version:
ceph --version
ceph version 0.85-986-g031ef05 (031ef0551ebc98d824075558e884faf23ad92f2d)

How to reproduce:
Run 10GB terasort on plana85
su -l hadoop
./hadoop/bin/hadoop jar ./hadoop-2.4.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar terasort -Dmapreduce.job.maps=32 /in-dir /out-dir

Error:
14/10/07 08:02:36 INFO mapreduce.Job: Task Id : attempt_1412690505174_0006_m_000034_0, Status : FAILED
Error: java.io.EOFException: read past eof
at org.apache.hadoop.examples.terasort.TeraInputFormat$TeraRecordReader.nextKeyValue(TeraInputFormat.java:261)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)


Files

client.admin.11179.log (722 KB) client.admin.11179.log bad client Noah Watkins, 10/10/2014 09:13 AM
client.admin.339.log (746 KB) client.admin.339.log good client Noah Watkins, 10/10/2014 09:13 AM
Actions #1

Updated by Noah Watkins over 9 years ago

Thanks for adding this. What command did you use to generate the input?

Actions #2

Updated by Huamin Chen over 9 years ago

Teragen command:
./hadoop/bin/hadoop jar ./hadoop-2.4.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar teragen -Dmapreduce.job.maps=32 100000000 /in-dir

Actions #5

Updated by Noah Watkins over 9 years ago

My bet at this point is on the generation of the input data set. Teragen creates a file of X 100-byte entries; when multiple mappers are used, they divide up the work. Judging by their sizes, some of the input files clearly do not contain a whole multiple of 100 bytes.
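A quick way to test this hypothesis, sketched here as a hypothetical helper (not part of this ticket), is to stat each part file under a kernel CephFS mount and flag any size that is not a whole number of 100-byte records:

#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical checker: report any file whose size is not a whole
 * number of terasort's 100-byte records. */
int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        struct stat st;
        if (stat(argv[i], &st) != 0) {
            perror(argv[i]);
            continue;
        }
        if (st.st_size % 100 != 0)
            printf("%s: %lld bytes, not a multiple of 100\n",
                   argv[i], (long long)st.st_size);
    }
    return 0;
}

Run as, e.g., ./check100 /mnt/cephfs/in-dir-10m/part-m-* (mount path assumed).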

Here is what 1, 10, and 16 mappers created for the same requested input data set size:

1 mapper

hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-1m
14/10/08 22:43:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   2 hadoop           0 2014-10-08 21:45 /in-dir-1m/_SUCCESS
-rw-r--r--   2 hadoop 10000000000 2014-10-08 21:45 /in-dir-1m/part-m-00000

10 mappers

hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-10m
14/10/08 22:41:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 11 items
-rw-r--r--   2 hadoop          0 2014-10-08 22:40 /in-dir-10m/_SUCCESS
-rw-r--r--   2 hadoop  995098624 2014-10-08 22:39 /in-dir-10m/part-m-00000
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00001
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00002
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00003
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00004
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00005
-rw-r--r--   2 hadoop  958398464 2014-10-08 22:39 /in-dir-10m/part-m-00006
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00007
-rw-r--r--   2 hadoop  967835648 2014-10-08 22:39 /in-dir-10m/part-m-00008
-rw-r--r--   2 hadoop 1000000000 2014-10-08 22:39 /in-dir-10m/part-m-00009

16 mappers

hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-16m
14/10/08 22:32:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 17 items
-rw-r--r--   2 hadoop          0 2014-10-08 22:08 /in-dir-16m/_SUCCESS
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00000
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:06 /in-dir-16m/part-m-00001
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:06 /in-dir-16m/part-m-00002
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00003
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00004
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00005
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:06 /in-dir-16m/part-m-00006
-rw-r--r--   2 hadoop  612376576 2014-10-08 22:07 /in-dir-16m/part-m-00007
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00008
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00009
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00010
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00011
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00012
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00013
-rw-r--r--   2 hadoop  625000000 2014-10-08 22:07 /in-dir-16m/part-m-00014
-rw-r--r--   2 hadoop  612372480 2014-10-08 22:07 /in-dir-16m/part-m-00015
Actions #6

Updated by Noah Watkins over 9 years ago

Missing a call to one of these?

/* flush all dirty data and metadata for the whole mount */
int ceph_sync_fs(struct ceph_mount_info *cmount);
/* flush a single open file (data only when syncdataonly is non-zero) */
int ceph_fsync(struct ceph_mount_info *cmount, int fd, int syncdataonly);
Actions #7

Updated by Huamin Chen over 9 years ago

Ran the same tests on HDFS 2.4.1, though on a different setup. Terasort finished without any problem.

Cmd:
./hadoop/bin/hadoop jar ./hadoop-2.4.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar teragen -Dmapreduce.job.maps=32 100000000 /in-dir

Teragen files:
./hadoop/bin/hadoop fs -ls /in-dir
14/10/09 10:01:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 33 items
-rw-r--r--   1 hadoop supergroup         0 2014-10-09 10:00 /in-dir/_SUCCESS
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00000
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00001
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00002
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00003
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00004
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00005
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00006
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00007
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00008
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00009
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00010
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00011
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00012
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00013
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00014
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00015
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00016
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00017
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00018
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00019
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00020
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00021
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00022
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00023
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00024
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00025
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00026
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00027
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00028
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00029
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00030
-rw-r--r--   1 hadoop supergroup 312500000 2014-10-09 10:00 /in-dir/part-m-00031

Actions #8

Updated by Huamin Chen over 9 years ago

For comparison, teragen files on CephFS

./hadoop/bin/hadoop fs -ls /in-dir-3
14/10/09 08:05:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Found 33 items
-rw-r--r--   2 hadoop         0 2014-10-09 08:04 /in-dir-3/_SUCCESS
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00000
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00001
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00002
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00003
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00004
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00005
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00006
-rw-r--r--   2 hadoop 306184192 2014-10-09 08:02 /in-dir-3/part-m-00007
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00008
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00009
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00010
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:02 /in-dir-3/part-m-00011
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00012
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00013
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00014
-rw-r--r--   2 hadoop 287309824 2014-10-09 08:03 /in-dir-3/part-m-00015
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00016
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00017
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00018
-rw-r--r--   2 hadoop 293601280 2014-10-09 08:03 /in-dir-3/part-m-00019
-rw-r--r--   2 hadoop 287309824 2014-10-09 08:03 /in-dir-3/part-m-00020
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00021
-rw-r--r--   2 hadoop 268435456 2014-10-09 08:03 /in-dir-3/part-m-00022
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00023
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00024
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00025
-rw-r--r--   2 hadoop 283115520 2014-10-09 08:03 /in-dir-3/part-m-00026
-rw-r--r--   2 hadoop 287309824 2014-10-09 08:03 /in-dir-3/part-m-00027
-rw-r--r--   2 hadoop 306184192 2014-10-09 08:03 /in-dir-3/part-m-00028
-rw-r--r--   2 hadoop 312500000 2014-10-09 08:03 /in-dir-3/part-m-00029
-rw-r--r--   2 hadoop 306184192 2014-10-09 08:03 /in-dir-3/part-m-00030
-rw-r--r--   2 hadoop 306184192 2014-10-09 08:03 /in-dir-3/part-m-00031

Actions #9

Updated by Noah Watkins over 9 years ago

Thanks, Huamin. Yeah, it looks like some writes are being lost, probably due to an unclean shutdown. I'll get some traces and look into this.

Actions #10

Updated by Noah Watkins over 9 years ago

empty fs:

nwatkins@martini ~$ cat log 
ubuntu@plana85:~$ sudo rados -p cephfs_data ls
ubuntu@plana85:~$

generate data with teragen:

hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-10m
14/10/09 11:47:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 11 items
-rw-r--r--   2 hadoop          0 2014-10-09 11:39 /in-dir-10m/_SUCCESS
-rw-r--r--   2 hadoop  958398464 2014-10-09 11:38 /in-dir-10m/part-m-00000
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00001
-rw-r--r--   2 hadoop  939524096 2014-10-09 11:38 /in-dir-10m/part-m-00002
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00003
-rw-r--r--   2 hadoop  996147200 2014-10-09 11:38 /in-dir-10m/part-m-00004
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00005
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00006
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00007
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00008
-rw-r--r--   2 hadoop 1000000000 2014-10-09 11:38 /in-dir-10m/part-m-00009

Summing up the sizes of the objects in RADOS, the totals match what CephFS reports. The extra 3 inodes are probably directories or something similar.
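sum.py itself is not attached to this ticket; a hypothetical C equivalent, assuming out holds "object-name size" pairs and that CephFS data objects are named <inode-hex>.<stripe-index>, would group by the inode prefix and total the bytes:

#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for sum.py: read "object-name size" pairs from
 * stdin, strip the stripe index after the '.', and sum bytes per inode. */
struct entry { char ino[32]; long long bytes; };

int main(void)
{
    static struct entry tab[256];
    int n = 0;
    char name[128];
    long long size;

    while (scanf("%127s %lld", name, &size) == 2) {
        char *dot = strchr(name, '.');
        if (dot)
            *dot = '\0';                 /* keep only the inode prefix */
        int i;
        for (i = 0; i < n; i++)
            if (strcmp(tab[i].ino, name) == 0)
                break;
        if (i == n) {
            if (n == 256)
                continue;                /* table full; skip the rest */
            snprintf(tab[n].ino, sizeof(tab[n].ino), "%s", name);
            tab[n].bytes = 0;
            n++;
        }
        tab[i].bytes += size;
    }

    printf("%d inodes\n", n);
    for (int i = 0; i < n; i++)
        printf("%s %lld\n", tab[i].ino, tab[i].bytes);
    return 0;
}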

ubuntu@plana85:~$ cat out | python sum.py
13 inodes
10000001c8d 958398464
10000002c65 1000000000
100000004a7 334
100000008b6 1000000000
100000014a1 1000000000
100000004a9 92039
100000004a8 83772
10000002477 1000000000
100000010aa 1000000000
10000001898 1000000000
10000002082 1000000000
1000000286e 996147200
10000000cb3 939524096

Actions #11

Updated by Noah Watkins over 9 years ago

Here is the directory listing. All of the files should be the same size.

ubuntu@plana55:~$ ls -l ceph-mount/in-dir-1g-noah/
total 880475
-rw-r--r-- 1 hadoop hadoop 100000000 Oct 10 07:54 part-m-00000
-rw-r--r-- 1 hadoop hadoop  76546048 Oct 10 07:54 part-m-00001
-rw-r--r-- 1 hadoop hadoop  67108864 Oct 10 07:54 part-m-00002
-rw-r--r-- 1 hadoop hadoop  95420416 Oct 10 07:54 part-m-00003
-rw-r--r-- 1 hadoop hadoop 100000000 Oct 10 07:54 part-m-00004
-rw-r--r-- 1 hadoop hadoop 100000000 Oct 10 07:54 part-m-00005
-rw-r--r-- 1 hadoop hadoop  76546048 Oct 10 07:54 part-m-00006
-rw-r--r-- 1 hadoop hadoop  85983232 Oct 10 07:54 part-m-00007
-rw-r--r-- 1 hadoop hadoop 100000000 Oct 10 07:54 part-m-00008
-rw-r--r-- 1 hadoop hadoop 100000000 Oct 10 07:54 part-m-00009
-rw-r--r-- 1 hadoop hadoop         0 Oct 10 07:55 _SUCCESS

I have traces for each client that wrote these. Let's look at part-m-00002.

nwatkins@martini 96-writes$ grep -l "part-m-00002" *
client.admin.11179.log

There are 96 write calls:

nwatkins@martini 96-writes$ cat client.admin.11179.log | grep "write(" | wc -l
96

And here is what the calls look like:

nwatkins@martini 96-writes$ cat client.admin.11179.log | grep "write(" 
2014-10-10 07:54:02.904735 7fd0175d9700  3 client.5521 write(12, "...", 1048576, -1) = 1048576
2014-10-10 07:54:02.968177 7fd0175d9700  3 client.5521 write(12, "...", 1048576, -1) = 1048576
... clip ...
2014-10-10 07:54:04.889349 7fd0175d9700  3 client.5521 write(12, "...", 1048576, -1) = 1048576
2014-10-10 07:54:04.898805 7fd0175d9700  3 client.5521 write(12, "...", 385280, -1) = 385280

So: 96 writes, all but the last of 1048576 bytes, and 1048576 * 95 + 385280 = 100000000. All the writes at least reach the client. The file is also successfully closed, right after the last write in the previous trace:

3456 2014-10-10 07:54:04.898805 7fd0175d9700  3 client.5521 write(12, "...", 385280, -1) = 385280
3457 2014-10-10 07:54:04.898808 7fd0175d9700 10 jni: write: exit ret 385280
3458 2014-10-10 07:54:04.898829 7fd0175d9700 10 jni: close: fd 12
3459 2014-10-10 07:54:04.898832 7fd0175d9700  3 client.5521 close enter(12)

What's interesting is that there appears to be a race between unmount and the Hadoop client being killed. After the close, the trace shows the client enter ceph_unmount but never exit. A trace from a client that wrote a complete file also doesn't exit unmount, but at the very end of that trace I do see something about caps, and that is not present in the trace from the client that generated a bad file.

I attached both traces. Bad-file client: client.admin.11179.log; good-file client: client.admin.339.log.

The only other piece of information I know about regarding these files is that after the writing client is done, another process in the system renames the file. But that should be serialized with the writing client's close of the file.

nwatkins@martini teragen-10m-1g$ grep "in-dir-1g-noah\/part-" *.log | grep rename
client.admin.10009.log:2014-10-10 07:55:08.139031 7fd17903d700 10 jni: rename: from /in-dir-1g-noah/_temporary/1/task_1412832704798_0006_m_000002/part-m-00002 to /in-dir-1g-noah/part-m-00002
client.admin.10009.log:2014-10-10 07:55:08.197841 7fd17903d700 10 jni: rename: from /in-dir-1g-noah/_temporary/1/task_1412832704798_0006_m_000009/part-m-00009 to /in-dir-1g-noah/part-m-00009
Actions #12

Updated by Greg Farnum over 9 years ago

Looking at the bad client (11179), the first thing I notice is that the messaging is way backed up. What's the network pipe here, and is it being shared?

2014-10-10 07:54:04.102583 7fcff7439700 20 -- 10.214.132.23:0/1819159467 submit_message osd_op(client.5521.0:6 10000000cd0.00000000 [write 37748736~9437184] 1.4b78efdc snapc 1=[] ondisk+write+localize_reads+known_if_redirected e33) v4 remote, 10.214.132.27:6800/7357, have pipe.
...
2014-10-10 07:54:04.937606 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer encoding 6 features 35184372088831 0x7fd0001c08b0 osd_op(client.5521.0:6 10000000cd0.00000000 [write 37748736~9437184] 1.4b78efdc snapc 1=[] ondisk+write+localize_reads+known_if_redirected e33) v4
2014-10-10 07:54:04.939968 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer signed seq # 6): sig = 6511400259401246671
2014-10-10 07:54:04.940000 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer sending 6 0x7fd0001c08b0

So here we have the sixth message (the fifth write) being encoded onto the wire some 0.84 seconds after it was queued by the upper layers. That's fairly unusual. If I do some sampling on the Pipe writer thread:

2014-10-10 07:54:03.199084 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer encoding 2 features 35184372088831 0x7fd0001c08b0 osd_op(client.5521.0:2 10000000cd0.00000000 [write 0~9437184] 1.4b78efdc snapc 1=[] ondisk+write+localize_reads+known_if_redirected e33) v4
2014-10-10 07:54:03.201066 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer signed seq # 2): sig = 7993448059363581721
2014-10-10 07:54:03.201095 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer sending 2 0x7fd0001c08b0
2014-10-10 07:54:03.869905 7fcff3f2e700 10 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer: state = open policy.server=0
2014-10-10 07:54:03.869950 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer encoding 3 features 35184372088831 0x7fd0001c10b0 osd_op(client.5521.0:3 10000000cd0.00000000 [write 9437184~9437184] 1.4b78efdc snapc 1=[] ondisk+write+localize_reads+known_if_redirected e33) v4
2014-10-10 07:54:03.872856 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer signed seq # 3): sig = 17390466052185533091
2014-10-10 07:54:03.872902 7fcff3f2e700 20 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer sending 3 0x7fd0001c10b0
2014-10-10 07:54:04.247741 7fcff3f2e700 10 -- 10.214.132.23:0/1819159467 >> 10.214.132.27:6800/7357 pipe(0x7fd010ee4ee0 sd=194 :50309 s=2 pgs=1009 cs=1 l=1 c=0x7fd010ee5170).writer: state = open policy.server=0

It looks like simply writing the second message out to (not necessarily even over) the socket took some 0.66 seconds; the third took ~0.37. Those messages carry 9 MB of data each, which on a 1 Gb/s connection should take only ~0.075 seconds apiece.

Then the end of the log...doesn't look like an end. I can see the unmount starting and waiting on an MDS request, but there should also be lots of flushing of data and things.

Is it possible that the Java or Hadoop runtime is terminating the process when it blocks for a little while?

The good client (339) is not suffering from the strange messenger blocked-up-ness.

Actions #13

Updated by Noah Watkins over 9 years ago

I do believe that Hadoop kills the clients once the runtime believes everything has been flushed. I wrongly assumed that close flushed everything out. Adding an fsync to the close path seems to make things work as expected.
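For reference, a minimal sketch of that ordering against the libcephfs C API (the actual fix lives in the cephfs-hadoop close path; write_and_flush is just an illustrative name, and error handling is reduced to early returns):

#include <fcntl.h>
#include <sys/types.h>
#include <cephfs/libcephfs.h>

/* Write a buffer to a new file and force the data to the cluster before
 * close, so an abrupt kill of the process cannot lose buffered writes. */
int write_and_flush(struct ceph_mount_info *cmount,
                    const char *path, const char *buf, size_t len)
{
    int fd = ceph_open(cmount, path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return fd;

    int ret = ceph_write(cmount, fd, buf, len, -1);   /* -1: current offset */
    if (ret < 0) {
        ceph_close(cmount, fd);
        return ret;
    }

    /* Flush dirty data before close; without this, killing the process
     * right after close() returns can drop buffered writes. */
    ret = ceph_fsync(cmount, fd, 1 /* syncdataonly */);
    if (ret < 0) {
        ceph_close(cmount, fd);
        return ret;
    }

    return ceph_close(cmount, fd);
}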

hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-1g-noah
Found 11 items
-rw-r--r--   2 hadoop          0 2014-10-10 16:06 /in-dir-1g-noah/_SUCCESS
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00000
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00001
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00002
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00003
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00004
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00005
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00006
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00007
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00008
-rw-r--r--   2 hadoop  100000000 2014-10-10 16:05 /in-dir-1g-noah/part-m-00009
hadoop@plana85:~$ hadoop/bin/hadoop fs -ls /in-dir-10g-noah
14/10/10 16:10:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 32 items
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00000
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:07 /in-dir-10g-noah/part-m-00001
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00002
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00003
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00004
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:07 /in-dir-10g-noah/part-m-00005
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00006
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00007
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00008
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:07 /in-dir-10g-noah/part-m-00009
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00010
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00011
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00012
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00013
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00014
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00015
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00016
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00017
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00018
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00019
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00020
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00021
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00022
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00023
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00024
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00025
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00026
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00027
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00028
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00029
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00030
-rw-r--r--   2 hadoop  312500000 2014-10-10 16:08 /in-dir-10g-noah/part-m-00031
Actions #14

Updated by Noah Watkins over 9 years ago

  • Status changed from New to Closed

Fixed in cephfs-hadoop repo.

Actions #15

Updated by Dmitry Buzz about 8 years ago

I'm testing ceph version 10.0.3-1325-g98fba62 on Hadoop 3.0.0, and I see the same error in the Hadoop terasort.
I'm not sure whether this is a Hadoop or a Ceph issue, but I can upload logs if the Ceph developers want to pursue it further.

Actions #16

Updated by Greg Farnum about 8 years ago

Dmitry, can you give more detail on your environment? Which Hadoop, and which bindings did you use to connect Ceph and Hadoop?

(Also, ping Noah.)

Actions #17

Updated by Dmitry Buzz about 8 years ago

Greg,

My setup was simple: the Ceph JNI bindings were built from the Ceph repo.
The Ceph hadoop2 integration was built from here: https://github.com/GregBowyer/cephfs-hadoop.git using Maven. All .so and .jar files were exposed accordingly, and a simple file put/get worked OK, but the terasort benchmark showed this error.

For what it's worth, I think Ceph shouldn't chase Hadoop HDFS compatibility, and should instead design a proper interface to Spark RDDs and DataFrames that exposes data locality. That is a better approach than impersonating HDFS, since it decouples Ceph from the HDFS dependency; I'm not sure whether it is possible, though. I've also found in terasort tests that Ceph (x2 replication) is 25%-40% slower than plain HDFS (x2 replication), which came as a bit of a surprise to me. On the other hand, if the Ceph community wants to maintain proper integration with Hadoop, it should really take over cephfs-hadoop and keep it in sync with Hadoop versions. It takes an unimaginable amount of effort to integrate the two, since the Ceph site documentation and user list archives are outdated on this subject, and googling every possible hiccup along the way is not for the fainthearted.

The other route is to develop a data-locality standard for parallel data stores like Ceph and expose it to host processes in a universal manner, independent of the underlying data format (CSV, object storage, parallel HDF5, etc.). This is a daunting task, but without it Ceph is not an attractive option as underlying storage for parallel computation frameworks in the ML and data analytics domain.

Actions #18

Updated by Dmitry Buzz about 8 years ago

I forgot to mention: I tested both Hadoop 2.7.1 and 2.7.2, with the same outcome.
64-bit Arch Linux circa Jan 2015, on a 2-10 node cluster.

Actions #19

Updated by Greg Farnum almost 8 years ago

  • Component(FS) Hadoop/Java added
Actions #20

Updated by Patrick Donnelly about 5 years ago

  • Category deleted (48)
  • Labels (FS) Java/Hadoop added