h1. CephFS - Hadoop Support

h3. Summary

Overview of the current status of Hadoop support on Ceph: what we are working on now, and the development roadmap.

h3. Owners

* Noah Watkins (Red Hat, UCSC)
* Name (Affiliation)
* Name

h3. Interested Parties

* Name (Affiliation)
* Name (Affiliation)
* Name

h3. Current Status

h4. Results from HCFS Test Suite

The HCFS tests are now in hadoop-common. We are running them against our cephfs-hadoop bindings and have been squashing bugs for the past couple of weeks. The current state of issues is summarized in the sections below.
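
For context, wiring a filesystem into the HCFS contract tests mostly amounts to subclassing AbstractFSContract and pointing it at a contract-options XML that declares which optional behaviors the filesystem claims to support. A minimal sketch of what that could look like for the cephfs-hadoop bindings, with hypothetical names (CephFSContract, contract/cephfs.xml) and the contract API recalled from memory:

<pre><code class="java">
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.contract.AbstractFSContract;

// Hypothetical contract binding for ceph:// (names are ours, not the
// project's). The contract-options XML is where the filesystem declares
// which optional semantics (atomic rename, append, concat, ...) it
// supports, and the tests skip or tighten themselves accordingly.
public class CephFSContract extends AbstractFSContract {

    private static final String CONTRACT_XML = "contract/cephfs.xml";

    public CephFSContract(Configuration conf) {
        super(conf);
        addConfResource(CONTRACT_XML);
    }

    @Override
    public String getScheme() {
        return "ceph";
    }

    @Override
    public FileSystem getTestFileSystem() throws IOException {
        // assumes fs.defaultFS points at the Ceph filesystem in the test config
        return FileSystem.get(getConf());
    }

    @Override
    public Path getTestPath() {
        return new Path("/test");
    }
}
</code></pre>

Each concrete test class (rename, create, mkdir, ...) then extends the corresponding abstract contract test (e.g. AbstractContractRenameTest) and returns this contract from createContract().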

h4. HCFS Resources

* Documents describing semantics
** https://github.com/apache/hadoop-common/tree/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem
** https://github.com/apache/hadoop/tree/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/contract
** https://issues.apache.org/jira/browse/HADOOP-9371

h4. Results

* Tests run: 61, Failures: 3, Errors: 1, Skipped: 4
* Errors:
** We reported a problem in HCFS (https://issues.apache.org/jira/browse/HADOOP-11244)
* Skipped:
** File concatenation API
*** void concat(final Path target, final Path[] sources)
*** This is a little-used operation currently implemented only by HDFS.
*** Support with a simple re-write hack (see the sketch after this list)
*** Optimized CephFS support?
** Root directory tests
*** libcephfs bug: rmdir("/")
*** #9935
* Failures:
** testRenameFileOverExistingFiles
** testRenameFileNonexistentDir
*** Rename semantics for HCFS are complicated.
*** Is rename in Ceph atomic?
**** According to HCFS we only need the core rename operation to be atomic; the rest of the semantics can be emulated in our binding (also sketched after this list).
** testNoMkdirOverFile
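
The "simple re-write hack" for concat mentioned above could look roughly like the following: stream each source onto the end of the target and delete the sources afterwards. This is only a sketch against the generic FileSystem API; it assumes the binding supports append(), and unlike HDFS concat it is neither atomic nor metadata-only, which is why an optimized CephFS path might still be worth pursuing.

<pre><code class="java">
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatRewrite {
    // Emulate FileSystem#concat(target, sources) by appending each source to
    // the target and then deleting the sources. Not atomic, and it copies all
    // of the data, but it satisfies the basic API.
    public static void concat(FileSystem fs, Path target, Path[] sources)
            throws IOException {
        try (FSDataOutputStream out = fs.append(target)) {
            for (Path src : sources) {
                try (FSDataInputStream in = fs.open(src)) {
                    IOUtils.copyBytes(in, out, 4096, false); // false: keep out open
                }
            }
        }
        for (Path src : sources) {
            fs.delete(src, false);
        }
    }
}
</code></pre>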
 
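On the rename side, if the core Ceph rename really is atomic, the extra checks exercised by the failing contract tests could plausibly be layered on in the binding. A rough, hypothetical sketch of such a wrapper; the exact return-false vs. throw behavior depends on the contract options, so treat it as illustrative only:

<pre><code class="java">
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameShim {
    // Wrap an (assumed atomic) core rename with the precondition checks the
    // HCFS contract expects: a missing source is an error, while a missing
    // destination directory or an existing destination file is a plain "no".
    public static boolean rename(FileSystem fs, Path src, Path dst) throws IOException {
        if (!fs.exists(src)) {
            throw new FileNotFoundException("rename source " + src + " not found");
        }
        Path parent = dst.getParent();
        if (parent != null && !fs.exists(parent)) {
            return false; // cf. testRenameFileNonexistentDir
        }
        if (fs.exists(dst) && fs.getFileStatus(dst).isFile()) {
            return false; // cf. testRenameFileOverExistingFiles
        }
        return fs.rename(src, dst); // core rename, assumed atomic in Ceph
    }
}
</code></pre>
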
h4. BigTop/ceph-qa-suite Tests

* Not completed yet, but supposedly very easy
* Integration
** ceph-qa-suite
** Jenkins?

h4. Clock Sync

* I haven't seen this issue come up in a long time
* #1666

h4. Snapshots and Quotas

We haven't investigated the Ceph side of this yet. There are documents describing the HDFS behavior for reference:
* https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html
* https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html

h4. Client Shutdown Woes

When processes using libcephfs exit without first unmounting, other clients may experience delays (e.g. a hanging `ls`) while waiting for timeouts to expire. There are a few scenarios that we've run into.

h4. Scenario 1

Some processes just don't shut down cleanly. These are relatively easy to identify on a case-by-case basis. For instance, it looks like this is true for MRAppMaster, and there is an open bug report for it (https://issues.apache.org/jira/browse/MAPREDUCE-6136). Generally the file systems will be closed automatically unless explicit control is requested, so this hasn't been an issue.
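
The "closed automatically" part is Hadoop's FileSystem cache closing cached instances from a JVM shutdown hook unless fs.automatic.close is set to false; in our case that close is what should drive the libcephfs unmount. A job that opts out of the automatic behavior has to close the filesystem itself, roughly like this (a sketch, not code from any of the bindings):

<pre><code class="java">
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExplicitClose {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Opt out of the automatic shutdown-hook close; it is now on us to
        // close the filesystem (and, through the binding, unmount libcephfs).
        conf.setBoolean("fs.automatic.close", false);

        FileSystem fs = FileSystem.get(conf);
        try {
            fs.mkdirs(new Path("/tmp/example")); // ... do real work here ...
        } finally {
            fs.close(); // should end up in ceph_unmount in the binding
        }
    }
}
</code></pre>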

h4. Scenario 2

# Map tasks finish, broadcast success
# Simultaneously:
## SIGTERM -> map tasks, 250 ms delay, SIGKILL -> map tasks
## Application master examines the file system to verify success

In this scenario SIGTERM invokes file system clean-up (i.e. libcephfs unmount) on all the clients, but the 250 ms delay isn't adequate for libcephfs to finish unmounting. The result is that the application master hangs for about 30 seconds. The solution is to increase the delay before SIGKILL is sent.
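
For reference, the knob in question appears to be the NodeManager's SIGTERM-to-SIGKILL grace period, yarn.nodemanager.sleep-delay-before-sigkill.ms (250 ms by default). Something along these lines in the NodeManagers' yarn-site.xml should give libcephfs enough room to unmount; the 5 second value is just a guess:

<pre><code class="xml">
<!-- Grace period between SIGTERM and SIGKILL for container processes.
     Default is 250 ms, which is not enough for libcephfs to unmount. -->
<property>
  <name>yarn.nodemanager.sleep-delay-before-sigkill.ms</name>
  <value>5000</value>
</property>
</code></pre>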

Curiously, it doesn't appear that libcephfs clients need to fully unmount; they only need to make it far enough through the process. Even when the processes are given a 30-second delay before SIGKILL (this is in YARN), many of the Ceph client logs are truncated within ceph_unmount, so it appears they are exiting or being killed through another path.

h4. Generalization

This is really a generalization of the previous scenario; it occurs whenever the task can't reach ceph_unmount, for _any_ reason:
# YARN wants to kill a task that has mounted Ceph, and sends SIGTERM
# The task being killed isn't able to invoke shutdown within the delay before SIGKILL

Some cases I've seen recently:
# Client stuck in fsync for 40 seconds due to laggy OSDs
## CephFS-Java prevents ceph_unmount from racing with other operations
### Perhaps this should cause other threads to abort their operations
# Clients could be stuck due to other clients' unclean shutdown
## Some sort of general cascading problem
# But they could generally be stuck for any reason

h4. Takeaways

* Always prefer that clients shut down cleanly
** Through normal process exit paths (see the shutdown-hook sketch below)
** Asynchronously from a signal (SIGTERM + delay + SIGKILL)
*** Shorter (bounded?) unmount cost
** Process stuck in libcephfs
*** Unmount can force threads to clean up?
* Forced exit without reaching unmount
** Maybe not a common case, no big deal
** How to avoid cascading problems?
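
One way to make the "normal process exit path" case more robust would be to register the unmount as a JVM shutdown hook, so that SIGTERM-driven exits still go through it. A sketch using the cephfs-java CephMount API, written from memory (method names and error handling should be double-checked against libcephfs-java):

<pre><code class="java">
import com.ceph.fs.CephMount;

public class CleanUnmount {
    public static void main(String[] args) throws Exception {
        final CephMount mount = new CephMount("admin");
        mount.conf_read_file("/etc/ceph/ceph.conf");
        mount.mount("/");

        // Unmount on normal exit *and* on SIGTERM-driven JVM shutdown, so
        // other clients don't have to wait for this session to time out.
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            public void run() {
                try {
                    mount.unmount();
                } catch (Exception e) {
                    // best effort; the MDS will eventually expire the session
                }
            }
        }));

        // ... do work against the mount ...
    }
}
</code></pre>

Note that this doesn't help the other half of the problem (a client stuck inside libcephfs); the hook would just block alongside it.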

h4. HCFS

* HCFS doesn't appear to define any semantics for closing the file system, which suggests that all the important things are handled by the semantics of file.close/file.flush.
* We are in the process of clarifying these points.

h3. Next Steps

* Finish off the remaining HCFS bugs
* 30+ OSD cluster for performance tests
** Profiling
* HDFS as a baseline vs. a libcephfs benchmark tool...
** fio backend?

h3. Work items

h4. Coding tasks

# Task 1
# Task 2
# Task 3

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3