h1. CephFS - Hadoop Support

h3. Summary

Overview of the current status of Hadoop support on Ceph: what we are working on now, and the development roadmap.

h3. Owners

* Noah Watkins (Red Hat, UCSC)
* Name (Affiliation)
* Name

h3. Interested Parties

* Name (Affiliation)
* Name (Affiliation)
* Name

h3. Current Status

h4. Results from HCFS Test Suite

The HCFS tests are now in hadoop-common. We are running them against our cephfs-hadoop bindings and have been squashing bugs for the past couple of weeks. The current state of issues is summarized in the sections below.
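
For context, wiring a filesystem into the HCFS contract tests mostly amounts to subclassing AbstractFSContract and pointing it at a contract-options XML that declares which optional behaviors the filesystem claims to support. A minimal sketch of what that could look like for the cephfs-hadoop bindings, with hypothetical names (CephFSContract, contract/cephfs.xml) and the contract API recalled from memory:

<pre><code class="java">
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.contract.AbstractFSContract;

// Hypothetical contract binding for ceph:// (names are ours, not the
// project's). The contract-options XML is where the filesystem declares
// which optional semantics (atomic rename, append, concat, ...) it
// supports, and the tests skip or tighten themselves accordingly.
public class CephFSContract extends AbstractFSContract {

    private static final String CONTRACT_XML = "contract/cephfs.xml";

    public CephFSContract(Configuration conf) {
        super(conf);
        addConfResource(CONTRACT_XML);
    }

    @Override
    public String getScheme() {
        return "ceph";
    }

    @Override
    public FileSystem getTestFileSystem() throws IOException {
        // assumes fs.defaultFS points at the Ceph filesystem in the test config
        return FileSystem.get(getConf());
    }

    @Override
    public Path getTestPath() {
        return new Path("/test");
    }
}
</code></pre>

Each concrete test class (rename, create, mkdir, ...) then extends the corresponding abstract contract test (e.g. AbstractContractRenameTest) and returns this contract from createContract().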

h4. HCFS Resources

* Documents describing semantics
** https://github.com/apache/hadoop-common/tree/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem
** https://github.com/apache/hadoop/tree/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/contract
** https://issues.apache.org/jira/browse/HADOOP-9371

h4. Results

* Tests run: 61, Failures: 3, Errors: 1, Skipped: 4
* Errors:
** We reported a problem in HCFS (https://issues.apache.org/jira/browse/HADOOP-11244)
* Skipped:
** File concatenation API
*** void concat(final Path target, final Path[] sources)
*** This is a little-used operation currently implemented only by HDFS.
*** Support with a simple re-write hack (see the sketch after this list)
*** Optimized CephFS support?
** Root directory tests
*** libcephfs bug: rmdir("/")
*** #9935
* Failures:
** testRenameFileOverExistingFiles
** testRenameFileNonexistentDir
*** Rename semantics for HCFS are complicated.
*** Is rename in Ceph atomic?
**** According to HCFS we only need the core rename operation to be atomic; the rest of the semantics can be emulated in our binding (also sketched after this list).
** testNoMkdirOverFile
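
The "simple re-write hack" for concat mentioned above could look roughly like the following: stream each source onto the end of the target and delete the sources afterwards. This is only a sketch against the generic FileSystem API; it assumes the binding supports append(), and unlike HDFS concat it is neither atomic nor metadata-only, which is why an optimized CephFS path might still be worth pursuing.

<pre><code class="java">
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatRewrite {
    // Emulate FileSystem#concat(target, sources) by appending each source to
    // the target and then deleting the sources. Not atomic, and it copies all
    // of the data, but it satisfies the basic API.
    public static void concat(FileSystem fs, Path target, Path[] sources)
            throws IOException {
        try (FSDataOutputStream out = fs.append(target)) {
            for (Path src : sources) {
                try (FSDataInputStream in = fs.open(src)) {
                    IOUtils.copyBytes(in, out, 4096, false); // false: keep out open
                }
            }
        }
        for (Path src : sources) {
            fs.delete(src, false);
        }
    }
}
</code></pre>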
 
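On the rename side, if the core Ceph rename really is atomic, the extra checks exercised by the failing contract tests could plausibly be layered on in the binding. A rough, hypothetical sketch of such a wrapper; the exact return-false vs. throw behavior depends on the contract options, so treat it as illustrative only:

<pre><code class="java">
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameShim {
    // Wrap an (assumed atomic) core rename with the precondition checks the
    // HCFS contract expects: a missing source is an error, while a missing
    // destination directory or an existing destination file is a plain "no".
    public static boolean rename(FileSystem fs, Path src, Path dst) throws IOException {
        if (!fs.exists(src)) {
            throw new FileNotFoundException("rename source " + src + " not found");
        }
        Path parent = dst.getParent();
        if (parent != null && !fs.exists(parent)) {
            return false; // cf. testRenameFileNonexistentDir
        }
        if (fs.exists(dst) && fs.getFileStatus(dst).isFile()) {
            return false; // cf. testRenameFileOverExistingFiles
        }
        return fs.rename(src, dst); // core rename, assumed atomic in Ceph
    }
}
</code></pre>
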
h4. BigTop/ceph-qa-suite Tests

* Not completed yet, but supposedly very easy
* Integration
** ceph-qa-suite
** Jenkins?

h4. Clock Sync

* I haven't seen this issue come up in a long time
* #1666

h4. Snapshots and Quotas

We haven't investigated the Ceph side of this yet. There are documents describing the HDFS behavior for reference:
* https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html
* https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html

h4. Client Shutdown Woes

When processes using libcephfs exit without first unmounting, other clients may experience delays (e.g. a hanging `ls`) while waiting for timeouts to expire. There are a few scenarios that we've run into.

h4. Scenario 1

Some processes just don't shut down cleanly. These are relatively easy to identify on a case-by-case basis. For instance, it looks like this is true for MRAppMaster, and there is an open bug report for it (https://issues.apache.org/jira/browse/MAPREDUCE-6136). Generally the file systems will be closed automatically unless explicit control is requested, so this hasn't been an issue.
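
The "closed automatically" part is Hadoop's FileSystem cache closing cached instances from a JVM shutdown hook unless fs.automatic.close is set to false; in our case that close is what should drive the libcephfs unmount. A job that opts out of the automatic behavior has to close the filesystem itself, roughly like this (a sketch, not code from any of the bindings):

<pre><code class="java">
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExplicitClose {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Opt out of the automatic shutdown-hook close; it is now on us to
        // close the filesystem (and, through the binding, unmount libcephfs).
        conf.setBoolean("fs.automatic.close", false);

        FileSystem fs = FileSystem.get(conf);
        try {
            fs.mkdirs(new Path("/tmp/example")); // ... do real work here ...
        } finally {
            fs.close(); // should end up in ceph_unmount in the binding
        }
    }
}
</code></pre>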

h4. Scenario 2

# Map tasks finish, broadcast success
# Simultaneously:
## SIGTERM -> map tasks, 250 ms delay, SIGKILL -> map tasks
## Application master examines the file system to verify success

In this scenario SIGTERM invokes file system clean-up (i.e. libcephfs unmount) on all the clients, but the 250 ms delay isn't adequate for libcephfs to finish unmounting. The result is that the application master hangs for about 30 seconds. The solution is to increase the delay before SIGKILL is sent.
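
For reference, the knob in question appears to be the NodeManager's SIGTERM-to-SIGKILL grace period, yarn.nodemanager.sleep-delay-before-sigkill.ms (250 ms by default). Something along these lines in the NodeManagers' yarn-site.xml should give libcephfs enough room to unmount; the 5 second value is just a guess:

<pre><code class="xml">
<!-- Grace period between SIGTERM and SIGKILL for container processes.
     Default is 250 ms, which is not enough for libcephfs to unmount. -->
<property>
  <name>yarn.nodemanager.sleep-delay-before-sigkill.ms</name>
  <value>5000</value>
</property>
</code></pre>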

Curiously, it doesn't appear that libcephfs clients need to fully unmount; they only need to make it far enough through the process. Even when the processes are given a 30-second delay before SIGKILL (this is in YARN), many of the Ceph client logs are truncated within ceph_unmount, so it appears they are exiting or being killed through another path.

h4. Generalization

This is really a generalization of the previous scenario; it occurs whenever the task can't reach ceph_unmount, for _any_ reason:
# YARN wants to kill a task that has mounted Ceph, and sends SIGTERM
# The task being killed isn't able to invoke shutdown within the delay before SIGKILL

Some cases I've seen recently:
# Client stuck in fsync for 40 seconds due to laggy OSDs
## CephFS-Java prevents ceph_unmount from racing with other operations
### Perhaps this should cause other threads to abort their operations
# Clients could be stuck due to other clients' unclean shutdown
## Some sort of general cascading problem
# But they could generally be stuck for any reason

h4. Takeaways

* Always prefer that clients shut down cleanly
** Through normal process exit paths (see the shutdown-hook sketch below)
** Asynchronously from a signal (SIGTERM + delay + SIGKILL)
*** Shorter (bounded?) unmount cost
** Process stuck in libcephfs
*** Unmount can force threads to clean up?
* Forced exit without reaching unmount
** Maybe not a common case, no big deal
** How to avoid cascading problems?
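
One way to make the "normal process exit path" case more robust would be to register the unmount as a JVM shutdown hook, so that SIGTERM-driven exits still go through it. A sketch using the cephfs-java CephMount API, written from memory (method names and error handling should be double-checked against libcephfs-java):

<pre><code class="java">
import com.ceph.fs.CephMount;

public class CleanUnmount {
    public static void main(String[] args) throws Exception {
        final CephMount mount = new CephMount("admin");
        mount.conf_read_file("/etc/ceph/ceph.conf");
        mount.mount("/");

        // Unmount on normal exit *and* on SIGTERM-driven JVM shutdown, so
        // other clients don't have to wait for this session to time out.
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            public void run() {
                try {
                    mount.unmount();
                } catch (Exception e) {
                    // best effort; the MDS will eventually expire the session
                }
            }
        }));

        // ... do work against the mount ...
    }
}
</code></pre>

Note that this doesn't help the other half of the problem (a client stuck inside libcephfs); the hook would just block alongside it.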

h4. HCFS

* HCFS doesn't appear to define any semantics for closing the file system, which suggests that all the important things are handled by the semantics of file.close/file.flush.
* We are in the process of clarifying these points.

h3. Next Steps

* Finish off the remaining HCFS bugs
* 30+ OSD cluster for performance tests
** Profiling
* HDFS as a baseline vs. a libcephfs benchmark tool...
** fio backend?

h3. Work items

h4. Coding tasks

# Task 1
# Task 2
# Task 3

h4. Build / release tasks

# Task 1
# Task 2
# Task 3

h4. Documentation tasks

# Task 1
# Task 2
# Task 3

h4. Deprecation tasks

# Task 1
# Task 2
# Task 3