CephFS - Hadoop Support


Overview of the current status of Hadoop support on Ceph, what we are working on now, and the development roadmap.


  • Noah Watkins (RedHat, UCSC)

Current Status

Results from HCFS Test Suite

The HCFS tests are now in hadoop-common. We are running them against our cephfs-hadoop bindings and have been squashing bugs for the past couple of weeks. This is the current state of issues:

HCFS Resources


  • Tests run: 61, Failures: 3, Errors: 1, Skipped: 4
  • Errors:
  • Skipped:
    • File concatenation API
      • void concat(final Path target, final Path[] sources)
      • This is a little-used operation currently implemented only by HDFS.
      • Support with a simple re-write hack
      • Optimized CephFS support?
    • ?Root directory tests
      • ?libcephfs bug rmdir("/")
      • #9935
  • Failures:
    • testRenameFileOverExistingFiles
    • testRenameFileNonexistentDir?
      • Rename semantics for HCFS are complicated.
      • Is rename in Ceph atomic?
        • According to HCFS we only need the core rename op to be atomic, and the rest of semantics can be emulated in our binding.
    • testNoMkdirOverFile?
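
Two of the items above, the "simple re-write hack" for concat and emulating rename semantics on top of an atomic core rename, can be sketched against a local file system. This is only an illustration in Python, not the cephfs-hadoop binding: the function names are hypothetical, os.rename stands in for the atomic core operation, and the rename checks are a simplified, HDFS-like subset of the HCFS rules (assumed here: fail when the destination file exists or its parent is missing, move under the destination when it is a directory). Note that the checks and the rename together are not atomic.

```python
import os
import shutil


def concat_by_rewrite(target, sources):
    """Emulate concat by appending each source to target, then deleting the sources."""
    with open(target, "ab") as out:
        for src in sources:
            with open(src, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
    # Sources are removed only after every one of them has been copied.
    for src in sources:
        os.remove(src)


def hcfs_style_rename(src, dst):
    """Wrap an atomic core rename with HDFS-like precondition checks (simplified).

    Returns True on success and False where the rename is refused.
    """
    if not os.path.lexists(src):
        raise FileNotFoundError(src)
    # Renaming onto an existing directory moves the source underneath it.
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(src))
    # Refuse to overwrite an existing entry.
    if os.path.lexists(dst):
        return False
    # Refuse when the destination's parent directory does not exist.
    parent = os.path.dirname(dst) or "."
    if not os.path.isdir(parent):
        return False
    os.rename(src, dst)  # the core operation that must be atomic
    return True
```

The rewrite approach costs a full copy of every source, which is why an optimized CephFS-side concat remains an open question.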

BigTop/ceph-qa-suite Tests

  • Not completed, supposedly very easy
  • Integration
    • ceph-qa-suite
    • Jenkins?

Clock Sync

  • I haven't seen this issue come up in a long time
  • #1666

Snapshots and Quotas

Haven't investigated the Ceph side of this. There are documents describing HDFS behavior for reference.

Client Shutdown Woes

When processes using libcephfs exit without first unmounting, other clients may experience delays (e.g. `ls`) waiting for timeouts to expire. There are a few scenarios that we've run into.

Scenario 1

Some processes just don't shut down cleanly. These are relatively easy to identify on a case-by-case basis. For instance, this appears to be true for MRAppMaster, and there is an open bug report for it. Generally, file systems are closed automatically unless explicit control is requested. This hasn't been an issue.

Scenario 2

  1. Map tasks finish, broadcast success
  2. Simultaneously
    1. SIGTERM->map tasks, 250ms delay, SIGKILL->map tasks
    2. Application master examines file system to verify success

In this scenario SIGTERM invokes file system clean-up (i.e. libcephfs unmount) on all the clients, but 250ms is not long enough for libcephfs to finish unmounting. The result is that the application master hangs for about 30 seconds. The solution is to increase the delay before SIGKILL is sent.
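
The SIGTERM, delay, SIGKILL sequence can be reproduced outside of YARN with a small sketch. Everything here is a stand-in: the child process simulates a task whose SIGTERM handler takes a configurable time to "unmount", and start_task and stop_with_grace are hypothetical helper names. In YARN the delay is, as far as I know, controlled by yarn.nodemanager.sleep-delay-before-sigkill.ms; treat that property name as an assumption and check it against your version. POSIX only.

```python
import subprocess
import sys

# Hypothetical stand-in for a task whose SIGTERM handler performs a slow unmount.
CHILD = r"""
import signal, sys, time

def on_sigterm(signum, frame):
    time.sleep(float(sys.argv[1]))  # simulated unmount cost
    sys.exit(0)

signal.signal(signal.SIGTERM, on_sigterm)
print("ready", flush=True)
time.sleep(60)
"""


def start_task(unmount_seconds):
    """Start the simulated task and wait until its SIGTERM handler is installed."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(unmount_seconds)],
        stdout=subprocess.PIPE,
    )
    proc.stdout.readline()  # blocks until the child prints "ready"
    return proc


def stop_with_grace(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL.

    Returns True if the task exited on its own within the grace period.
    """
    proc.terminate()  # SIGTERM
    try:
        proc.wait(timeout=grace_seconds)
        clean = True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: the simulated unmount is cut off
        proc.wait()
        clean = False
    proc.stdout.close()
    return clean
```

Calling stop_with_grace(start_task(0.5), 0.05) models a grace period shorter than the unmount cost, so the task is killed mid-unmount; raising the grace period above the unmount cost lets it exit cleanly.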

Curiously, it doesn't appear that libcephfs clients need to fully unmount; they only need to make it far enough through the process. Even when the processes are given a 30 second delay before SIGKILL (this is in YARN), many of the ceph client logs are truncated within ceph_unmount, so it appears they are exiting or being killed through another path.


Scenario 3

This is really a generalization of the previous scenario: it occurs whenever a task can't reach ceph_unmount, for whatever reason.
  1. YARN wants to kill a task that has mounted ceph, sends SIGTERM
  2. The task being killed isn't able to invoke shutdown within the delay before SIGKILL?
Some cases I've seen recently:
  1. Client stuck in fsync for 40 seconds due to laggy OSDs
    1. CephFS-Java prevents ceph_unmount from racing with other operations
      1. Perhaps this should cause other threads to abort their operations
  2. They could be stuck due to other clients' unclean shutdown
    1. Some sort of general cascading problem
  3. But could generally be stuck for any reason
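
The interaction between an in-flight operation and unmount can be sketched with a small guard object. This is not how CephFS-Java is implemented; MountGuard and its methods are hypothetical names, and the sketch only illustrates the idea of unmount waiting for active operations with a bounded timeout instead of blocking indefinitely.

```python
import threading


class MountGuard:
    """Track in-flight operations so that unmount does not race with them."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0
        self._closing = False

    def begin_op(self):
        """Call before each file system operation; refused once unmount has started."""
        with self._cond:
            if self._closing:
                raise RuntimeError("file system is unmounting")
            self._active += 1

    def end_op(self):
        """Call when the operation completes, successfully or not."""
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def unmount(self, timeout):
        """Block new operations and wait up to timeout seconds for active ones.

        Returns True if all operations drained, False if the wait timed out.
        """
        with self._cond:
            self._closing = True
            return self._cond.wait_for(lambda: self._active == 0, timeout)
```

A client stuck in fsync corresponds to begin_op without a matching end_op: unmount then returns False after the timeout, and the caller has to decide whether to abort the stuck operation or give up on a clean shutdown.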

Take Aways

  • Always prefer clients to shut down cleanly
    • Through normal process exit paths
    • Asynchronously from signal (SIGTERM + delay + SIGKILL)
      • Shorter (bounded?) unmount cost
    • Process stuck in libcephfs
      • ?Unmount can force clean up threads?
  • Forced exit without reaching unmount
    • Maybe not a common case, no big deal
    • How to avoid cascading problems


  • Doesn't appear to define any sort of semantics for closing the file system, which suggests that all the important things are handled by the semantics of file.close/file.flush.
  • In the process of clarifying these points

Next Steps

  • Finishing with HCFS bugs
  • 30+ OSD cluster for performance tests
    • Profiling
  • HDFS as baseline vs libcephfs benchmark tool...
    • fio backend?
