Feature #3730

Support replication factor in Hadoop

Added by Noah Watkins about 11 years ago. Updated about 5 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS): Hadoop/Java
Labels (FS): Java/Hadoop
Pull request ID:

Description

In order to support per-file replication values in Hadoop we need to specify that a new file should be generated in a data pool configured with the desired replication factor.

We'll expand the Hadoop configuration to support a mapping of [int] -> [string] that maps each replication factor to a pool name in the Ceph installation.

There are three cases (a sketch follows the list):

  1. No configuration is given: always use the default pool.
  2. Configuration is given and an exact replication factor match is found: use that pool.
  3. Configuration is given and only an inexact match is found: use the closest match, rounding up.
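
For illustration only, here is a minimal Java sketch of that selection rule, assuming a hypothetical map from replication factor to pool name built from the Hadoop configuration (PoolSelector, selectPool, and poolsByReplication are made-up names, not the actual implementation):

import java.util.Map;
import java.util.TreeMap;

public class PoolSelector {
    // Hypothetical mapping: replication factor -> pool name, built from the Hadoop config.
    private final TreeMap<Integer, String> poolsByReplication = new TreeMap<Integer, String>();
    private final String defaultPool;

    public PoolSelector(Map<Integer, String> pools, String defaultPool) {
        this.poolsByReplication.putAll(pools);
        this.defaultPool = defaultPool;
    }

    public String selectPool(int replication) {
        if (poolsByReplication.isEmpty()) {
            return defaultPool;                    // case 1: no configuration given
        }
        String exact = poolsByReplication.get(replication);
        if (exact != null) {
            return exact;                          // case 2: exact match
        }
        // Case 3: round up to the closest configured pool with a larger replication
        // factor; falling back to the largest pool when none exists is an assumption here.
        Map.Entry<Integer, String> next = poolsByReplication.ceilingEntry(replication);
        return next != null ? next.getValue() : poolsByReplication.lastEntry().getValue();
    }
}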

History

#1 Updated by Noah Watkins about 11 years ago

  • Assignee set to Noah Watkins

#2 Updated by Noah Watkins about 11 years ago

Someone could toss a 'ceph osd pool set size' Hadoop's way, so a static mapping between pg pool size and pool name could be avoided with an extension to the libcephfs interface:

ceph_get_pool_id(name)
ceph_get_pool_replication(pool_id)

which'll let us handle any change dynamically.
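
As a hedged illustration of what that would buy us, here is a small Java sketch; CephPools is a hypothetical wrapper around the two proposed calls, not an existing binding, and the resolver logic is only one way the Hadoop side might use it:

// Hypothetical wrapper around the proposed libcephfs calls; real binding names
// and signatures may differ.
interface CephPools {
    long getPoolId(String name);          // wraps ceph_get_pool_id(name)
    int getPoolReplication(long poolId);  // wraps ceph_get_pool_replication(pool_id)
}

class DynamicPoolResolver {
    private final CephPools ceph;
    private final String[] configuredPools;  // e.g. the configured data pool names

    DynamicPoolResolver(CephPools ceph, String[] configuredPools) {
        this.ceph = ceph;
        this.configuredPools = configuredPools;
    }

    // Look up each pool's current replication at call time, so a later
    // 'ceph osd pool set <pool> size <n>' is picked up without touching the Hadoop config.
    String poolForReplication(int replication) {
        for (String name : configuredPools) {
            long id = ceph.getPoolId(name);
            if (ceph.getPoolReplication(id) == replication) {
                return name;
            }
        }
        return null;  // caller falls back to the default pool / round-up rule
    }
}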

#3 Updated by Noah Watkins about 11 years ago

Pool ids are currently exposed via libcephfs through ceph_file_layout, which uses a 32-bit integer for the pool id. However, OSDMap looks like it's using a 64-bit integer for the pool id. Where should a valid range be enforced here?

#4 Updated by Sam Lang about 11 years ago

Can we change the type in libcephfs to uint64? We're the only ones calling ceph_get_file_pool() right now as far as I know, so this shouldn't have a huge impact.

#5 Updated by Noah Watkins about 11 years ago

It looks like there is some mixed usage of int64 and int for the pool id in OSDMap, too. In Client::_create the pool id is enforced to fit in 32 bits before being sent off to the MDS, so using int64 (signed, to allow encoding errors) in libcephfs and keeping it bounded for now seems right. I'm not sure what the actual intention of the different types is. Maybe just historical?

#6 Updated by Sage Weil about 11 years ago

The move from int32 -> int64 was misguided, and incomplete. At this point it's not really worth the effort to move all the way one way or the other.

But for any userland code, let's stick with int64_t.

#7 Updated by Noah Watkins about 11 years ago

From stand-up: stick with int64_t for userspace, and enforce the 32-bit range.
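
For illustration, a minimal sketch of that range check, shown at the Java layer (the class and method names are made up; the actual check in wip-client-pool-api may live in the C/C++ code):

public final class PoolIdCheck {
    // Pool ids travel as int64_t in userspace, but must still fit in 32 bits
    // before being handed to the MDS (as enforced in Client::_create).
    public static int checkedPoolId(long poolId) {
        if (poolId < 0 || poolId > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("pool id out of 32-bit range: " + poolId);
        }
        return (int) poolId;
    }

    private PoolIdCheck() {}
}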

#8 Updated by Noah Watkins about 11 years ago

This interface update is up for review in wip-client-pool-api.

#9 Updated by Greg Farnum about 11 years ago

Sorry to back this up a little, but I can't recall — does using libcephfs automatically grant a user access to the RADOS functions? Because this particular one really belongs there instead if we can finagle it.

#10 Updated by Noah Watkins about 11 years ago

I don't think libcephfs will give up an instance of the rados client, if that's what you mean by grant access to rados. I think it makes sense to move the interface to librados, and the only argument against it might be that from libcephfs we can enforce that the pools be 'data pools' (ceph mds add_data_pool).

If this moves to librados, it seems like there is some opportunity for sharing connections, osdmap stuff, etc... perhaps libcephfs could export something like its rados client?

Now, assuming this moves to librados, there are a couple options moving forward on the Java front. I can begin the librados Java bindings, initially with limited functionality (e.g. the API being discussed). The other option is hacky and instantly deprecated. Either way it adds rados dependency to Java. Thoughts?

#11 Updated by Greg Farnum about 11 years ago

Oh right, libcephfs is not built on top of librados. Never mind, that's a whole different discussion we start occasionally and then delay until we have more time. ;)

#12 Updated by Noah Watkins about 11 years ago

In Client, osdmap is protected by client_lock? If so, the new version of the branch isn't broken.

#13 Updated by Sage Weil about 11 years ago

Noah Watkins wrote:

In Client, osdmap is protected by client_lock? If so, the new version of the branch isn't broken.

It should be protected. A while back I fixed up several cases where it wasn't. If there are more such cases, that is a separate bug!

#14 Updated by Noah Watkins about 11 years ago

Sage Weil wrote:

If there are more such cases, that is a separate bug!

It was a bug I had introduced in wip-client-pool-api, and then fixed.

#15 Updated by Noah Watkins about 11 years ago

The initial set of tests is in the Hadoop tree and working; they still need to be added to the Teuthology suite. There are now two newly named tests, and the previous test is gone. Configuration is now done through a normal Hadoop configuration XML file.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>ceph.conf.file</name>
  <value>/home/nwatkins/projects/ceph/ceph/src/ceph.conf</value>
</property>

<property>
  <name>ceph.data.pools</name>
  <value>data,dp1,dp2,dp8</value>
</property>

</configuration>

The configuration file is specified through a Java system property:

-Dhadoop.conf.file=/path/to/conf.xml
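
As a hedged sketch of how such a file might be consumed (this is not the actual test code; reading the system property this way is an assumption), Hadoop's Configuration API can load the file and read the two properties back like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfExample {
    public static void main(String[] args) {
        // Load the site file named by the -Dhadoop.conf.file system property.
        Configuration conf = new Configuration();
        conf.addResource(new Path(System.getProperty("hadoop.conf.file")));

        String cephConf = conf.get("ceph.conf.file");          // path to ceph.conf
        String[] pools  = conf.getStrings("ceph.data.pools");  // e.g. data, dp1, dp2, dp8

        System.out.println("ceph.conf.file = " + cephConf);
        for (String pool : pools) {
            System.out.println("data pool: " + pool);
        }
    }
}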

These are the new tests:

1. TestCephDefaultReplication (should not specify any data pools)
2. TestCephCustomReplication (must specify data pools)

Here is how to set up new pools for the tests.

Adding pool name=dp8, replication=8

ceph osd pool create <name> <pg_num>
ceph osd pool set dp8 size 8

Get pool number

ceph osd dump

pool 0 'data' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 8 pgp_num 8 last_change 1 owner 0
pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 8 pgp_num 8 last_change 1 owner 0
pool 3 'dp1' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 500 pgp_num 500 last_change 14 owner 0
pool 4 'dp2' rep size 5 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 500 pgp_num 500 last_change 13 owner 0
pool 5 'dp8' rep size 8 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 500 pgp_num 500 last_change 12 owner 0

Set data pool

ceph mds add_data_pool <pool id>

#16 Updated by Greg Farnum about 11 years ago

  • Status changed from New to In Progress

Isn't this now merged or something? :)

#17 Updated by Noah Watkins about 11 years ago

Yes. I should have closed this and opened a separate ticket for the tests. I'm planning to close it as soon as Joe wires up the replication tests into the teuth script.

#18 Updated by Noah Watkins about 11 years ago

  • Status changed from In Progress to Closed

Woot.

#19 Updated by Greg Farnum over 7 years ago

  • Component(FS) Hadoop/Java added

#20 Updated by Patrick Donnelly about 5 years ago

  • Category deleted (48)
  • Labels (FS) Java/Hadoop added
