hadoop-hdfs-issues mailing list archives

From "Anu Engineer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7240) Object store in HDFS
Date Mon, 06 Jun 2016 19:16:21 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317035#comment-15317035 ]

Anu Engineer commented on HDFS-7240:

Hi [~eddyxu], thank you for reviewing the design doc and for your comments. Please see my
responses below.

bq. Since Ozone is decided to use range partition, how would key / data distribution achieve
balancing from initial state? For example, a user Foo runs Hive and creates 10GB of data,
these data are distributed to up to 6 (containers) DNs?

You bring up a very valid point. This was the most contentious issue in the Ozone world for a
while. We originally went with hash partitioning and a secondary index because of these
concerns. The issue with that approach (and it was rightly raised) was that the secondary index
is eventually consistent, which makes it hard to use. So we switched over to this scheme.

Our current thought is this: each container will report its size, number of operations,
and number of keys to SCM. This will allow SCM to balance the allocation of the key space.
So if you have a large number of reads and writes that are completely independent, they
will fill up the cluster/container space evenly.
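
As a rough illustration of the reporting idea above, here is a minimal sketch. The report
shape, the unweighted scoring, and the names ContainerReport, load_score and pick_container
are all assumptions made for illustration; the comment only says that containers report size,
number of operations, and number of keys to SCM, not how SCM combines them.

```python
from dataclasses import dataclass


@dataclass
class ContainerReport:
    # Hypothetical report shape: the three metrics named in the comment.
    container_id: str
    size_bytes: int
    num_operations: int
    num_keys: int


def load_score(report, capacity_bytes, max_ops, max_keys):
    # Assumed scoring: plain average of the three normalized metrics.
    return (
        report.size_bytes / capacity_bytes
        + report.num_operations / max_ops
        + report.num_keys / max_keys
    ) / 3.0


def pick_container(reports, capacity_bytes):
    # Choose the least-loaded container for the next key-space allocation.
    max_ops = max(1, max(r.num_operations for r in reports))
    max_keys = max(1, max(r.num_keys for r in reports))
    return min(
        reports,
        key=lambda r: load_score(r, capacity_bytes, max_ops, max_keys),
    )
```

A real allocator would also have to weigh the locality requirement, which this sketch
ignores.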

But we have an opposing requirement here: there is generally locality of access in the
namespace. So in most cases, if you are reading and writing to a bucket, it is most efficient
to keep that data together.

Now let us look at this specific case. If you have containers configured to, say, 2GB, then
10GB of data will map to 5 containers. These containers will be spread across a set of
machines by the SCM’s location-choosing algorithms.
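
The arithmetic behind the 5-container figure can be written out as a small sketch. It
assumes containers are filled completely and ignores replication and metadata overhead;
containers_needed is a hypothetical helper name.

```python
GB = 1024 ** 3


def containers_needed(data_bytes, container_bytes):
    # Ceiling division: a partially filled container still counts as one.
    return -(-data_bytes // container_bytes)


print(containers_needed(10 * GB, 2 * GB))  # 5
```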

bq. Would you explain what is the benefit of recovering failure pipeline by using a parallel
writes to all 3 containers? It is not very clear in the design.

The point I was trying to make is that the pipeline relies on quorum as defined by the RSM.
So if we decide to use this pipeline with RAFT, I was just pointing out that the pipeline
can be broken, and we will not attempt to heal it. Please let me know if this makes sense.

bq. How does ozone differentiate a recover write from a malicious (or buggy) re-write?

Thanks for flagging this. Right now we do not. We can always prevent it in the container
layer. It is a small extension to make: we can write to a temporary file and replace the
original if and only if the hashes match. I will file a work item to fix this.
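
One possible reading of the temporary-file idea is sketched below. It assumes the container
keeps a SHA-256 hash recorded for the block (expected_hash) and that blocks are plain files;
guarded_rewrite and sha256_of are hypothetical names, and this is not the actual container
code.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    # Hash the file in chunks so large blocks do not need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def guarded_rewrite(block_path, new_data, expected_hash):
    # Write to a temp file in the same directory so the final rename
    # stays on one filesystem.
    dir_name = os.path.dirname(os.path.abspath(block_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_data)
            tmp.flush()
            os.fsync(tmp.fileno())
        if sha256_of(tmp_path) != expected_hash:
            return False  # Not a faithful recovery write; keep the original.
        os.replace(tmp_path, block_path)  # Replace only on a hash match.
        return True
    finally:
        # Clean up the temp file if it was not renamed into place.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With this shape, a recovery write that carries the same bytes succeeds, while a re-write
with different content is rejected and the original block is left untouched.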

bq. You mentioned that KMS/SCM separation is for future scalability. Do KMS / SCM maintains
1:1, 1:n or n:m relationship? Though it is not in this phase. I'd like to know whether it
is considered. Btw, they are also Raft replicated?

KSM:SCM has an n:m relationship, even though in the easiest deployment configuration it is
1:1. So yes, it is defined that way. They are always Raft replicated.

bq. The raft ring / leader is per-container?

Yes and no. Let me explain this a little more. If you think only in terms of the RAFT
protocol, then we have a RAFT leader per machine set. That is, we are going to have a leader
for 3 machines (assuming a 3-machine RAFT ring). Now let us switch over to a developer’s point
of view. Someone like me who is writing code against containers thinks strictly in terms of
containers. So from an Ozone developer's point of view, we have a RAFT leader for a container.
In other words, containers provide an abstraction that makes you think that the RAFT protocol
is for the container, whereas in reality it is a shared ring used by the many containers
that share those 3 machines. This might be something that we want to explore in greater depth
during the call.
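
The "leader per container" view over a shared ring can be sketched as a lookup keyed by the
machine set. Everything here is illustrative: RaftRing, RingRegistry and the datanode names
are hypothetical, and the leader is a fixed placeholder because real RAFT leader election is
not modeled.

```python
class RaftRing:
    def __init__(self, machines):
        self.machines = frozenset(machines)
        # Placeholder only: real RAFT elects a leader; we pick a fixed one.
        self.leader = sorted(self.machines)[0]


class RingRegistry:
    def __init__(self):
        self._rings = {}           # machine set -> shared RaftRing
        self._container_ring = {}  # container id -> RaftRing

    def place_container(self, container_id, machines):
        key = frozenset(machines)
        ring = self._rings.get(key)
        if ring is None:
            ring = RaftRing(key)
            self._rings[key] = ring
        self._container_ring[container_id] = ring
        return ring

    def ring_for(self, container_id):
        return self._container_ring[container_id]

    def leader_for(self, container_id):
        # What a container developer sees: "the leader of my container".
        return self.ring_for(container_id).leader

    def ring_count(self):
        return len(self._rings)
```

Two containers placed on the same three machines resolve to the same ring object, so the
per-container leader is really the leader of the shared ring.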

bq. For pipeline, say if we have a pipeline A->B->C, if the data writes successfully
on A->B, and the metadata Raft writes are succeed on B,C, IIUC, that is a What would be
the result for a read request sent to A or C?

I am going to walk through this in a little more detail, so that we are all on the same page.

What you are describing is a situation where the RAFT leader is either B or C (since RAFT
is an active-leader protocol). For the sake of this illustration, let us assume that we
are talking about two scenarios: one where data is written to the leader and another datanode,
and a second where data is written to the followers but not to the leader.

Let us look at both in greater detail.

Case 1: Data is written to Machine B (the leader) and Machine A. But when the RAFT commit
happens, Machine A is offline and RAFT data is written to Machine B and Machine C.

So we have a situation where B is the only machine with both metadata and data. We deal with
this issue in two ways. First, when the commit callback happens on C, C will check whether it
has the data block, and since it does not, it will attempt to copy that block from either B or A.

Also, when A's RAFT ring comes back up, it will catch up with the RAFT log, and the data is
already available on Machine A.

So in both cases, we replicate data and metadata as soon as we can. Now let us look at the
case where a client goes to C and asks for this data block before the copy is done. The
client will find that the read is a bit slow, since Machine C will copy the data from Machine
B or A, write it to its local storage, and then return the data.

Case 2: Data is written to 2 followers and the leader does not have the data block. The
workflow is identical; the leader will copy the block from another machine before returning it.

Case 3: I also want to illustrate an extreme case. Let us say a client did NOT write any data
blocks and attempted to write a key. The key will get committed, but the container will not be
able to find the data blocks at all. Since no data blocks were written by the client, the
copy attempt will fail, and the RAFT leader will learn that this is a block with no replicas.
This would be similar in nature to HDFS.
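
The read path across the three cases above can be sketched as "serve locally, otherwise copy
from a peer, otherwise report no replicas". This is a toy in-memory model under stated
assumptions: Node, connect and BlockNotFound are hypothetical names, the RAFT commit is
represented by a plain set, and nothing here reflects the actual datanode code.

```python
class BlockNotFound(Exception):
    """Metadata is committed but no machine holds the data block."""


class Node:
    def __init__(self, name):
        self.name = name
        self.blocks = {}        # block id -> bytes held locally
        self.committed = set()  # block ids whose metadata was committed
        self.peers = []

    def read(self, block_id):
        if block_id not in self.committed:
            raise KeyError(block_id)
        if block_id not in self.blocks:
            # Cases 1 and 2: metadata is here but the data is not yet.
            self._copy_from_peer(block_id)
        return self.blocks[block_id]

    def _copy_from_peer(self, block_id):
        for peer in self.peers:
            data = peer.blocks.get(block_id)
            if data is not None:
                self.blocks[block_id] = data  # write to local storage
                return
        # Case 3: the client never wrote any data blocks.
        raise BlockNotFound(block_id)


def connect(*nodes):
    # Make every node a peer of every other node.
    for n in nodes:
        n.peers = [p for p in nodes if p is not n]
```

In Case 1, a read sent to C before the background copy finishes pays for one extra copy
from B or A and then succeeds; in Case 3 the same path ends in BlockNotFound.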

bq. How to handle split (merge, migrate) container during writes?

I have made an eloquent argument about why we don't need to do merge in the first release
of Ozone. 

When a split is happening, the easiest way to deal with it is to pause the writes.

If you don't mind, could you please take a look at
http://schd.ws/hosted_files/apachebigdata2016/fc/Hadoop%20Object%20Store%20-%20Ozone.pdf -
slides 38-42. 
I avoided repeating that in the design doc, since it was already quite large. We can go over
this in detail if you like during the call.

bq. Since container size is determined by the space usage instead of # of keys, would that
result large performance variants on listing operation. 

You are absolutely right; it can cause variation in performance. The alternative we have is
to use hash partitioning with secondary indices. If you like, we can revisit hash/range
partitioning in the conf call.
Last time, we decided to have range as the primary method, but reserved the option of bringing
hash partitioning back at a later stage.
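
To put rough numbers on the listing concern, here is a small sketch of how many keys a
space-bounded container can hold. The 2GB container size is the example used earlier in this
comment; the object sizes are hypothetical, and metadata overhead is ignored.

```python
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3


def keys_per_container(container_bytes, avg_object_bytes):
    # With a size-based limit, the key count depends on the object size.
    return container_bytes // avg_object_bytes


print(keys_per_container(2 * GB, 4 * KB))    # 524288 keys of 4KB each
print(keys_per_container(2 * GB, 128 * MB))  # 16 keys of 128MB each
```

A listing over the first container touches several orders of magnitude more keys than a
listing over the second, which is the variance being discussed.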

> Object store in HDFS
> --------------------
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Ozone-architecture-v1.pdf, Ozonedesignupdate.pdf, ozone_user_v0.pdf
> This jira proposes to add object store capabilities into HDFS. 
> As part of the federation work (HDFS-1052) we separated block storage as a generic storage
layer. Using the Block Pool abstraction, new kinds of namespaces can be built on top of the
storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode storage, but
independent of namespace metadata.
> I will soon update with a detailed design document.
