hadoop-hdfs-issues mailing list archives

From "Anu Engineer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7240) Object store in HDFS
Date Fri, 10 Jun 2016 22:20:28 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325389#comment-15325389

Anu Engineer commented on HDFS-7240:

*Ozone meeting notes – Jun, 9th, 2016*

Attendees: Thomas Demoor, Arpit Agarwal, JV Jujjuri, Jing Zhao, Andrew Wang, Lei Xu, Aaron
Myers, Colin McCabe, Aaron Fabbri, Lars Francke, Stiwari, Anu Engineer

We started the discussion with how erasure coding will be supported in ozone. This was quite
a lengthy discussion, taking over half the meeting time. Jing Zhao explained the high-level
architecture and pointed to similar work done by Dropbox.

We then dove into the details of this problem, since we wanted to make sure that supporting
erasure coding will be easy and efficient in ozone.

Here are the major points:

SCM currently supports a simple replicated container. To support erasure coding, SCM will
have to return more than 3 machines; if, say, we were using a 6 + 3 model of erasure coding,
then a container is spread across nine machines. Once we modify SCM to support this model,
the container client will have to write data to those locations and update the RAFT state with
the metadata of this block.
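As a sketch of this placement change (the class and method names here are hypothetical, not the actual SCM API), the allocation step for an erasure-coded container must return data + parity distinct datanodes instead of the 3 used for simple replication:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of SCM-side placement for an EC container.
public class EcPlacement {
    // For a (data + parity) scheme such as 6 + 3, SCM must return
    // dataBlocks + parityBlocks distinct datanodes (nine for 6 + 3).
    public static List<String> allocate(List<String> healthyNodes,
                                        int dataBlocks, int parityBlocks) {
        int needed = dataBlocks + parityBlocks;
        if (healthyNodes.size() < needed) {
            throw new IllegalStateException("not enough datanodes for EC");
        }
        // Naive choice: take the first N healthy nodes. A real SCM would
        // also weigh rack topology and utilization.
        return new ArrayList<>(healthyNodes.subList(0, needed));
    }
}
```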

When a file read happens in ozone, the container client will go to KSM/SCM to find the container
to read the metadata from. The metadata will tell the client where the actual data resides,
and the client will reconstruct the data from the EC-coded blocks.
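For intuition on the reconstruction step, here is the simplest possible case: a single XOR parity block (RAID-5 style). HDFS erasure coding actually uses Reed-Solomon, which tolerates more failures, but the recovery idea is the same: combine surviving blocks with parity to rebuild a lost block.

```java
// Minimal sketch of EC reconstruction with single XOR parity.
// This is illustrative only; HDFS EC uses Reed-Solomon coding.
public class XorReconstruct {
    // parity = block0 ^ block1 ^ ... ^ blockN-1, byte-wise
    public static byte[] parity(byte[][] blocks) {
        byte[] p = new byte[blocks[0].length];
        for (byte[] b : blocks)
            for (int i = 0; i < p.length; i++) p[i] ^= b[i];
        return p;
    }

    // A lost data block is the XOR of the surviving blocks and the parity.
    public static byte[] recover(byte[][] surviving, byte[] parity) {
        byte[] lost = parity.clone();
        for (byte[] b : surviving)
            for (int i = 0; i < lost.length; i++) lost[i] ^= b[i];
        return lost;
    }
}
```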

We all agreed that getting EC done for ozone is an important goal, and to get to that objective,
we will need to get the SCM and KSM done first.

We also discussed how small files will cause an issue with EC especially since container would
pack lots of these together and how this would lead to requiring compaction due to deletes.

Eddy brought up the issue of making sure that data is spread evenly across the cluster. Currently
our plan is to maintain a list of machines based on container reports. The container reports
would contain the number of keys, the bytes stored, and the number of accesses to that container.
Based on this, SCM would be able to maintain a list that allows it to pick under-utilized
machines from the cluster, thus ensuring a good data spread. Andrew Wang pointed out that counting
I/O requests is not good enough and that we actually need the number of bytes read/written. That
is an excellent suggestion; we will modify container reports to carry this information and
use it in SCM's allocation decisions.
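A minimal sketch of that allocation decision, assuming a hypothetical container-report shape (the field and method names are illustrative, not the real report protobuf): order nodes by bytes of I/O, per Andrew's suggestion, rather than by request count, and let SCM allocate on the least-loaded machines first.

```java
import java.util.*;

// Hypothetical container report and node-ranking logic.
public class Placement {
    static class ContainerReport {
        final String node;
        final long bytesStored, bytesRead, bytesWritten;
        ContainerReport(String node, long stored, long read, long written) {
            this.node = node;
            this.bytesStored = stored;
            this.bytesRead = read;
            this.bytesWritten = written;
        }
        // Load metric: actual bytes of I/O, not a count of requests.
        long ioBytes() { return bytesRead + bytesWritten; }
    }

    // Nodes ordered least-loaded first, so SCM can place new containers
    // on under-utilized machines and keep the data spread even.
    static List<String> leastLoaded(List<ContainerReport> reports) {
        return reports.stream()
            .sorted(Comparator.comparingLong(ContainerReport::ioBytes)
                .thenComparingLong(r -> r.bytesStored))
            .map(r -> r.node)
            .collect(java.util.stream.Collectors.toList());
    }
}
```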

Eddy followed up with a question about how something like Hive would behave over ozone. Say Hive
creates a bucket, creates lots of tables, and after the work is done, deletes all the tables. Ozone
would have allocated containers to accommodate the overflowing bucket, so it is possible to
have many empty containers on an ozone cluster.

SCM is free to delete any container that does not have a key. This is because in the ozone
world, metadata exists inside a container. Therefore, if a container is empty, we know
that no objects (ozone volume, bucket or key) exist in that container. This gives us the freedom
to delete any empty container, and this is how containers would be removed in the ozone world.
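The deletion rule above can be sketched as follows (the class and method names are hypothetical; in practice the key counts would come from container reports):

```java
import java.util.*;

// Sketch of the empty-container deletion rule: because all metadata
// lives inside a container, a key count of zero means no object
// (ozone volume, bucket or key) references it, so SCM may reclaim it.
public class ContainerGc {
    static List<String> deletable(Map<String, Long> keyCountByContainer) {
        List<String> empty = new ArrayList<>();
        for (Map.Entry<String, Long> e : keyCountByContainer.entrySet()) {
            if (e.getValue() == 0L) {
                empty.add(e.getKey());  // safe to delete: holds no objects
            }
        }
        return empty;
    }
}
```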

Andrew Wang pointed out that it is possible to create thousands of volumes and map them to
a similar number of containers. He was worried that this would become a scalability bottleneck.
While this is possible, in reality it would only arise on a cluster containing nothing but
volumes, and KSM is free to map many ozone volumes to a single container. We agreed that if
this indeed becomes a problem, we can write a simple compaction tool for KSM which will move
all these volumes to a few containers. Then SCM's container deletion would kick in and clean
up the cluster.

We walked through all the scenarios for merging and concluded that, for v1, ozone can live
without supporting merges of containers.

Then Eddy pointed out that by switching from hash partitions to range partitions we have
introduced variability into list operations on a container. Since the reason for switching
to range partitions is not documented on the JIRA, we discussed the issue that caused us to
switch over.

The original design called for hash partitioning, with operations like list relying on a secondary
index. This would create an eventual-consistency model where you might create a key, but it
becomes visible in the namespace only after the secondary index is updated. Colin argued that it
is easier for our users to see consistent namespace operations. This is the core reason why we
moved to using range partitions.

However, range partitions do pose the issue that a bucket might be split across a large number
of containers, so the list operation does not have fixed-time guarantees. The worst case is
a bucket with thousands of 5 GB objects, which causes the bucket to be mapped over a large
set of containers. This implies that a list operation might have to read sequentially from
many containers to build the result.

We discussed many solutions to this problem:
•	In the original design, we had proposed separate metadata containers and data containers.
We can follow the same model, with the assumption that the data container and the metadata
container are on the same machine. Both Andrew and Thomas seemed to think that is a good idea.

•	Anu argued that this may not be an issue, since the datanode front ends would be able
to cache much of this information, as well as pre-fetch lists, since listing is a forward iteration.

•	Arpit pointed out that while this is an issue we need to tackle, we should first build
the system, measure it, and choose the appropriate solution based on data.

•	In an offline conversation after the call, Jitendra pointed out that this need not have
any performance impact: since each split point is well known in KSM, it is trivial to add hints
and caching in the KSM layer itself to address this issue. In other words, if the client wants
1000 keys and KSM's hint tells us we need to reach out to 3 containers to get that many keys,
we can issue parallel reads to all those containers.
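Jitendra's fan-out idea can be sketched as below, assuming the KSM hint has already identified which containers cover the requested range (all names are hypothetical; the submitted task stands in for a per-container RPC):

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of a parallel list over range-partitioned containers.
public class ParallelList {
    // containerKeys: per-container key lists that the KSM hint says
    // cover the requested range.
    static List<String> list(List<List<String>> containerKeys, int limit) {
        ExecutorService pool =
            Executors.newFixedThreadPool(containerKeys.size());
        List<Future<List<String>>> futures = new ArrayList<>();
        for (List<String> keys : containerKeys) {
            futures.add(pool.submit(() -> keys)); // stand-in for a container RPC
        }
        List<String> merged = new ArrayList<>();
        try {
            for (Future<List<String>> f : futures) {
                merged.addAll(f.get());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        Collections.sort(merged); // containers are range-partitioned
        return merged.subList(0, Math.min(limit, merged.size()));
    }
}
```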

While we agree that this is an issue we may eventually have to tackle in the ozone world,
we were not able to converge on an exact solution before we ran out of time.

ATM mentioned that we would benefit from getting together and doing some whiteboarding of ozone's
design, and we intend to do that soon.

This was a very productive discussion and I want to thank all the participants. It was a pleasure
talking to all of you.

Please feel free to add/edit these notes for completeness or corrections.

> Object store in HDFS
> --------------------
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Ozone-architecture-v1.pdf, Ozonedesignupdate.pdf, ozone_user_v0.pdf
> This jira proposes to add object store capabilities into HDFS. 
> As part of the federation work (HDFS-1052) we separated block storage as a generic storage
layer. Using the Block Pool abstraction, new kinds of namespaces can be built on top of the
storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode storage, but
independent of namespace metadata.
> I will soon update with a detailed design document.

This message was sent by Atlassian JIRA
