hadoop-hdfs-issues mailing list archives

From "Anu Engineer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7240) Object store in HDFS
Date Tue, 17 May 2016 03:25:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285948#comment-15285948 ]

Anu Engineer commented on HDFS-7240:

[~andrew.wang] Thank you for your comments. They are well thought out and extremely valuable
questions. I will make sure that all the areas you are asking about are discussed in the next
update of the design doc.

bq. Anu said he'd be posting a new design doc soon to address my questions.
I am working on that, but just to make sure your questions are not lost in the big picture
of the design doc, I am answering them individually here.
bq. My biggest concern is that erasure coding is not a first-class consideration in this system.
Nothing in ozone prevents a chunk from being EC encoded. In fact, ozone makes no assumptions about
the location or the types of chunks at all, so it is quite trivial to create a new chunk type
and write chunks of that type into containers. We are focused on the overall picture of ozone
right now, and I would welcome any contribution you can make on EC and ozone chunks if that is
a concern that you would like us to address earlier. From the architecture point of view I do
not see any issues.

bq. Since LevelDB is being used for metadata storage and separately being replicated via Raft,
are there concerns about metadata write amplification?
Metadata is such a small slice of the information about a block. Really what you are saying is
that the block name and the hash for the block get written twice, once through the RAFT log and
a second time when RAFT commits this information. Since the data we are talking about is so
small, I am not worried about it at all.
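To put rough numbers on this, here is a back-of-the-envelope sketch. The sizes are assumptions picked for illustration, not measurements from ozone, and the function name is hypothetical. It models only the double write described above (log plus commit) and ignores any write amplification inside the storage engine itself.

```python
def metadata_overhead(block_bytes: int, metadata_bytes: int, writes: int = 2) -> float:
    """Extra bytes written for metadata, as a fraction of the block data.

    writes=2 models the case above: once to the Raft log and once when the
    entry is committed to the metadata store.
    """
    return (metadata_bytes * writes) / block_bytes


# Assumed, illustrative sizes only: a 4 MB chunk with about 100 bytes of
# metadata (block name plus hash).
ratio = metadata_overhead(block_bytes=4 * 1024 * 1024, metadata_bytes=100)
print(f"writing the metadata twice adds {ratio:.6%} to the bytes written")
```

Under these assumed sizes, the doubled metadata write is a tiny fraction of the data written for the chunk itself.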
bq. Can we re-use the QJM code instead of writing a new replicated log implementation? QJM
is battle-tested, and consensus is a known hard problem to get right.
We considered this; however, the consensus is to write a *consensus protocol* that is easier
to understand and that makes it easy for more contributors to work on it. The fact that QJM was
not written as a library makes it very hard for us to pull it out in a clean fashion. Again,
if you feel very strongly about it, please feel free to move QJM to a library which can be
reused, and all of us will benefit from it.

bq. Are there concerns about storing millions of chunk files per disk? Writing each chunk
as a separate file requires more metadata ops and fsyncs than appending to a file. We also
need to be very careful to never require a full scan of the filesystem. The HDFS DN does full
scans right now (DU, volume scanner).
Nothing in the chunk architecture assumes that chunk files are separate files. The fact that
a chunk is a triplet \{FileName, Offset, Length\} gives you the flexibility to store 1000s
of chunks in a physical file.
bq. Any thoughts about how we go about packing multiple chunks into a larger file?
Yes, write the first chunk and then write the second chunk to the same file. In fact, chunks
are specifically designed to address the small file problem, so two keys can point to the same
file. For example,
KeyA -> \{File,0, 100\}
KeyB -> \{File,101, 1000\}
is a perfectly valid layout under the container architecture.
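A minimal sketch of this packing idea follows. It is not ozone code: `ChunkRef`, `append_chunk`, `read_chunk` and `demo` are hypothetical names. It only shows that a \{FileName, Offset, Length\} pointer is enough to store several keys' data in one physical file and read each one back.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical chunk pointer: the {FileName, Offset, Length} triplet."""
    file_name: str
    offset: int
    length: int


def append_chunk(path: str, data: bytes) -> ChunkRef:
    """Append data to the shared file and return a pointer to where it landed."""
    with open(path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)  # current end of file = start of this chunk
        f.write(data)
    return ChunkRef(path, offset, len(data))


def read_chunk(ref: ChunkRef) -> bytes:
    """Read back exactly the bytes that the pointer describes."""
    with open(ref.file_name, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


def demo() -> dict:
    """Pack two keys' chunks into one physical file and read them back."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "chunks.data")
        index = {
            "KeyA": append_chunk(path, b"a" * 100),
            "KeyB": append_chunk(path, b"b" * 1000),
        }
        # Chunks are appended back to back here, so KeyB starts where KeyA ends.
        return {k: (r.offset, r.length, read_chunk(r)) for k, r in index.items()}
```

Only one file is created regardless of how many keys are written, which is the property that matters for the small file problem.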
bq. Merges and splits of containers. We need nice large 5GB containers to hit the SCM scalability
targets. Together, these factors mean Ozone will likely be doing many more merges and splits
than HBase to keep the container size high
Ozone actively tries to avoid merges and tries to split only when needed. A container can
be thought of as a really large block, so I do not expect to see anything other than the
standard block workload on containers. The fact that containers can be split is something
that allows us to avoid pre-allocation of container space. That is merely a convenience, and
if you think of containers as blocks, you will see that it is very similar.
Ozone will never try to do merges and splits at the HBase level. From the container and ozone
perspective we are more focused on good data distribution across the cluster (what the
balancer does today), and containers are a flat namespace, just like blocks, which we allocate
when needed.
So once more, just to make sure we are on the same page: merges are rare (not generally required)
and splits happen if we want to re-distribute data on the same machine.
bq. What kind of sharing do we get with HDFS, considering that HDFS doesn't use block containers,
and the metadata services are separate from the NN? not shared?

Great question. We initially started off by attacking the scalability question of ozone and
soon realized that HDFS scalability and ozone scalability have to solve the same problems.
So the container infrastructure that we have built is something that can be used by both ozone
and HDFS. Currently we are focused on ozone, and containers will co-exist on datanodes with
blockpools. That is, ozone should be and will be deployable on a vanilla HDFS cluster. In the
future, if we want to scale HDFS, containers might be an easy way to do it.
bq. Any thoughts on how we will transition applications like Hive and HBase to Ozone? These
apps use rename and directories for synchronization, which are not possible on Ozone.
These applications are written with the assumption of a POSIX file system, so migrating them
to Ozone does not make much sense. The work we are doing in ozone, especially the container layer,
is useful in HDFS, and these applications might benefit indirectly.
bq. Have you experienced data loss from independent node failures, thus motivating the need
for copysets? 
Yes, we have seen this issue in the real world. We had to pick an algorithm for allocation,
and copysets offered a set of advantages with very minimal work, hence the allocation choice.
If you look at the paper you will see this was done on HDFS itself with very minimal cost,
and also this work originally came from Facebook. So it is useful in both normal HDFS and
in RAMCloud.
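For readers who have not seen the paper, here is a simplified sketch of its permutation-based scheme. It is not the ozone allocator, and `build_copysets` and `place_replicas` are hypothetical names. The idea is to shuffle the nodes a few times, cut each shuffle into groups of R nodes (the copysets), and always place all replicas of a block inside one copyset that contains the primary.

```python
import random
from typing import FrozenSet, List, Set


def build_copysets(nodes: List[str], replication: int, scatter_width: int,
                   seed: int = 0) -> Set[FrozenSet[str]]:
    """Permutation phase: about scatter_width / (replication - 1) shuffles,
    each cut into consecutive groups of `replication` nodes."""
    rng = random.Random(seed)
    permutations = -(-scatter_width // (replication - 1))  # ceiling division
    copysets: Set[FrozenSet[str]] = set()
    for _ in range(permutations):
        order = list(nodes)
        rng.shuffle(order)
        # Leftover nodes that do not fill a whole group are skipped in this shuffle.
        for i in range(0, len(order) - replication + 1, replication):
            copysets.add(frozenset(order[i:i + replication]))
    return copysets


def place_replicas(primary: str, copysets: Set[FrozenSet[str]],
                   rng: random.Random) -> List[str]:
    """Replication phase: pick one copyset that contains the primary node.

    Raises IndexError if the primary is in no copyset (possible when the
    node count is not a multiple of the replication factor).
    """
    candidates = sorted(sorted(c) for c in copysets if primary in c)
    return rng.choice(candidates)


nodes = [f"dn{i}" for i in range(9)]
copysets = build_copysets(nodes, replication=3, scatter_width=4)
print(place_replicas("dn0", copysets, random.Random(1)))
```

Because replicas only ever land on a small number of fixed node groups, far fewer combinations of simultaneous node failures can destroy every copy of a block than with fully random placement.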
bq. It's also not clear how this type of data placement meshes with EC, or the other quite
sophisticated types of block placement we currently support in HDFS
The chunks will support remote blocks. That is a notion that we did not discuss in the
presentation due to time constraints. In the presentation we just showed chunks as files on the
same machine, but a chunk could exist on an EC block and a key could point to it.
bq. How do you plan to handle files larger than 5GB?
Large files will live on a set of containers. Again, the easiest model to reason about containers
is to think of them as blocks. In other words, just like normal HDFS.
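As an illustration of the block analogy, the sketch below cuts a large file into container-sized pieces the same way HDFS cuts a file into blocks. It is a simplification: it assumes each piece may fill up to a whole container, and `split_into_containers` is a hypothetical helper, not an ozone API.

```python
from typing import List, Tuple

CONTAINER_SIZE = 5 * 1024**3  # 5 GB, the container size mentioned in this thread


def split_into_containers(file_size: int,
                          container_size: int = CONTAINER_SIZE
                          ) -> List[Tuple[int, int, int]]:
    """Return (piece_index, file_offset, length) for each piece of the file."""
    pieces = []
    offset = 0
    index = 0
    while offset < file_size:
        # The last piece may be shorter than a full container.
        length = min(container_size, file_size - offset)
        pieces.append((index, offset, length))
        offset += length
        index += 1
    return pieces


# A 12 GB file needs three pieces: two full ones and a 2 GB tail.
for piece in split_into_containers(12 * 1024**3):
    print(piece)
```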
bq. Large files right now are also not spread across multiple nodes and disks, limiting IO
Are you saying that when you write a file of, say, 256 MB you would be constrained to the same
machine since the block size is much larger? That is why we have chunks. With chunks we can
easily distribute this information and leave pointers to those chunks. I agree with the concern,
and we might have to tune the placement algorithms once we see more real-world workloads.

 bq. Are all reads and writes served by the container's Raft master? IIUC that's how you get
strong consistency, but it means we don't have the same performance benefits we have now in
HDFS from 3-node replication.
Yes, and I don't think it will be a bottleneck, since we are only reading and writing *metadata*
(which is very small), and data can be read from any machine.
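A toy model of that read path is sketched below. It is purely illustrative: `ContainerGroup` and `read_key` are hypothetical names, and the in-memory dicts stand in for real datanodes. The small key-to-chunk lookup goes to the Raft leader, and the bulk bytes are then fetched from whichever replica is chosen.

```python
import random
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ContainerGroup:
    """Toy stand-in for one replicated container (all names are hypothetical)."""
    leader: str                            # node that serves metadata lookups
    metadata: Dict[str, str]               # key -> chunk id, kept consistent via the leader
    replicas: Dict[str, Dict[str, bytes]]  # node name -> {chunk id: chunk data}


def read_key(group: ContainerGroup, key: str,
             rng: random.Random) -> Tuple[str, str, bytes]:
    """Resolve the key at the leader, then read the data from any replica."""
    chunk_id = group.metadata[key]             # small, strongly consistent lookup
    node = rng.choice(sorted(group.replicas))  # bulk read can use any copy
    return group.leader, node, group.replicas[node][chunk_id]


chunks = {"chunk-1": b"payload"}
group = ContainerGroup(
    leader="dn1",
    metadata={"KeyA": "chunk-1"},
    replicas={"dn1": dict(chunks), "dn2": dict(chunks), "dn3": dict(chunks)},
)
print(read_key(group, "KeyA", random.Random(0)))
```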

bq. I also ask that more of this information and decision making be shared on public mailing
lists and JIRA.

Absolutely, we have been working actively and posting code patches to the ozone branch. Feel free
to ask us anything. It is kind of sad that we have to wait for an ApacheCon for someone to
ask me these questions. I would encourage you to follow the JIRAs tagged with ozone to keep
up with what is happening in the ozone world. I can assure you all the work we are doing is
always tagged with _ozone:_ so that when you get the mail from JIRA it is immediately visible.

bq. The KSM is not mentioned in the architecture document.
The namespace management component did not have a separate name; it was called SCM in the original
arch. doc. Thanks to [~arpitagarwal], who came up with this name (just like ozone) and argued
that we should make it separate, which will make it easy to understand and maintain. If
you look at the code you will see KSM functionality is currently in SCM. I am planning to move
it out to KSM.

bq. the fact that the Ozone metadata is being replicated via Raft rather than stored in containers.

Metadata is stored in containers, but containers need replication. Replication via RAFT
was something that the design evolved to, and we are surfacing that to the community. Like all
proposals under consideration, we will update the design doc and listen to community feedback.
You just had a chance to hear our design a little earlier, in person, since we got a chance
to meet. Writing documents takes far more energy and time than chatting face to face. We will
soon update the design docs.

bq. I was not aware that there is already progress internally at Hortonworks on a Raft implementation.
We have been prototyping various parts of the system. Including KSM, SCM and other parts of

bq. We've previously expressed interest in being involved in the design and implementation
of Ozone, but we can't meaningfully contribute if this work is being done privately.
I feel that this is a completely uncalled-for statement and a misplaced sentiment here. We have
54 JIRAs on ozone so far. You are always welcome to ask questions or comment. I have personally
posted a 43-page document explaining what ozone would look like to users, how they can manage
it and the REST interfaces offered by Ozone. Unfortunately I have not seen much engagement. While
I am sorry that you feel left out, I do not see how you can attribute it to the engineers who
work on ozone. We have been more than liberal about sharing internals, including this specific
presentation we are discussing here. So I am completely lost when you say that this work is
being done privately. In fact, I would request you to be active on ozone JIRAs, start by asking
questions, make comments (just like you did with this one, except for this last comment), and
I would love working with you on ozone. It is never our intent to abandon our fellow contributors
in this journey, but unless you participate in JIRAs it is very difficult for us to know that
you have more than a fleeting interest.

> Object store in HDFS
> --------------------
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Ozone-architecture-v1.pdf, ozone_user_v0.pdf
> This jira proposes to add object store capabilities into HDFS. 
> As part of the federation work (HDFS-1052) we separated block storage as a generic storage
layer. Using the Block Pool abstraction, new kinds of namespaces can be built on top of the
storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode storage, but
independent of namespace metadata.
> I will soon update with a detailed design document.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
