hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8998) Small files storage supported inside HDFS
Date Thu, 03 Sep 2015 18:55:47 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729574#comment-14729574 ]

Andrew Wang commented on HDFS-8998:
-----------------------------------

Hi Yong, thanks for posting this doc. It's an interesting read. Have you looked at Ozone (HDFS-7240)?
I think it has similar design goals and feels more general. It has the concept of storage
containers that pack multiple blobs together for more efficient metadata management. I'm also
generally concerned about implementing our own compaction logic, considering there are local
DBs like LevelDB/RocksDB that can do this for us (as alluded to by Ozone). The design you proposed
sounds like it needs compaction to be coordinated by the NN, rather than offloaded to the
DNs. I think LevelDB/RocksDB would also handle concurrent writes better, without the concept
of "locked" and "unlocked" blocks.

Also, could you comment on the use case where you see the issues with # of files affecting
DNs before NNs? IIUC this design does not address NN memory consumption, which is the issue
we see first in practice.

A few other things it'd be nice to see in the doc:

* Goal # of files, expected size of a "small" file
* Any bad behavior if a large file is accidentally written to the small file zone?
* Support for rename into / out of small file zone?
* Is there a way to convert a bunch of small files into a compacted file, like with HAR? (See
the sketch after this list for what that looks like today.)
* How common is it for a user to know a priori that a bunch of small files will be written,
and is okay with putting them in a zone? A lot of the time I see this happening by accident,
either from a poorly written app or a misconfiguration.
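
Re: the HAR question above, for comparison this is roughly the read-only packing users can
already do with SequenceFiles (again an untested sketch; the paths and driver class are made
up, and it's a baseline, not what HDFS-8998 proposes):

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Hypothetical driver that rolls a directory of small files into one SequenceFile. */
public class PackSmallFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path srcDir = new Path(args[0]);   // directory of small files
    Path packed = new Path(args[1]);   // output container file

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(packed),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (FileStatus stat : fs.listStatus(srcDir)) {
        if (stat.isFile()) {
          // Key = original file name, value = file contents.
          byte[] data = new byte[(int) stat.getLen()];
          try (FSDataInputStream in = fs.open(stat.getPath())) {
            IOUtils.readFully(in, data, 0, data.length);
          }
          writer.append(new Text(stat.getPath().getName()), new BytesWritable(data));
        }
      }
    }
  }
}
{code}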

> Small files storage supported inside HDFS
> -----------------------------------------
>
>                 Key: HDFS-8998
>                 URL: https://issues.apache.org/jira/browse/HDFS-8998
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Yong Zhang
>            Assignee: Yong Zhang
>         Attachments: HDFS-8998.design.001.pdf
>
>
> HDFS has problems storing small files, as described in this blog post (http://blog.cloudera.com/blog/2009/02/the-small-files-problem).
> The post also suggests some ways to store small files in HDFS, but none of them is ideal;
> HAR files and SequenceFiles seem better suited to read-only files.
> Currently each HDFS block belongs to exactly one HDFS file, so with many small files the
> DataNodes end up holding many small blocks, which puts them under heavy load.
> This JIRA will describe how to merge small blocks into big ones online, how to delete small
> files, and so on.
> Currently we have many open JIRAs for improving HDFS scalability on the NameNode, such as
> HDFS-7836, HDFS-8286 and so on.
> So small-file metadata (INode and BlocksMap entries) will also remain in the NameNode.
> A design document will be uploaded soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
