hadoop-hdfs-issues mailing list archives

From "Arpit Agarwal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6482) Use block ID-based block layout on datanodes
Date Mon, 09 Jun 2014 19:01:10 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025559#comment-14025559 ]

Arpit Agarwal commented on HDFS-6482:

{{DFS_DATANODE_NUMBLOCKS_DEFAULT}} is currently 64. I am not sure why the default was set
so low. It would be good to know the reason before we change the behavior. It was quite possibly
an arbitrary choice.

After ~4 million blocks we would start putting more than 256 blocks in each leaf subdirectory,
and with every additional 4M blocks we'd add another 256 files to each leaf. I think this
is fine, since reaching 4 million blocks on a single datanode is itself very unlikely. I
recall that as late as Vista, NTFS directory listings would get noticeably slow with thousands
of files per directory. Is there any performance cost to always having three levels of
subdirectories, with each restricted to at most 256 children?
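The arithmetic behind the estimate above can be checked directly. Note the leaf count of
16,384 is inferred from the figures in the comment (4M blocks spread at ~256 per leaf), not
stated anywhere in the patch:

```java
public class LeafMath {
    // Files added to each leaf directory for a given block count,
    // assuming blocks are spread uniformly across the leaves.
    public static long filesPerLeaf(long totalBlocks, long leafDirs) {
        return totalBlocks / leafDirs;
    }

    public static void main(String[] args) {
        // 16,384 leaves is the count implied by the numbers in the comment.
        System.out.println(filesPerLeaf(4_000_000L, 16_384L));
    }
}
```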

- Who removes empty subdirectories when blocks are deleted?
- Let's avoid suffixing hex numerals to "subdir", for consistency with the existing naming.
- StringBuilder looks unnecessary in {{idToBlockDir}}.
- We should add a release note stating that {{DFS_DATANODE_NUMBLOCKS_DEFAULT}} is obsolete.
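To illustrate the last two points, here is a minimal sketch of what an {{idToBlockDir}}-style
mapping could look like without a StringBuilder. The two-level, 8-bit-per-level layout and
decimal "subdir" suffixes are assumptions for illustration; the patch may use a different
depth or bit mask:

```java
public class BlockLayout {
    // Sketch: derive a replica's relative directory from its block ID by
    // taking two 8-bit slices, giving 256 children at each level.
    // Assumed layout, not necessarily the one in the patch.
    public static String idToBlockDir(long blockId) {
        int d1 = (int) ((blockId >> 16) & 0xFF);
        int d2 = (int) ((blockId >> 8) & 0xFF);
        // Plain concatenation suffices for a fixed, small number of
        // segments; javac compiles this efficiently on its own.
        return "subdir" + d1 + "/subdir" + d2;
    }

    public static void main(String[] args) {
        System.out.println(idToBlockDir(0x0A0B0CL));
    }
}
```

Because the path is a pure function of the block ID, no per-replica location fields or
directory-splitting bookkeeping are needed, which is what makes removing LDir possible.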

The approach looks good and a big +1 for removing LDir.

> Use block ID-based block layout on datanodes
> --------------------------------------------
>                 Key: HDFS-6482
>                 URL: https://issues.apache.org/jira/browse/HDFS-6482
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.5.0
>            Reporter: James Thomas
>            Assignee: James Thomas
>         Attachments: HDFS-6482.1.patch, HDFS-6482.2.patch, HDFS-6482.patch
> Right now blocks are placed into directories that are split into many subdirectories
when capacity is reached. Instead we can use a block's ID to determine the path it should
go in. This eliminates the need for the LDir data structure that facilitates the splitting
of directories when they reach capacity as well as fields in ReplicaInfo that keep track of
a replica's location.
> An extension of the work in HDFS-3290.

This message was sent by Atlassian JIRA
