hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6482) Use block ID-based block layout on datanodes
Date Tue, 03 Jun 2014 23:16:02 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017246#comment-14017246
] 

Colin Patrick McCabe commented on HDFS-6482:
--------------------------------------------

{code}
+  public static long getBlockIdFromBlockOrMetaFile(String blockOrMetaFile) {
+    long metaTry = getBlockId(blockOrMetaFile);
{code}

Rather than adding a new function, how about fixing {{getBlockId}} to work on either metadata
or data files?  It would require a new regular expression.  However, both meta and block files
begin with the block ID, followed by a non-numeric character (or the end of the file name),
so it shouldn't be too bad to write a regex for that.
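A rough sketch of what that unified regex might look like (the class name and error handling here are mine, not from the patch):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockIdSketch {
  // The block ID is the (possibly negative) run of digits right after
  // "blk_", terminated either by a non-numeric character (the
  // "_<genstamp>.meta" suffix of a meta file) or by the end of the name.
  private static final Pattern BLOCK_OR_META =
      Pattern.compile("^blk_(-?\\d+)(\\D.*)?$");

  // Works on both block file names (blk_<id>) and meta file names
  // (blk_<id>_<genstamp>.meta).
  public static long getBlockId(String fileName) {
    Matcher m = BLOCK_OR_META.matcher(fileName);
    if (!m.matches()) {
      throw new IllegalArgumentException("not a block or meta file: " + fileName);
    }
    return Long.parseLong(m.group(1));
  }
}
```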

{code}
    BLOCKID_BASED_LAYOUT(-55,
        "The block ID of a block uniquely determines its position in the " +
        "directory structure, obviating the need to keep per-block " +
        "directory information in memory.");
{code}

It might be better just to write "The block ID of a block uniquely determines its position
in the directory structure".  The rest of the descriptions are pretty short.

{code}
+    // If we are upgrading from a version older than the one where we introduced
+    // block ID-based layout AND we're working with the finalized directory,
+    // we'll need to upgrade from the old flat layout to the block ID-based one
+    if (oldLV > LayoutVersion.Feature.BLOCKID_BASED_LAYOUT.getInfo().
+        getLayoutVersion() && to.getName().equals(STORAGE_DIR_FINALIZED)) {
+      upgradeToIdBasedLayout = true;
{code}

This new layout applies to the rbw directory as well, right?

{code}
  File getDir() {
    try {
      return new IdBasedBlockDirectory(baseDir).getDirectory(getBlockId());
    } catch (IOException ioe) {
      return null; // won't happen since directory for this block already exists
    }
  }
{code}

It seems like it would be better to just have a static method or something in {{IdBasedBlockDirectory}}
that returned a File object.  "This can't happen" code is scary, especially when we're dealing
with filesystem operations like mkdir.

Remember that a File object can exist, even though the corresponding file on disk does not.
 The object just contains a path, basically.  So let's just return that File from somewhere
and skip the mkdir.
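For illustration, a static helper along those lines might look like this (the class name, method name, and the particular bit split are hypothetical) -- note that it only constructs a path and never calls mkdir:

```java
import java.io.File;

public class IdBasedPathSketch {
  // Compute where a block with this ID would live under baseDir without
  // touching the disk: the returned File is just a path, and the
  // directories it names may not exist yet. The bit split here is an
  // arbitrary example.
  public static File getBlockDirectory(File baseDir, long blockId) {
    int d1 = (int) ((blockId >> 8) & 0xff);
    int d2 = (int) ((blockId >> 16) & 0xff);
    return new File(baseDir, d1 + File.separator + d2);
  }
}
```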

{code}
import org.apache.hadoop.hdfs.server.datanode.*;
{code}

We generally don't use wildcard imports... I think maybe your editor did this automatically.
IntelliJ did that to me once :)  There's a setting in IntelliJ to turn that off.

{code}
  // directory store Finalized replica
  private final IdBasedBlockDirectory finalizedDir;
{code}
While we're moving the comment, let's also make it grammatical (e.g. "directory that stores finalized replicas").

{code}
    this.finalizedDir = new IdBasedBlockDirectory(finalizedDir);
{code}
It seems kind of tricky to have two variables with the same name.  I would say rename one
or the other, or don't bother with a local variable for finalizedDir at all (nested new statements).
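For instance (with a stand-in inner class, since the real IdBasedBlockDirectory isn't shown here), nesting the new expressions removes the shadowed local entirely:

```java
import java.io.File;

public class DirHolderSketch {
  // Minimal stand-in for IdBasedBlockDirectory, just so the sketch compiles.
  static class IdBasedDir {
    final File base;
    IdBasedDir(File base) { this.base = base; }
  }

  private final IdBasedDir finalizedDir;

  // Nesting the constructor calls means no local File variable shares a
  // name with the finalizedDir field.
  public DirHolderSketch(File currentDir) {
    this.finalizedDir = new IdBasedDir(new File(currentDir, "finalized"));
  }

  IdBasedDir getFinalizedDir() { return finalizedDir; }
}
```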

{code}
  static private long hashBlockId(long n) {
    return (n + 378734493671000L) * 9223372036854775783L;
  }

  public File getDirectory(long blockId) throws IOException {
    long h = hashBlockId(blockId);
    int d1 = (int)((h >> 56) & 0xff);
    int d2 = (int)((h >> 48) & 0xff);
{code}

So you're creating a path which looks like: a/b/c

How about taking bits 8-16 for a, bits 16-24 for b, and the rest for c?  (Notice that the
lowest bits are part of c.)

This has some nice effects.  In combination with our sequential block allocation strategy,
it means that the first 256 files all go in the same directory, avoiding the need to make
2 directories per file.  The next 256 go in a different directory, and so on.

The thing to keep in mind is that we don't really want each block in its own directory...
we just want to avoid overloading directories.  We should eschew hashing so that we never
need to worry about collisions.  With the scheme I outlined, we can go up to roughly a
billion blocks (2**30) without ever exceeding 16384 files per directory.  At a billion
blocks per Datanode, we have bigger problems than directory structure, of course :)
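In code, the split I'm describing would be something like this (names and the exact path format are just for the sketch):

```java
public class DirLayoutSketch {
  // Bits 8-16 of the block ID pick the first directory level, bits 16-24
  // pick the second, and the low bits stay in the file name. With
  // sequential block IDs, 256 consecutive blocks land in the same
  // directory before the path changes.
  public static String relativePath(long blockId) {
    int d1 = (int) ((blockId >> 8) & 0xff);
    int d2 = (int) ((blockId >> 16) & 0xff);
    return d1 + "/" + d2 + "/blk_" + blockId;
  }
}
```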

> Use block ID-based block layout on datanodes
> --------------------------------------------
>
>                 Key: HDFS-6482
>                 URL: https://issues.apache.org/jira/browse/HDFS-6482
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.5.0
>            Reporter: James Thomas
>            Assignee: James Thomas
>         Attachments: HDFS-6482.patch
>
>
> Right now blocks are placed into directories that are split into many subdirectories
when capacity is reached. Instead we can use a block's ID to determine the path it should
go in. This eliminates the need for the LDir data structure that facilitates the splitting
of directories when they reach capacity as well as fields in ReplicaInfo that keep track of
a replica's location.
> An extension of the work in HDFS-3290.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
