crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-644) Set HDFS node affinity on created HFiles to improve locality
Date Thu, 27 Apr 2017 14:55:04 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Reid updated CRUNCH-644:
--------------------------------
    Attachment: CRUNCH-644.patch

Attached is a patch which sets the preferred node at the time of HFile creation. I've tested
this patch on a multi-node cluster and verified that data locality is 100% after a bulk load
(before the patch, the same bulk load resulted in data locality of about 30%).
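
As a rough illustration of what a preferred-node hint looks like (this is not taken from the
attached patch), the sketch below builds the array of favored nodes for one region server.
The class name, method name, host name and port are all hypothetical. The array is intended
for the overload of HDFS's DistributedFileSystem.create that takes a trailing
InetSocketAddress[] favoredNodes argument; that call is only mentioned in a comment so the
sketch compiles without Hadoop on the classpath.

```java
import java.net.InetSocketAddress;

public class FavoredNodes {

    /**
     * Hypothetical helper: builds a one-element favored-node hint pointing at
     * the data node that is assumed to run on the region server's machine.
     */
    public static InetSocketAddress[] favoredNodesFor(String regionServerHost, int dataNodePort) {
        if (regionServerHost == null || regionServerHost.isEmpty()) {
            throw new IllegalArgumentException("regionServerHost must not be empty");
        }
        // createUnresolved skips the DNS lookup; whether a resolved address
        // is preferable in a real cluster is left open here.
        return new InetSocketAddress[] {
            InetSocketAddress.createUnresolved(regionServerHost, dataNodePort)
        };
    }

    public static void main(String[] args) {
        // Host and port are example values only.
        InetSocketAddress[] favored = favoredNodesFor("rs1.example.com", 50010);
        System.out.println(favored[0].getHostString() + ":" + favored[0].getPort());
        // With Hadoop available, the array would be handed to
        // DistributedFileSystem.create(..., favoredNodes) when opening the
        // output stream for the HFile. The hint is a suggestion to the
        // NameNode, not a guarantee.
    }
}
```

The hint is advisory, so locality after a bulk load still needs to be measured on the cluster.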

> Set HDFS node affinity on created HFiles to improve locality
> ------------------------------------------------------------
>
>                 Key: CRUNCH-644
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-644
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>         Attachments: CRUNCH-644.patch
>
>
> When creating HFiles via the {{HFileUtils.writeToHFilesForIncrementalLoad}} method, the
underlying HDFS blocks of the created HFiles will end up on a selection of HDFS data nodes
-- the selection of which nodes is left up to the HDFS NameNode. This means that there is
only a relatively small chance (depending on cluster size and replication factor) that the
created HFiles will end up on the same physical machine as the region server which will make
use of these HFiles. This limits the ability to use short-circuit reads to the local file
system. Typically, this lack of locality is only fully resolved after a major compaction.
> It's possible to set a node affinity on HDFS files at creation time, to provide a suggestion
to the NameNode about a preferred data node on which to place blocks. The intention of
this ticket is to make use of this functionality to set the node affinity during HFile creation
in {{HFileUtils.writeToHFilesForIncrementalLoad}} so that at least one (HDFS) block of each
created HFile will be located on the same physical machine as the region server which will
be using the file (assuming HDFS data nodes are running on the same machines as HBase region
servers).
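
To put a rough number on the "relatively small chance" mentioned in the description, the
sketch below estimates the probability that a given machine holds one replica of a block. It
assumes replicas are spread uniformly at random over distinct data nodes, which is a
simplification: real HDFS placement also considers racks and the location of the writer. The
class and method names are hypothetical.

```java
public class LocalityEstimate {

    /**
     * Probability that one particular data node holds a replica of a block,
     * assuming `replication` replicas are placed uniformly at random on
     * distinct nodes out of `dataNodes`. This is a simplified model, not the
     * actual HDFS block placement policy.
     */
    public static double chanceOfLocalReplica(int dataNodes, int replication) {
        if (dataNodes <= 0 || replication <= 0) {
            throw new IllegalArgumentException("dataNodes and replication must be positive");
        }
        // With at least as many replicas as nodes, every node holds a copy.
        return Math.min(1.0, (double) replication / dataNodes);
    }

    public static void main(String[] args) {
        int replication = 3; // example value
        for (int dataNodes : new int[] {5, 10, 50, 100}) {
            System.out.printf("%d data nodes, replication %d: %.2f%n",
                    dataNodes, replication, chanceOfLocalReplica(dataNodes, replication));
        }
    }
}
```

Under this model the chance shrinks in proportion to cluster size, which is why an explicit
node affinity matters more on larger clusters.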



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
