hbase-issues mailing list archives

From "Jerry He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
Date Sun, 10 Dec 2017 03:15:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285062#comment-16285062 ]

Jerry He commented on HBASE-15482:

The patch looks good!
I just think the first patch, 000, is cleaner.  But, as Ted suggested, rename hbase.TableSnapshotInputFormat.locality
to hbase.TableSnapshotInputFormat.locality.enable, and rename the constant SNAPSHOT_INPUTFORMAT_CARE_BLOCK_LOCALITY_KEY
to match.  The other changes look unnecessary and only make the patch more complicated.
if (careBlockLocality) {
  Assert.assertTrue(split.getLocations() != null && split.getLocations().length !=
} else {
  Assert.assertTrue(split.getLocations() != null && split.getLocations().length ==
This is ok too.  The first test is an existing test, and it has not failed previously.
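
For readers following the thread, here is a minimal, self-contained sketch of the behavior the test above checks. It is not the patch itself: the class LocalitySketch, the methods locationsForSplit and computeBestLocations, the use of a plain Map in place of a Hadoop Configuration, the default of "true", and the cap of three hosts are all assumptions made for illustration. Only the key name comes from the review comment.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalitySketch {
    // Key name as suggested in the review; defaulting to "true" is an assumption.
    static final String LOCALITY_ENABLED_KEY =
        "hbase.TableSnapshotInputFormat.locality.enable";

    // Hypothetical stand-in for the expensive block-location calculation:
    // here it simply keeps up to three hosts from an already-ranked list.
    static String[] computeBestLocations(List<String> rankedHosts) {
        return rankedHosts.stream().limit(3).toArray(String[]::new);
    }

    // Returns the hosts to attach to a split. When locality is disabled the
    // calculation is skipped and an empty, non-null array is returned.
    static String[] locationsForSplit(Map<String, String> conf, List<String> rankedHosts) {
        boolean careBlockLocality =
            Boolean.parseBoolean(conf.getOrDefault(LOCALITY_ENABLED_KEY, "true"));
        if (!careBlockLocality) {
            return new String[0];
        }
        return computeBestLocations(rankedHosts);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        List<String> hosts = Arrays.asList("host1", "host2");

        // Flag unset: locations are computed.
        System.out.println(locationsForSplit(conf, hosts).length);

        // Flag set to false: locations are skipped.
        conf.put(LOCALITY_ENABLED_KEY, "false");
        System.out.println(locationsForSplit(conf, hosts).length);
    }
}
```

Returning an empty array rather than null keeps both branches of the test's non-null check valid, so only the length comparison differs between the two cases.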

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -----------------------------------------------------------------------------
>                 Key: HBASE-15482
>                 URL: https://issues.apache.org/jira/browse/HBASE-15482
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Liyin Tang
>            Assignee: Xiang Li
>            Priority: Minor
>             Fix For: 2.1.0
>         Attachments: HBASE-15482.master.000.patch, HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
> When an MR job reads from SnapshotInputFormat, it needs to calculate the splits based
on the block locations in order to get the best locality. However, this process may take a long
time for large snapshots.
> In some setups, the computing layer (Spark, Hive, or Presto) could run outside of the HBase
cluster. In these scenarios, block locality doesn't matter. Therefore, it would be great
to have an option to skip calculating the block locations for every job. That would be super useful
for the Hive/Presto/Spark connectors.

This message was sent by Atlassian JIRA
