hbase-issues mailing list archives

From "Xiang Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
Date Tue, 19 Dec 2017 17:08:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297084#comment-16297084 ]

Xiang Li commented on HBASE-15482:

Regarding the UT:
For mapred, the number of splits generated (10) exactly matches the numRegions specified (10),
but only 8 of them have a non-empty location array.
For mapreduce, only 8 splits are generated when there are 10 regions.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -----------------------------------------------------------------------------
>                 Key: HBASE-15482
>                 URL: https://issues.apache.org/jira/browse/HBASE-15482
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Liyin Tang
>            Assignee: Xiang Li
>            Priority: Minor
>             Fix For: 2.1.0
>         Attachments: 15482.v3.txt, HBASE-15482.master.000.patch, HBASE-15482.master.001.patch, HBASE-15482.master.002.patch, HBASE-15482.master.003.patch
> When an MR job reads from SnapshotInputFormat, it needs to calculate the splits based on the block locations in order to get the best locality. However, this process may take a long time for large snapshots.
> In some setups, the computing layer (Spark, Hive, or Presto) could run outside of the HBase cluster. In these scenarios, block locality doesn't matter. Therefore, it would be great to have an option to skip calculating the block locations for every job. That would be super useful for the Hive/Presto/Spark connectors.
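
The requested option can be illustrated with a minimal, self-contained sketch. It is not the code from the attached patches: the class SnapshotSplitSketch, the LocalityCalculator interface, and the localityEnabled flag are all hypothetical names made up for illustration, and the real configuration key is not stated in this message. The sketch assumes one split per region and models the expensive block-location lookup as a pluggable callback that is skipped entirely when the flag is off.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model of split generation; these are NOT HBase classes.
final class SnapshotSplitSketch {

    /** One split: a region name plus the hosts preferred for reading it. */
    static final class Split {
        final String regionName;
        final List<String> locations;

        Split(String regionName, List<String> locations) {
            this.regionName = regionName;
            this.locations = locations;
        }
    }

    /** Stand-in for the expensive per-region block-location calculation. */
    interface LocalityCalculator {
        List<String> bestHosts(String regionName);
    }

    /**
     * Builds one split per region. When localityEnabled is false, the
     * calculator is never called and every split gets an empty location list.
     */
    static List<Split> getSplits(List<String> regions,
                                 boolean localityEnabled,
                                 LocalityCalculator calculator) {
        List<Split> splits = new ArrayList<>();
        for (String region : regions) {
            List<String> hosts = localityEnabled
                    ? calculator.bestHosts(region)
                    : Collections.<String>emptyList();
            splits.add(new Split(region, hosts));
        }
        return splits;
    }
}
```

In this sketch, turning the flag off changes only the location list of each split, not the number of splits, so a scheduler that ignores locality (for example a compute layer running outside the HBase cluster) loses nothing. Whether the actual patch behaves exactly this way is not confirmed by this message.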

This message was sent by Atlassian JIRA
