hbase-issues mailing list archives

From "Anoop Sam John (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8073) HFileOutputFormat support for offline operation
Date Mon, 13 Apr 2015 07:33:13 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492013#comment-14492013 ]

Anoop Sam John commented on HBASE-8073:

The approach in V2 seems OK. It allows specifying the splits as well as other important
table/CF attributes such as compression and block size. A DFS-level read would have allowed
this information to be read by code, but user permissions can be a real issue there. Because
this change allows passing in the splits, one can put a smarter input data sampler in front,
compute the splits, and pass them to configureIncrementalLoad().
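
To illustrate the "sampler in front" idea, here is a minimal sketch of turning a sample of row keys into split points. It assumes the sample is already in memory and that keys compare as unsigned bytes in lexicographic order. The class and method names (SplitPointSketch, chooseSplits) are hypothetical and not part of the patch, and handing the resulting array to configureIncrementalLoad() is deliberately left out because the exact signature depends on the patch version.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of HBASE-8073: picks evenly spaced split
// points from a sample of row keys.
final class SplitPointSketch {

    // Unsigned lexicographic order over byte arrays (assumed row-key order).
    static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    };

    // Returns at most numPartitions - 1 strictly increasing split keys.
    static byte[][] chooseSplits(List<byte[]> sampledKeys, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be >= 1");
        }
        List<byte[]> sorted = new ArrayList<>(sampledKeys);
        sorted.sort(UNSIGNED_LEX);

        List<byte[]> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions && !sorted.isEmpty(); i++) {
            // Evenly spaced position in the sorted sample; always < sorted.size().
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            byte[] candidate = sorted.get(idx);
            // Skip duplicates so the split keys stay strictly increasing.
            if (splits.isEmpty()
                    || UNSIGNED_LEX.compare(splits.get(splits.size() - 1), candidate) < 0) {
                splits.add(candidate);
            }
        }
        return splits.toArray(new byte[0][]);
    }

    public static void main(String[] args) {
        List<byte[]> sample = new ArrayList<>();
        for (String k : new String[] {"h", "b", "d", "a", "g", "c", "f", "e"}) {
            sample.add(k.getBytes(StandardCharsets.UTF_8));
        }
        // Four partitions need three split keys; this prints c, e, g.
        for (byte[] split : chooseSplits(sample, 4)) {
            System.out.println(new String(split, StandardCharsets.UTF_8));
        }
    }
}
```

A skewed sample yields skewed partitions, so the quality of the splits depends entirely on how representative the sample is.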

> HFileOutputFormat support for offline operation
> -----------------------------------------------
>                 Key: HBASE-8073
>                 URL: https://issues.apache.org/jira/browse/HBASE-8073
>             Project: HBase
>          Issue Type: Sub-task
>          Components: mapreduce
>            Reporter: Nick Dimiduk
>             Fix For: 1.1.0
>         Attachments: HBASE-8073-trunk-v0.patch, HBASE-8073-trunk-v1.patch
> When using HFileOutputFormat to generate HFiles, it inspects the region topology of the
> target table. The split points from that table are used to guide the TotalOrderPartitioner.
> If the target table does not exist, it is first created. This imposes an unnecessary dependence
> on an online HBase and existing table.
> If the table exists, it can be used. However, the job can be smarter. For example, if
> there's far more data going into the HFiles than the table currently contains, the table regions
> aren't very useful as data split points. Instead, the input data can be sampled to produce
> split points more meaningful to the dataset. LoadIncrementalHFiles is already capable of handling
> divergence between HFile boundaries and table regions, so this should not pose any additional
> burden at load time.
> The proper method of sampling the data likely requires a custom input format and an additional
> map-reduce job to perform the sampling. See a relevant implementation: https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sampler/ReservoirSamplerInputFormat.java
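
For readers unfamiliar with the technique named in the linked class, the following is a minimal single-process sketch of reservoir sampling (the classic "Algorithm R"). It is not the linked implementation and is not a Hadoop InputFormat. The class and method names (ReservoirSketch, sample) are hypothetical, and it assumes the records can be streamed through one iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of reservoir sampling: keeps a uniform random
// sample of k items from a stream whose length is not known in advance.
final class ReservoirSketch {

    static <T> List<T> sample(Iterator<T> records, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be >= 0");
        }
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (records.hasNext()) {
            T item = records.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k records are always kept.
                reservoir.add(item);
            } else {
                // Draw j uniformly from [0, seen); the new record replaces
                // slot j when j < k, which happens with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        // Ten rows chosen uniformly at random; the seed is fixed for repeatability.
        System.out.println(sample(rows.iterator(), 10, new Random(42)));
    }
}
```

The sample uses memory proportional to k only, which is why the technique suits input that is too large to sort or count up front.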

