hbase-dev mailing list archives

From "Nick Dimiduk (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-8073) HFileOutputFormat support for offline operation
Date Tue, 12 Mar 2013 00:39:12 GMT
Nick Dimiduk created HBASE-8073:
-----------------------------------

             Summary: HFileOutputFormat support for offline operation
                 Key: HBASE-8073
                 URL: https://issues.apache.org/jira/browse/HBASE-8073
             Project: HBase
          Issue Type: New Feature
          Components: mapreduce
            Reporter: Nick Dimiduk


When HFileOutputFormat is used to generate HFiles, it inspects the region topology of the target
table and uses that table's split points to guide the TotalOrderPartitioner. If the
target table does not exist, it is first created. This imposes an unnecessary dependence on
an online HBase cluster and an existing table.
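
To illustrate how split points guide partitioning, here is a minimal sketch of a total-order
lookup by binary search. It is not the actual TotalOrderPartitioner code: the class and method
names are hypothetical, and row keys are modeled as Strings for readability, whereas HBase row
keys are byte arrays compared lexicographically.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of HBase or Hadoop.
final class SplitPointPartitionSketch {

    /**
     * Returns the partition index for a key, given sorted split points.
     * Partition 0 holds keys below splits[0]; partition i holds keys in
     * [splits[i-1], splits[i]); the last partition holds keys >= the last split.
     */
    static int partitionFor(String[] sortedSplits, String key) {
        int pos = Arrays.binarySearch(sortedSplits, key);
        // Found: the key equals split point `pos`, which starts partition pos + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is the partition index.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] splits = {"g", "n", "t"};
        for (String key : new String[] {"a", "g", "m", "z"}) {
            System.out.println(key + " -> partition " + partitionFor(splits, key));
        }
    }
}
```

With N split points there are N + 1 partitions, so each reducer writes HFiles covering one
contiguous key range.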

If the table exists, it can be used. However, the job can be smarter. For example, if there's
far more data going into the HFiles than the table currently contains, the table regions aren't
very useful for data split points. Instead, the input data can be sampled to produce split
points more meaningful to the dataset. LoadIncrementalHFiles is already capable of handling
divergence between HFile boundaries and table regions, so this should not pose any additional
burden at load time.

The proper method of sampling the data likely requires a custom input format and an additional
map-reduce job to perform the sampling. See a relevant implementation: https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sampler/ReservoirSamplerInputFormat.java
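
As a rough sketch of the sampling idea, the following single-process example draws a uniform
reservoir sample of row keys and picks evenly spaced keys from the sorted sample as split points.
It is an assumption-laden illustration, not the linked input format and not a proposed patch:
the class and method names are hypothetical, keys are modeled as Strings rather than byte
arrays, and a real job would run the sampling across input splits in map-reduce.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not part of HBase or Hadoop.
final class SplitPointSampler {

    /** Reservoir sampling (Algorithm R): one pass, uniform sample of up to sampleSize keys. */
    static List<String> reservoirSample(Iterator<String> keys, int sampleSize, Random rng) {
        List<String> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        while (keys.hasNext()) {
            String key = keys.next();
            seen++;
            if (reservoir.size() < sampleSize) {
                reservoir.add(key); // fill the reservoir first
            } else {
                // Keep the new key with probability sampleSize / seen.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, key);
                }
            }
        }
        return reservoir;
    }

    /** Picks up to numPartitions - 1 distinct, evenly spaced keys from the sorted sample. */
    static List<String> splitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        if (sorted.isEmpty()) {
            return splits;
        }
        for (int i = 1; i < numPartitions; i++) {
            String candidate = sorted.get((int) ((long) i * sorted.size() / numPartitions));
            // Skip duplicates so split points stay strictly increasing.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            keys.add(String.format("row-%05d", i));
        }
        List<String> sample = reservoirSample(keys.iterator(), 500, new Random(42));
        System.out.println(splitPoints(sample, 8));
    }
}
```

The resulting split points could then be written to a partition file for the
TotalOrderPartitioner, so that no online table is needed to derive them.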

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
