hbase-issues mailing list archives

From "Josh Wymer (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs
Date Fri, 06 Jan 2012 20:50:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181579#comment-13181579 ]

Josh Wymer commented on HBASE-5140:

Correct, but consider a table with one region: getStartEndKeys() returns two empty byte[]
values. The last (or only) region of a table returns an empty byte[] as its end key, which
lets the scan run to the end of the table. We therefore don't know the upper-bound byte[]
to convert into the long (or int, etc.) value needed for the split calculations.
So we must either find an efficient way to get the last key in this case, or arbitrarily set
the long to its max value (since nothing could be higher in any case) and use that number
in the calculations. This obviously won't work for unbounded data types like BigDecimal
and is a partial solution at best.
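
To make the workaround concrete, here is a minimal sketch of the "fall back to the max value"
option. It assumes row keys are non-negative longs stored as 8-byte big-endian values; the
class and method names (EndKeySketch, keyToLong, lowerBound, upperBound) are hypothetical and
not part of HBase.

```java
import java.nio.ByteBuffer;

final class EndKeySketch {
    /** Hypothetical helper: reads an 8-byte big-endian row key as a long. */
    static long keyToLong(byte[] key) {
        if (key.length != Long.BYTES) {
            throw new IllegalArgumentException("expected an 8-byte key, got " + key.length);
        }
        return ByteBuffer.wrap(key).getLong();
    }

    /** Lower bound for split arithmetic; an empty start key means "start of table". */
    static long lowerBound(byte[] startKey) {
        // Assumption: keys are non-negative, so 0 is the smallest possible key.
        return startKey.length == 0 ? 0L : keyToLong(startKey);
    }

    /** Upper bound for split arithmetic; an empty end key means "end of table". */
    static long upperBound(byte[] endKey) {
        // The real last key is unknown, so fall back to the largest value a long can hold.
        return endKey.length == 0 ? Long.MAX_VALUE : keyToLong(endKey);
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0]; // what a single-region table reports for both keys
        System.out.println(lowerBound(empty) + " .. " + upperBound(empty));
    }
}
```

The fallback only bounds the arithmetic. If the real keys occupy a small part of the long
range, most of the resulting splits would be empty, which is why this is a partial solution.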
> TableInputFormat subclass to allow N number of splits per region during MR jobs
> -------------------------------------------------------------------------------
>                 Key: HBASE-5140
>                 URL: https://issues.apache.org/jira/browse/HBASE-5140
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce
>            Reporter: Josh Wymer
>            Priority: Trivial
>   Original Estimate: 72h
>  Remaining Estimate: 72h
> In regard to [HBASE-5138|https://issues.apache.org/jira/browse/HBASE-5138], I am working
on a subclass of the TableInputFormat class that overrides getSplits in order to generate
N splits per region and/or N splits per job. The idea is to convert the
startKey and endKey of each region from byte[] to BigDecimal, take the difference, divide
by N, convert back to byte[] and generate splits on the resulting values. Assuming your keys
are evenly distributed, this should generate splits with nearly the same number of rows each.
Any suggestions on this issue are welcome.
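
The arithmetic described above could be sketched roughly as follows. This is not the proposed
patch: it uses BigInteger rather than BigDecimal (the keys are treated as unsigned integers),
it assumes both keys are non-empty, and the names SplitSketch, pad, toFixedWidth and
boundaries are hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitSketch {
    /** Right-pads a key with zero bytes so both keys have the same width. */
    static byte[] pad(byte[] key, int width) {
        return Arrays.copyOf(key, width);
    }

    /** Converts a non-negative value back to exactly `width` big-endian bytes. */
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte or be shorter than width
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    /** Returns n + 1 boundaries: startKey, n - 1 interior split points, endKey. */
    static List<byte[]> boundaries(byte[] startKey, byte[] endKey, int n) {
        if (n < 1) throw new IllegalArgumentException("n must be at least 1");
        int width = Math.max(startKey.length, endKey.length);
        BigInteger start = new BigInteger(1, pad(startKey, width)); // signum 1 = unsigned
        BigInteger end = new BigInteger(1, pad(endKey, width));
        if (start.compareTo(end) >= 0) {
            throw new IllegalArgumentException("startKey must sort before endKey");
        }
        BigInteger range = end.subtract(start);
        List<byte[]> result = new ArrayList<>();
        result.add(startKey); // keep the region's own keys at both ends
        for (int i = 1; i < n; i++) {
            // start + range * i / n; boundaries may coincide if range is smaller than n
            BigInteger offset = range.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n));
            result.add(toFixedWidth(start.add(offset), width));
        }
        result.add(endKey);
        return result;
    }

    public static void main(String[] args) {
        for (byte[] b : boundaries(new byte[] {0x00}, new byte[] {0x64}, 4)) {
            System.out.println(Arrays.toString(b));
        }
    }
}
```

Each adjacent pair of boundaries would become the start and stop row of one split. An actual
getSplits override would also need to skip duplicate boundaries and handle empty keys.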


