hbase-issues mailing list archives

From "rajeshbabu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9556) Provide key range support to bulkload to avoid too many reducers even the data belongs to few regions
Date Tue, 17 Sep 2013 16:57:01 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769676#comment-13769676 ]

rajeshbabu commented on HBASE-9556:
-----------------------------------

If the user does not specify a start and/or end key, we set the number of reducers to the
region count and use the region start keys as split points. If the user knows the key range
beforehand, we can identify the proper split points within that range and reduce the number of reducers.
{code}
    List<ImmutableBytesWritable> startKeys = getRegionStartKeys(table);
    LOG.info("Configuring " + startKeys.size() + " reduce partitions " +
        "to match current region count");
    job.setNumReduceTasks(startKeys.size());

    configurePartitioner(job, startKeys);
{code}
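
To illustrate the idea, here is a minimal sketch of how the split points could be narrowed to a key range. It is not the patch for this issue: the class name KeyRangeSplits, the method startKeysInRange, and the choice to treat the end key as inclusive are all assumptions made for the example. It uses plain byte arrays rather than ImmutableBytesWritable so that it runs without HBase on the classpath (Java 9+ for Arrays.compareUnsigned).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: selects the region start keys
// whose regions overlap a user-supplied row key range.
final class KeyRangeSplits {

    /**
     * regionStartKeys must be sorted; region i covers [key_i, key_{i+1}),
     * and the last region is unbounded above. A null startKey or endKey
     * means "unbounded" on that side. The end key is treated as inclusive
     * (an assumption for this sketch).
     */
    static List<byte[]> startKeysInRange(List<byte[]> regionStartKeys,
                                         byte[] startKey, byte[] endKey) {
        List<byte[]> selected = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            byte[] regionStart = regionStartKeys.get(i);
            boolean isLast = (i == regionStartKeys.size() - 1);
            // The region ends at or before the start of the range: skip it.
            if (startKey != null && !isLast
                    && Arrays.compareUnsigned(regionStartKeys.get(i + 1), startKey) <= 0) {
                continue;
            }
            // The region starts after the end of the range: no later region overlaps.
            if (endKey != null && Arrays.compareUnsigned(regionStart, endKey) > 0) {
                break;
            }
            selected.add(regionStart);
        }
        return selected;
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Five regions with start keys "", "b", "d", "f", "h".
        List<byte[]> regions = new ArrayList<>();
        for (String s : new String[] {"", "b", "d", "f", "h"}) {
            regions.add(bytes(s));
        }
        List<byte[]> splits = startKeysInRange(regions, bytes("c"), bytes("e"));
        System.out.println("reduce partitions needed: " + splits.size());
    }
}
```

With such a filtered list, the existing calls would presumably stay the same: the filtered list's size would be passed to job.setNumReduceTasks and the list itself to configurePartitioner. When neither key is given, the sketch returns every start key, which matches the current behaviour.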
                
> Provide key range support to bulkload to avoid too many reducers even the data belongs to few regions
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-9556
>                 URL: https://issues.apache.org/jira/browse/HBASE-9556
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: rajeshbabu
>            Assignee: rajeshbabu
>            Priority: Minor
>
> Presently the number of reducers in a bulk load is equal to the number of regions.
> Suppose a table has 500 regions and the import data belongs to only 10 of them; we still
> start 500 reducers (one per region) instead of 10, which consumes more time and resources.
> If the user knows the row key range of the import data, we can pass a start key and/or end key
> as input and, based on that range, define the partitions and the number of reducers (the regions
> to which the data belongs). This avoids starting many reducers that do nothing
> and also avoids contention in shuffling.
