hbase-issues mailing list archives

From "rajeshbabu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-8768) Improve bulk load performance by moving key value construction from map phase to reduce phase.
Date Fri, 26 Jul 2013 12:31:49 GMT

     [ https://issues.apache.org/jira/browse/HBASE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

rajeshbabu updated HBASE-8768:

    Attachment: HBASE-8768_v2.patch

Patch for trunk. New custom mapper and reducer are introduced in this patch.

TsvImporterTextMapper: Parses the row key and writes the line bytes as-is to the map output.
A separate mapper is needed because the existing mapper is still used in the non-bulk-load
case, where Puts are written directly to the table.

TextSortReducer: Parses the values, builds KVs from each text line, and writes them to the
HFile after sorting the KVs of each row.
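
The division of work between the two classes can be sketched roughly as below. This is a plain-Java illustration only, not the code in the patch: it uses no Hadoop or HBase classes, the class name TextBulkLoadSketch, the COLUMNS layout, and the string form of a cell are all hypothetical, a TreeMap stands in for the shuffle, and plain string ordering stands in for the real KeyValue comparator.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TextBulkLoadSketch {
    // Hypothetical column layout: field 0 is the row key, fields 1..n map to these columns.
    static final String[] COLUMNS = {"cf:b", "cf:a"};

    // Map step: extract only the row key and pass the raw line through unchanged.
    static Map.Entry<String, String> map(String line) {
        int tab = line.indexOf('\t');
        String rowKey = tab < 0 ? line : line.substring(0, tab);
        return new AbstractMap.SimpleEntry<>(rowKey, line);
    }

    // Reduce step: parse the values of one row's lines, build cells, and sort them.
    static List<String> reduce(String rowKey, List<String> lines) {
        List<String> cells = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t", -1);
            for (int i = 1; i < fields.length && i <= COLUMNS.length; i++) {
                // String stand-in for a KeyValue: row/family:qualifier=value
                cells.add(rowKey + "/" + COLUMNS[i - 1] + "=" + fields[i]);
            }
        }
        Collections.sort(cells); // stand-in for the KeyValue ordering
        return cells;
    }

    // Runs map, a simulated shuffle (grouping by sorted row key), then reduce.
    static List<String> run(List<String> input) {
        TreeMap<String, List<String>> shuffled = new TreeMap<>();
        for (String line : input) {
            Map.Entry<String, String> kv = map(line);
            shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("row2\tx\ty", "row1\tp\tq");
        for (String cell : run(input)) {
            System.out.println(cell);
        }
    }
}
```

Only the raw line crosses the simulated shuffle here; the row key, family, and qualifier are attached to each value on the reduce side.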

Please review the patch.

Thanks [~jyothi.mandava] for internal review and performance report.

> Improve bulk load performance by moving key value construction from map phase to reduce
> ----------------------------------------------------------------------------------------------
>                 Key: HBASE-8768
>                 URL: https://issues.apache.org/jira/browse/HBASE-8768
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce, Performance
>            Reporter: rajeshbabu
>            Assignee: rajeshbabu
>         Attachments: HBASE-8768_v2.patch
> The ImportTSV bulk-loading approach uses the MapReduce framework. The existing mapper and reducer
classes used by ImportTSV are TsvImporterMapper.java and PutSortReducer.java. The ImportTSV tool
parses the tab-separated (by default) values from the input files, and the mapper class generates
the Put objects for each row using the key-value pairs created from the parsed text. PutSortReducer
then uses partitions based on the regions and sorts the Put objects for each region.
> Overheads we can see in the above approach:
> ==========================================
> 1) KeyValue construction for each parsed value in the line adds extra data such as row key, column family, and qualifier,
which increases the data to be shuffled in the reduce phase by around 5x.
> We can calculate the size of the data to be shuffled as below
> {code}
>  Data to be shuffled = nl*nt*(rl+cfl+cql+vall+tsl+30)
> {code}
> If we move KeyValue construction to the reduce phase, the size of the data to be shuffled becomes
the following, which is much smaller than the above.
> {code}
>  Data to be shuffled = nl*nt*vall
> {code}
> nl - Number of lines in the raw file
> nt - Number of tabs or columns including row key.
> rl - row length which will be different for each line.
> cfl - column family length which will be different for each family
> cql - qualifier length
> tsl - timestamp length.
> vall- each parsed value length.
> 30 bytes for kv size,number of families etc.
> 2) On the mapper side we create Put objects by adding all the KeyValues constructed for
each line, and in the reducer we again collect the KeyValues from each Put and sort them.
> Instead we can create and sort the KeyValues directly in the reducer.
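
As a rough illustration of the two formulas above, the sketch below plugs in sample numbers. All input values are hypothetical and treat the per-line lengths as averages; the class and method names are made up for this example. Only the 30-byte overhead and the two formulas come from the description.

```java
class ShuffleSizeEstimate {
    // Fixed per-KeyValue overhead quoted in the description (kv size, number of families, etc.).
    static final long KV_OVERHEAD_BYTES = 30;

    // Data to be shuffled = nl*nt*(rl+cfl+cql+vall+tsl+30)
    static long withKeyValues(long nl, long nt, long rl, long cfl, long cql, long vall, long tsl) {
        return nl * nt * (rl + cfl + cql + vall + tsl + KV_OVERHEAD_BYTES);
    }

    // Data to be shuffled = nl*nt*vall
    static long withRawText(long nl, long nt, long vall) {
        return nl * nt * vall;
    }

    public static void main(String[] args) {
        // Hypothetical averages: 1M lines, 10 columns, 10-byte row keys and values,
        // 2-byte family, 5-byte qualifier, 8-byte timestamp.
        long nl = 1_000_000, nt = 10, rl = 10, cfl = 2, cql = 5, vall = 10, tsl = 8;
        long kv = withKeyValues(nl, nt, rl, cfl, cql, vall, tsl);
        long raw = withRawText(nl, nt, vall);
        System.out.println("KeyValues shuffled: " + kv + " bytes");
        System.out.println("Raw text shuffled:  " + raw + " bytes");
        System.out.println("Ratio: " + ((double) kv / raw));
    }
}
```

With these made-up inputs the KeyValue form comes to 650,000,000 bytes against 100,000,000 bytes of raw values. The actual ratio depends on the real key, family, qualifier, and value lengths.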
> Solution:
> ========
> We can improve bulk load performance by moving the key-value construction from the mapper
to the reducer, so that the mapper just sends the raw text for each row to the reducer. The reducer then
parses the records for each row, then creates and sorts the key-value pairs before writing to HFiles.

> Conclusion:
> ===========
> The above suggestions will improve map phase performance by avoiding KeyValue construction
and reduce phase performance by reducing the amount of data to be shuffled.

