hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-48) [hbase] Bulk load and dump tools
Date Fri, 31 Jul 2009 21:50:14 GMT

     [ https://issues.apache.org/jira/browse/HBASE-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-48:

    Attachment: 48.patch

Here is a patch that adds two classes to MapReduce: KeyValueSortReducer and HFileOutputFormat.
It also adds a small test class that runs an MR job with a custom mapper and inputformat.
The inputformat produces PerformanceEvaluation-style keys and values (keys are zero-padded
longs and values are 1k of random bytes).  The mapper takes this input, outputs the
key as the row, and makes a KeyValue from the row, a defined column, and the value.
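
As a rough illustration of the test data described above, here is a minimal sketch in
plain Java with no HBase dependencies.  The class name PeStyleRows, the padding width of 10,
and the fixed seed are assumptions for illustration; they are not taken from 48.patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Random;

/** Sketch of PerformanceEvaluation-style data: zero-padded long keys, random 1k values. */
final class PeStyleRows {
    static final int KEY_WIDTH = 10;    // assumed padding width, not from the patch
    static final int VALUE_SIZE = 1024; // "random 1k of bytes"

    /** Zero-pads a non-negative long so that byte order matches numeric order. */
    static byte[] rowKey(long i) {
        String padded = String.format(Locale.ROOT, "%0" + KEY_WIDTH + "d", i);
        return padded.getBytes(StandardCharsets.UTF_8);
    }

    /** Returns VALUE_SIZE random bytes drawn from the supplied generator. */
    static byte[] randomValue(Random rng) {
        byte[] value = new byte[VALUE_SIZE];
        rng.nextBytes(value);
        return value;
    }

    public static void main(String[] args) {
        Random rng = new Random(42L); // fixed seed only to make runs repeatable
        for (long i = 0; i < 3; i++) {
            String row = new String(rowKey(i), StandardCharsets.UTF_8);
            System.out.println(row + " -> " + randomValue(rng).length + " bytes");
        }
    }
}
```

Zero padding matters because rows are compared as bytes: without it, "10" would sort
before "9".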

KeyValueSortReducer takes an ImmutableBytesWritable as its input key/row.  It then pulls on
the Iterator to read in all of the passed KeyValues, sorts them, and then outputs
the sorted key/row+KeyValue pairs.
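
The read-everything-then-sort step can be sketched as below.  This is plain Java, not the
code in 48.patch: Cell is a hypothetical stand-in for KeyValue, and the ordering (column
ascending, then newest timestamp first) is a simplification of the real KeyValue comparator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

/** Sketch of a reducer that buffers one row's cells and emits them in sorted order. */
final class SortReducerSketch {
    /** Hypothetical stand-in for KeyValue: one cell belonging to the current row. */
    static final class Cell {
        final String column;
        final long timestamp;
        final String value;

        Cell(String column, long timestamp, String value) {
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Simplified ordering: column ascending, then newest timestamp first. */
    static int compareCells(Cell a, Cell b) {
        int byColumn = a.column.compareTo(b.column);
        if (byColumn != 0) {
            return byColumn;
        }
        return Long.compare(b.timestamp, a.timestamp);
    }

    /**
     * Drains the iterator into a sorted set and returns the cells in order.
     * A real reducer would emit each cell together with the row key.
     * Cells with the same column and timestamp collapse into one entry.
     */
    static List<Cell> sortCells(Iterator<Cell> values) {
        TreeSet<Cell> sorted = new TreeSet<>(SortReducerSketch::compareCells);
        while (values.hasNext()) {
            sorted.add(values.next());
        }
        return new ArrayList<>(sorted);
    }
}
```

Note that this approach holds all of a row's cells in memory at once, so it assumes no
single row is too large to buffer.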

HFileOutputFormat takes ImmutableBytesWritable and KeyValue.  On setup, it reads configuration
such as the blocksize and compression to use.  It then writes HFiles smaller than hbase.hregion.max.filesize.
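
One way to keep each output file under a size limit is to roll to a new file before an
append would reach the limit.  The sketch below shows that idea in plain Java with in-memory
"files"; the class name and the exact roll rule are assumptions, and the patch itself may
decide when to roll differently (for example, only on row boundaries).

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a writer that starts a new file before the current one reaches a size limit. */
final class RollingWriterSketch {
    private final long maxFileSize; // plays the role of hbase.hregion.max.filesize
    private final List<List<byte[]>> files = new ArrayList<>();
    private long currentSize = 0;

    RollingWriterSketch(long maxFileSize) {
        this.maxFileSize = maxFileSize;
    }

    /** Appends a record, rolling first if the current file would reach the limit. */
    void append(byte[] record) {
        boolean wouldReachLimit = currentSize + record.length >= maxFileSize;
        if (files.isEmpty() || wouldReachLimit) {
            files.add(new ArrayList<>()); // start a new "file"
            currentSize = 0;
        }
        files.get(files.size() - 1).add(record);
        currentSize += record.length;
    }

    /** Number of files started so far. */
    int fileCount() {
        return files.size();
    }

    /** Number of records written to the file at the given index. */
    int recordCount(int fileIndex) {
        return files.get(fileIndex).size();
    }
}
```

A single record at or above the limit still gets a file of its own in this sketch, so the
bound only holds when individual records are smaller than the limit.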

Next I'll work on a script that takes an HTableDescriptor and some other parameters and
then puts the output of this MR job into the proper layout in HDFS, with an hfile per region,
making the proper insertions into catalog tables.

> [hbase] Bulk load and dump tools
> --------------------------------
>                 Key: HBASE-48
>                 URL: https://issues.apache.org/jira/browse/HBASE-48
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>            Priority: Minor
>         Attachments: 48.patch
> HBase needs tools to facilitate bulk upload and possibly dumping.  Going via the current
> APIs, particularly if the dataset is large and cell content is small, uploads can take a long
> time even when using many concurrent clients.
> PNUTS folks talked of the need for a different API to manage bulk upload/dump.
> Another notion would be to have the bulk loader tools write regions directly
> in HDFS.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
