hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-48) [hbase] Bulk load and dump tools
Date Sat, 26 Jul 2008 22:20:31 GMT

    [ https://issues.apache.org/jira/browse/HBASE-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617222#action_12617222 ]

stack commented on HBASE-48:

Thinking more on this issue, in particular on Billy's suggestion above ('Billy Pearson - 06/Feb/08
01:07 PM'), bulk uploading by writing store files directly isn't hard:

For a new table (as per Bryan above), it's particularly easy.  Do something like:

+ Create the table in hbase.
+ Mark it read-only, or even disabled.
+ Start a mapreduce job.  During configuration it would go to the master to read the table description.
+ The map reads the input using whatever formatter fits and outputs HStoreKey
for key and cell content for value.
+ The job would use a fancy new TableFileReduce.  Each reducer would write a region.  It would
know its start and end keys -- they'd be the first and last it sees.  These could be output
somewhere so a tail task could find them.  The file outputter would also need to generate
sequence ids of some form.
+ When the job is done, the tail task would insert the regions into meta using MetaUtils.
+ Enable the table.
+ If the regions are lop-sided, hbase will do the fixup.
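The per-reducer bookkeeping in the steps above could be sketched roughly like this.  This is a hedged illustration only: the class and method names (RegionFileSketch, write, boundaries) are made up for the sketch, not the real HBase or Hadoop API, and the actual store-file writing is elided.

```java
// Illustrative sketch: a reducer-side writer that records the first and
// last row keys it sees, so a tail task can later insert the region into
// the meta table.  Names are hypothetical, not the real HBase API.
class RegionFileSketch {
    private byte[] startKey;   // first key this reducer saw
    private byte[] endKey;     // last key seen so far
    private long entries;

    void write(byte[] rowKey, byte[] cell) {
        if (startKey == null) startKey = rowKey;   // keys arrive sorted
        endKey = rowKey;                           // so last seen == end key
        entries++;
        // ... append (rowKey, cell) to the region's store file on HDFS ...
    }

    // What the tail task would need to register the region in meta.
    String boundaries() {
        return new String(startKey) + ".." + new String(endKey)
            + " (" + entries + " entries)";
    }

    public static void main(String[] args) {
        RegionFileSketch r = new RegionFileSketch();
        for (String k : new String[] {"row-001", "row-050", "row-099"}) {
            r.write(k.getBytes(), ("cell-for-" + k).getBytes());
        }
        System.out.println(r.boundaries());
    }
}
```

Because map output is sorted before it reaches the reducer, tracking the first and last key costs nothing extra; the only real addition is writing them somewhere the tail task can find.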

If the table already exists:

+ Mark the table read-only (ensure this prevents splits and that it flushes the memcache).
+ Start a mapreduce job that reads from the master the table schema and its regions (and the
master's current time so we don't write older records).
+ Map as above.
+ Reduce as above, only with a smarter partitioner, one that respects region boundaries and
makes one reducer per current region.
+ Enable hbase and let it fix up where the written storefiles are too big, by splitting etc.
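The "smarter partitioner" amounts to routing each map output key to the reducer that owns the region containing it: with the regions' start keys read from the master and sorted, it's a binary search for the last start key <= the row key.  A rough sketch, with hypothetical names (RegionPartitionerSketch is not a real Hadoop/HBase class):

```java
// Illustrative sketch of a region-boundary-aware partitioner: one reducer
// per current region; a key goes to the region whose range contains it.
class RegionPartitionerSketch {
    private final byte[][] startKeys;  // sorted region start keys

    RegionPartitionerSketch(byte[][] sortedStartKeys) {
        this.startKeys = sortedStartKeys;
    }

    // Partition = index of the last region whose start key is <= rowKey.
    int getPartition(byte[] rowKey) {
        int lo = 0, hi = startKeys.length - 1, ans = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(startKeys[mid], rowKey) <= 0) { ans = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return ans;
    }

    // Unsigned lexicographic byte comparison, as row keys are ordered.
    private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Three regions: [-inf,"m"), ["m","t"), ["t",+inf)
        byte[][] regions = { "".getBytes(), "m".getBytes(), "t".getBytes() };
        RegionPartitionerSketch p = new RegionPartitionerSketch(regions);
        System.out.println(p.getPartition("apple".getBytes())); // 0
        System.out.println(p.getPartition("pear".getBytes()));  // 1
        System.out.println(p.getPartition("zebra".getBytes())); // 2
    }
}
```

Since each reducer then receives only keys inside one existing region's range, its output file can be dropped into that region's store directly, and any over-large files get split by hbase after the table is re-enabled.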

It doesn't seem hard at all to do.

> [hbase] Bulk load and dump tools
> --------------------------------
>                 Key: HBASE-48
>                 URL: https://issues.apache.org/jira/browse/HBASE-48
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>            Priority: Minor
> Hbase needs tools to facilitate bulk upload and possibly dumping.  Going via the current
> APIs, particularly if the dataset is large and cell content is small, uploads can take a long
> time even when using many concurrent clients.
> PNUTS folks talked of need for a different API to manage bulk upload/dump.
> Another notion would be to somehow have the bulk loader tools write regions directly
> in hdfs.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
