hbase-dev mailing list archives

From "Bryan Duxbury (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-48) [hbase] Bulk load and dump tools
Date Wed, 06 Feb 2008 23:35:07 GMT

    [ https://issues.apache.org/jira/browse/HBASE-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566397#action_12566397 ]

Bryan Duxbury commented on HBASE-48:
------------------------------------

In theory, writing directly to HDFS would be the fastest way to import data. However, the
tricky part in my mind is that the partitions don't just need to be sorted internally, they
also need to be sorted relative to each other. This means the partitioning function you use
has to preserve the lexical ordering of keys across partitions as well. Without knowing what
the data looks like ahead of time, how can you efficiently partition the data into regions?
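
One way to attack that question, as a sketch only: sample keys from the input up front and
pick evenly spaced split points from the sorted sample. The class and method names below are
hypothetical, and it assumes plain lexical comparison of the keys matches how regions order
them:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SplitPointSampler {
      /**
       * Pick (numRegions - 1) split keys from a sample of row keys so each
       * resulting key range should receive roughly the same number of rows.
       * Assumes String.compareTo matches the lexical order regions use.
       */
      public static List<String> pickSplits(List<String> sampledKeys, int numRegions) {
        Collections.sort(sampledKeys);
        List<String> splits = new ArrayList<String>();
        for (int i = 1; i < numRegions; i++) {
          // index of the i-th quantile boundary in the sorted sample
          int idx = (int) ((long) i * sampledKeys.size() / numRegions);
          splits.add(sampledKeys.get(idx));
        }
        return splits;
      }
    }

Of course, that presumes you can make a cheap pass over (a sample of) the data before the
import starts, which is exactly the "knowing what the data looks like ahead of time" problem.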

This also doesn't account for importing a lot of data into a new table. In that case, it'd
be quite futile to write tons of data into the existing regions' range, because that would
just cause the existing regions to become enormous, and then all you're really doing is
putting off the speed hit until the split/compact stage.
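
For the new-table case in particular, one sketch of a workaround: create the table already
split at the sampled boundaries, so the load fans out across several regions from the start.
The createTable overload taking split keys below is an assumption about what an admin API
could look like, not something we have today; the package layout, table, and family names
are likewise made up, so treat the whole thing as pseudocode for the idea:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitCreate {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("bulk_target");
        desc.addFamily(new HColumnDescriptor("content"));

        // Region boundaries taken from the sampled split points above, so the
        // import writes into several regions from the start instead of piling
        // everything into one region and deferring the cost to split/compact.
        byte[][] splitKeys = new byte[][] {
          Bytes.toBytes("f"), Bytes.toBytes("m"), Bytes.toBytes("t")
        };
        admin.createTable(desc, splitKeys);  // hypothetical pre-split create
      }
    }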

What is it that actually holds back the speed of imports? The API mechanics and nothing else?
The number of region servers participating in the import? The speed of the underlying disk?
Do we even have a sense of what would be a good speed for bulk imports in the first place?
I think this issue needs better definition before we can say what we should do.
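
To at least put a number on the API-mechanics question, here's a rough throughput probe,
again only a sketch: it assumes a client with buffered writes (the setAutoFlush /
setWriteBufferSize knobs and the Put-style calls below are assumptions, not the API this
thread is working against), and the table, family, and buffer size are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedImportProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "bulk_target");
        table.setAutoFlush(false);                   // queue puts client-side
        table.setWriteBufferSize(12 * 1024 * 1024);  // flush in ~12MB batches

        long start = System.currentTimeMillis();
        int rows = 1000000;
        for (int i = 0; i < rows; i++) {
          Put p = new Put(Bytes.toBytes(String.format("row%010d", i)));
          p.add(Bytes.toBytes("content"), Bytes.toBytes("c"), Bytes.toBytes("v" + i));
          table.put(p);                              // buffered, not one RPC per row
        }
        table.flushCommits();                        // push whatever is still buffered
        table.close();

        long ms = System.currentTimeMillis() - start;
        System.out.println(rows + " rows in " + ms + " ms ("
            + (rows * 1000L / Math.max(ms, 1)) + " rows/sec)");
      }
    }

Comparing a run like that against raw HDFS write throughput on the same boxes would tell us
how much of the gap is API mechanics versus the underlying disk.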

> [hbase] Bulk load and dump tools
> --------------------------------
>
>                 Key: HBASE-48
>                 URL: https://issues.apache.org/jira/browse/HBASE-48
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>            Priority: Minor
>
> Hbase needs tools to facilitate bulk upload and possibly dumping. Going via the current
> APIs, uploads can take a long time even when using many concurrent clients, particularly
> if the dataset is large and cell content is small.
> PNUTS folks talked of the need for a different API to manage bulk upload/dump.
> Another notion would be to somehow have the bulk loader tools write regions directly
> in hdfs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

