crunch-dev mailing list archives

From "Chao Shi (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-212) Need target wrapper for HFileOuptutFormat
Date Mon, 22 Jul 2013 02:34:49 GMT


Chao Shi commented on CRUNCH-212:

Hi Reid, I haven't thought this through thoroughly yet.

bq. - setting up the partitioning to match regions on an existing HBase table
I think we have to set up a TotalOrderPartitioner. The partition boundaries are determined
from a scan on ".META.".
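
As a rough illustration of the partitioning idea above, here is a minimal, dependency-free sketch of the lookup that a total-order partitioner performs: a binary search over sorted split points. It does not call any Hadoop or HBase API. The class name, the method names, and the sample split points "g" and "p" are assumptions made for illustration; in a real job the split points would be the region start keys read from ".META.".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: maps a row key to the index of the region that would hold it.
class RegionPartitionSketch {

    // Unsigned lexicographic comparison of two byte arrays,
    // which is the ordering HBase uses for row keys.
    static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // splitPoints holds the sorted start keys of every region except the first
    // (whose start key is empty). Returns a partition index in [0, splitPoints.length].
    // A row equal to a split point belongs to the region that starts at that key.
    static int partitionFor(byte[] row, byte[][] splitPoints) {
        int lo = 0;
        int hi = splitPoints.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareRows(splitPoints[mid], row) <= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed example: three regions with boundaries at "g" and "p".
        byte[][] splits = { bytes("g"), bytes("p") };
        for (String row : new String[] { "a", "g", "m", "z" }) {
            System.out.println(row + " -> partition " + partitionFor(bytes(row), splits));
        }
    }
}
```

With N regions there are N - 1 split points and N partitions, so each reducer would write HFiles covering exactly one region's key range.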

bq. - handling multiple column families
I think we can take a PCollection<KeyValue> as input from the user, then divide it into
multiple PCollection<KeyValue>s by column family, sort each family separately, and write them
to HFile targets. This requires the user to tell us explicitly which column families are
used, since Crunch cannot determine the number of outputs at runtime. This approach looks more
"crunch-style". :)
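
To make the per-family split concrete, here is a minimal sketch in plain Java collections rather than Crunch or HBase types. The Cell class is a hypothetical stand-in for KeyValue, and splitByFamily is an assumed helper name, not an existing API. The column families are declared up front, mirroring the requirement that the number of outputs be known before the job runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one sorted bucket per declared column family.
class FamilySplitSketch {

    // Simplified stand-in for an HBase KeyValue (no timestamp, type, or value).
    static final class Cell {
        final String row;
        final String family;
        final String qualifier;

        Cell(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    // Splits cells into one list per declared family and sorts each list
    // by row, then qualifier. Strings are compared in natural String order here;
    // a real implementation would compare raw bytes and also order by timestamp.
    static Map<String, List<Cell>> splitByFamily(List<Cell> cells, List<String> families) {
        Map<String, List<Cell>> out = new LinkedHashMap<>();
        for (String family : families) {
            out.put(family, new ArrayList<>());
        }
        for (Cell cell : cells) {
            List<Cell> bucket = out.get(cell.family);
            if (bucket == null) {
                // The set of outputs is fixed up front, so unknown families are rejected.
                throw new IllegalArgumentException("Undeclared column family: " + cell.family);
            }
            bucket.add(cell);
        }
        Comparator<Cell> order =
            Comparator.comparing((Cell c) -> c.row).thenComparing(c -> c.qualifier);
        for (List<Cell> bucket : out.values()) {
            bucket.sort(order);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
            new Cell("r2", "cf1", "a"),
            new Cell("r1", "cf2", "b"),
            new Cell("r1", "cf1", "a"));
        Map<String, List<Cell>> byFamily = splitByFamily(cells, Arrays.asList("cf1", "cf2"));
        for (Map.Entry<String, List<Cell>> e : byFamily.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " cell(s)");
        }
    }
}
```

In Crunch terms, each bucket would correspond to one filtered PCollection<KeyValue> written to its own HFile target.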

Any suggestions are welcome.
> Need target wrapper for HFileOuptutFormat
> -----------------------------------------
>                 Key: CRUNCH-212
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: IO
>            Reporter: Chao Shi
>         Attachments: crunch-212-draft.patch
> I need to import data into HBase from MR. I found HFileOutputFormat is ~5x more efficient
than HTableOutputFormat, so maybe we need a target wrapper for it.
> Furthermore, is it possible to have HBase load the HFiles automatically after they are generated?

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
