crunch-user mailing list archives

From Micah Whitacre <mkwhita...@gmail.com>
Subject Re: HBase & Crunch: multiple scans for a single PTable
Date Mon, 08 Apr 2013 20:52:09 GMT
What's the minimum version of HBase that Crunch will support?  We have
the exact same need, but because the fix for HBASE-3996 requires region
server changes, it wasn't easy to patch back to 0.92 or 0.94.2
(CDH 4.2).



On Mon, Apr 8, 2013 at 3:47 PM, Josh Wills <jwills@cloudera.com> wrote:

> Maybe we need something based on this?
>
> https://issues.apache.org/jira/browse/HBASE-3996
>
>
> On Mon, Apr 8, 2013 at 1:41 PM, Chad Urso McDaniel <chadum@gmail.com> wrote:
>
>> This may be a core Hadoop question.
>>
>> We are using Crunch with HBase.
>> We typically set up the input PTable like so:
>> ---
>>       Scan scan = ...
>>       HBaseSourceTarget source = new HBaseSourceTarget(tableName, scan);
>>       PTable<ImmutableBytesWritable, Result> data = pipeline.read(source);
>> ---
>>
>> To speed up processing with Crunch, we would like to feed multiple
>> Scans into one PTable.
>>
>> We know which sections of the HBase table we want and they are not
>> contiguous.
>>
>> We have tried unioning the PTables but that turns out to be incredibly
>> slow.
>> Currently we are using a filter that results in many unnecessary reads.
>>
>> How do others solve this?
>>
>> I'm tempted to write a TableSource that can do this.
>>
>> thanks
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
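
Whatever form a multi-scan source ends up taking, it first has to turn the
non-contiguous sections of the table into a minimal set of start/stop row
pairs, one per Scan. Below is a minimal sketch of that bookkeeping only. It
is plain Java and does not use the Crunch or HBase APIs; RowRanges, Range,
and coalesce are hypothetical names made up for illustration. It also assumes
ASCII row keys held as Strings, whereas HBase compares row keys as unsigned
bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Crunch or HBase.
public class RowRanges {

    /** Half-open row-key range [start, stop), like a Scan's start and stop rows. */
    public static final class Range {
        public final String start;
        public final String stop;

        public Range(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + stop + ")";
        }
    }

    /** Sorts ranges by start key and merges any that overlap or touch. */
    public static List<Range> coalesce(List<Range> ranges) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparing((Range r) -> r.start));

        List<Range> merged = new ArrayList<>();
        for (Range r : sorted) {
            if (!merged.isEmpty()) {
                Range last = merged.get(merged.size() - 1);
                // Assumption: String order stands in for HBase's byte order.
                if (r.start.compareTo(last.stop) <= 0) {
                    String stop = r.stop.compareTo(last.stop) > 0 ? r.stop : last.stop;
                    merged.set(merged.size() - 1, new Range(last.start, stop));
                    continue;
                }
            }
            merged.add(r);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Range> wanted = Arrays.asList(
            new Range("row-300", "row-400"),
            new Range("row-100", "row-150"),
            new Range("row-140", "row-200"));
        // Each surviving range would become one Scan in a multi-scan source.
        System.out.println(coalesce(wanted));
    }
}
```

Merging up front keeps the number of Scans small and avoids reading
overlapping rows twice. How those Scans are then fed into a single PTable is
the open question in this thread.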
