crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: HBase & Crunch: multiple scans for a single PTable
Date Mon, 08 Apr 2013 20:56:48 GMT
Like, would Crunch support 0.94.5? I'm not really sure: our HBase
dependencies are pretty minimal, which makes me think that a
MultiTableInputFormat Source would be easy to write, but HBase has a
tendency to change out from underneath us in ways that I have a hard time
diagnosing w/o help from folks who know it better than I do.
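
For what it's worth, any source that takes several scans would first have to
tidy up the list of row-key ranges it is handed. Below is a minimal sketch of
that bookkeeping only, in plain Java with no HBase or Crunch calls. The names
ScanRanges, RowRange and mergeRanges are hypothetical and exist in neither
project. Strings stand in for byte[] row keys, so the ordering only
approximates HBase's byte-wise comparison, and an empty stop row ("end of
table") is not handled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ScanRanges {

    /** Half-open range [start, stop), like a Scan's start row and stop row. */
    public static final class RowRange {
        public final String start;
        public final String stop;

        public RowRange(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + stop + ")";
        }
    }

    /**
     * Sorts the ranges by start key and merges any that overlap or touch,
     * so each remaining range could back one independent scan.
     */
    public static List<RowRange> mergeRanges(List<RowRange> ranges) {
        List<RowRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparing((RowRange r) -> r.start));

        List<RowRange> merged = new ArrayList<>();
        for (RowRange r : sorted) {
            if (!merged.isEmpty()) {
                RowRange last = merged.get(merged.size() - 1);
                // Overlapping or adjacent: extend the previous range.
                if (r.start.compareTo(last.stop) <= 0) {
                    String stop = r.stop.compareTo(last.stop) > 0 ? r.stop : last.stop;
                    merged.set(merged.size() - 1, new RowRange(last.start, stop));
                    continue;
                }
            }
            merged.add(r);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<RowRange> wanted = new ArrayList<>();
        wanted.add(new RowRange("m", "p"));
        wanted.add(new RowRange("a", "c"));
        wanted.add(new RowRange("b", "d"));
        // Prints the disjoint ranges left after merging.
        System.out.println(mergeRanges(wanted));
    }
}
```

Merging up front keeps two scans from reading the same rows twice. How the
resulting ranges get turned into Scans and input splits is the part that
depends on the HBase version.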


On Mon, Apr 8, 2013 at 1:52 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:

> What's the minimum version of HBase that Crunch will support?  We
> have the exact same need, but because of the fix for HBASE-3996 and its
> requirement for region server changes, it wasn't as easy to patch back to
> 0.92 or 0.94.2 (CDH 4.2).
>
>
>
> On Mon, Apr 8, 2013 at 3:47 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Maybe we need something based on this?
>>
>> https://issues.apache.org/jira/browse/HBASE-3996
>>
>>
>> On Mon, Apr 8, 2013 at 1:41 PM, Chad Urso McDaniel <chadum@gmail.com> wrote:
>>
>>> This may be a core hadoop question.
>>>
>>> We are using Crunch with HBase.
>>> We typically set up the input PTable like so:
>>> ---
>>>       Scan scan = ...
>>>       HBaseSourceTarget source = new HBaseSourceTarget(tableName, scan);
>>>       PTable<ImmutableBytesWritable, Result> data =
>>>           pipeline.read(source);
>>> ---
>>>
>>> To speed up processing with Crunch, we would like to read multiple
>>> Scans into one PTable.
>>>
>>> We know which sections of the HBase table we want and they are not
>>> contiguous.
>>>
>>> We have tried unioning the PTables but that turns out to be incredibly
>>> slow.
>>> Currently we are using a filter that results in many unnecessary reads.
>>>
>>> How do others solve this?
>>>
>>> I'm tempted to write a TableSource that can do this.
>>>
>>> thanks
>>>
>>
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
