crunch-user mailing list archives

From Micah Whitacre <mkwhita...@gmail.com>
Subject Re: HBase & Crunch: multiple scans for a single PTable
Date Mon, 08 Apr 2013 21:09:34 GMT
We have a hack of a MultiScanTableInputFormat based off of one of the
earlier patches.  It is nice because it gives us the functionality we
wanted, but it does have issues, such as not honoring filters per scan
object, a limit on the number of scans that can be serialized, and some
overhead cost kicking off the multiple scans.

Based on that, we actually took the approach of trying to get HBASE-3996
resolved so Crunch could have a first-class Source that uses the new
input format.  Of course, that depends on you coding against that API
and on us being able to upgrade to 0.94.5.  So I was asking from a "when
would this fit onto Crunch's roadmap?" perspective.

We actually found that a custom filter with good hints for jumping sections
can be as performant as our forked custom MultiScanTableInputFormat.
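
As a rough illustration of what "hints for jumping sections" could look
like, here is a minimal sketch of the range logic only, with no HBase
dependency. The class and method names (RowRangeHints, Range, contains,
seekHint) are hypothetical and not taken from this thread. It assumes the
wanted ranges are half-open, non-overlapping, and compared as unsigned
bytes. In a real filter this logic would presumably back the
SEEK_NEXT_USING_HINT return code and getNextKeyHint, but that wiring is
not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: tells a filter whether to keep a row or where to jump next. */
final class RowRangeHints {

    /** Half-open row range [start, stop). Assumed not to overlap other ranges. */
    static final class Range {
        final byte[] start;
        final byte[] stop;

        Range(byte[] start, byte[] stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    private final List<Range> ranges;

    RowRangeHints(List<Range> input) {
        // Copy and sort by start key so seekHint can walk the ranges in order.
        this.ranges = new ArrayList<>(input);
        this.ranges.sort((a, b) -> compare(a.start, b.start));
    }

    /** Lexicographic comparison that treats each byte as unsigned. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** True if the row falls inside one of the wanted ranges. */
    boolean contains(byte[] row) {
        for (Range r : ranges) {
            if (compare(row, r.start) >= 0 && compare(row, r.stop) < 0) {
                return true;
            }
        }
        return false;
    }

    /**
     * For a row that is not in any range: the start key of the next wanted
     * range after it, or null if no wanted range remains.
     */
    byte[] seekHint(byte[] row) {
        for (Range r : ranges) {
            if (compare(r.start, row) > 0) {
                return r.start;
            }
        }
        return null;
    }
}
```

The linear walk over the ranges keeps the sketch short; with many ranges
a binary search over the sorted start keys would be the obvious next
step.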


On Mon, Apr 8, 2013 at 3:56 PM, Josh Wills <jwills@cloudera.com> wrote:

> Like, would Crunch support 0.94.5? I'm not really sure: our HBase
> dependencies are pretty minimal, which makes me think that creating a
> MultiTableInputFormat Source would be easy to write, but HBase has a
> tendency to change out from underneath us in ways that I have a hard time
> diagnosing w/o help from folks who know it better than I do.
>
>
> On Mon, Apr 8, 2013 at 1:52 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:
>
>> What's the minimum version of HBase that Crunch will support?  We
>> have the exact same need, but because the fix for HBASE-3996 requires
>> region server changes, it wasn't as easy to patch back to
>> 0.92 or 0.94.2 (CDH 4.2).
>>
>>
>>
>> On Mon, Apr 8, 2013 at 3:47 PM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Maybe we need something based on this?
>>>
>>> https://issues.apache.org/jira/browse/HBASE-3996
>>>
>>>
>>> On Mon, Apr 8, 2013 at 1:41 PM, Chad Urso McDaniel <chadum@gmail.com> wrote:
>>>
>>>> This may be a core Hadoop question.
>>>>
>>>> We are using Crunch with HBase.
>>>> We typically set up the input PTable like so:
>>>> ---
>>>>       Scan scan = ...
>>>>       HBaseSourceTarget source = new HBaseSourceTarget(tableName, scan);
>>>>       PTable<ImmutableBytesWritable, Result> data = pipeline.read(source);
>>>> ---
>>>>
>>>> To speed up processing with Crunch, we want to read multiple Scans
>>>> into one PTable.
>>>>
>>>> We know which sections of the HBase table we want and they are not
>>>> contiguous.
>>>>
>>>> We have tried unioning the PTables but that turns out to be incredibly
>>>> slow.
>>>> Currently we are using a filter that results in many unnecessary reads.
>>>>
>>>> How do others solve this?
>>>>
>>>> I'm tempted to write a TableSource that can do this.
>>>>
>>>> thanks
>>>>
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
