crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: HBase & Crunch: multiple scans for a single PTable
Date Mon, 08 Apr 2013 20:47:05 GMT
Maybe we need something based on this?

https://issues.apache.org/jira/browse/HBASE-3996


On Mon, Apr 8, 2013 at 1:41 PM, Chad Urso McDaniel <chadum@gmail.com> wrote:

> This may be a core Hadoop question.
>
> We are using Crunch with HBase.
> We typically set up the input PTable like so:
> ---
>       Scan scan = ...
>       HBaseSourceTarget source = new HBaseSourceTarget(tableName, scan);
>       PTable<ImmutableBytesWritable, Result> data = pipeline.read(source);
> ---
>
> To speed up processing with Crunch, we would like to read multiple Scans
> into one PTable.
>
> We know which sections of the HBase table we want and they are not
> contiguous.
>
> We have tried unioning the PTables but that turns out to be incredibly
> slow.
> Currently we are using a filter that results in many unnecessary reads.
>
> How do others solve this?
>
> I'm tempted to write a TableSource that can do this.
>
> thanks
>
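
Whichever way the scans end up being combined, it may help to normalize the list of wanted row key ranges first, so that each non-contiguous section becomes exactly one Scan. The following is a minimal sketch of that step only, using plain Java and no HBase or Crunch classes. RowRanges, Range, and coalesce are hypothetical names for illustration, not part of either library, and the sketch assumes every range has a non-empty start and stop key with start < stop.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Crunch or HBase. */
public class RowRanges {

    /** Half-open row key range [start, stop); the stop key is exclusive, as with a Scan's stop row. */
    public static final class Range {
        public final byte[] start;
        public final byte[] stop;

        public Range(byte[] start, byte[] stop) {
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[" + new String(start, StandardCharsets.UTF_8) + ", "
                    + new String(stop, StandardCharsets.UTF_8) + ")";
        }
    }

    /** Unsigned lexicographic byte comparison, the order in which HBase sorts row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Sorts the ranges by start key and merges any that overlap or touch. */
    public static List<Range> coalesce(List<Range> ranges) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort((x, y) -> compare(x.start, y.start));

        List<Range> merged = new ArrayList<>();
        for (Range r : sorted) {
            if (!merged.isEmpty()) {
                Range last = merged.get(merged.size() - 1);
                if (compare(r.start, last.stop) <= 0) {
                    // r begins inside or right at the end of the previous range:
                    // extend the previous range instead of adding a new one.
                    byte[] stop = compare(r.stop, last.stop) > 0 ? r.stop : last.stop;
                    merged.set(merged.size() - 1, new Range(last.start, stop));
                    continue;
                }
            }
            merged.add(r);
        }
        return merged;
    }

    private static Range range(String start, String stop) {
        return new Range(start.getBytes(StandardCharsets.UTF_8),
                stop.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        List<Range> wanted = new ArrayList<>();
        wanted.add(range("m", "p"));
        wanted.add(range("a", "c"));
        wanted.add(range("b", "e"));

        // Prints "[a, e)" and then "[m, p)": two scans instead of three.
        for (Range r : coalesce(wanted)) {
            System.out.println(r);
        }
    }
}
```

Each resulting range could then supply the start and stop rows of one Scan, built as in the quoted snippet. This only reduces the number of scans; it does not address how to feed several of them into a single PTable.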



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
