crunch-user mailing list archives

From Chad Urso McDaniel <cha...@gmail.com>
Subject HBase & Crunch: multiple scans for a single PTable
Date Mon, 08 Apr 2013 20:41:11 GMT
This may be a core Hadoop question.

We are using Crunch with HBase.
We typically set up the input PTable like so:
---
      Scan scan = ...
      HBaseSourceTarget source = new HBaseSourceTarget(tableName, scan);
      PTable<ImmutableBytesWritable, Result> data = pipeline.read(source);
---

To speed up processing with Crunch, we would like to read multiple Scans
into a single PTable.

We know which sections of the HBase table we want and they are not
contiguous.

We have tried unioning the PTables but that turns out to be incredibly slow.
Currently we are using a filter that results in many unnecessary reads.

How do others solve this?

I'm tempted to write a TableSource that can do this.
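
A rough sketch of the bookkeeping such a source might need, before any Scan is built: take the wanted row-key sections, sort them, and merge any that overlap or touch, so each remaining range could back exactly one Scan. This is plain Java and not Crunch or HBase API; RowRanges, Range, merge and range are hypothetical names used only for illustration. It assumes half-open [start, stop) ranges with non-empty bounds, compared as unsigned bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Crunch or HBase.
public class RowRanges {

    /** A half-open row-key range [start, stop). */
    public static final class Range {
        public final byte[] start;
        public final byte[] stop;

        public Range(byte[] start, byte[] stop) {
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[" + new String(start, StandardCharsets.UTF_8) + ", "
                    + new String(stop, StandardCharsets.UTF_8) + ")";
        }
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** Returns sorted, disjoint ranges covering the same keys as the input. */
    public static List<Range> merge(List<Range> ranges) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort((x, y) -> compare(x.start, y.start));

        List<Range> out = new ArrayList<>();
        for (Range r : sorted) {
            if (!out.isEmpty()) {
                Range last = out.get(out.size() - 1);
                // Overlapping or adjacent: extend the previous range instead of adding a new one.
                if (compare(r.start, last.stop) <= 0) {
                    byte[] stop = compare(r.stop, last.stop) > 0 ? r.stop : last.stop;
                    out.set(out.size() - 1, new Range(last.start, stop));
                    continue;
                }
            }
            out.add(r);
        }
        return out;
    }

    /** Convenience factory for string row keys (illustration only). */
    public static Range range(String start, String stop) {
        return new Range(start.getBytes(StandardCharsets.UTF_8),
                stop.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        List<Range> wanted = new ArrayList<>();
        wanted.add(range("m", "p"));
        wanted.add(range("a", "c"));
        wanted.add(range("b", "d"));
        // Each merged range would map to one Scan (start row, stop row).
        for (Range r : merge(wanted)) {
            System.out.println(r);
        }
    }
}
```

Merging first would keep the number of scans at the minimum needed to cover the wanted sections; how those scans are then turned into input splits is the part a custom source would still have to handle.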

thanks
