incubator-accumulo-user mailing list archives

From Adam Fuchs <adam.p.fu...@ugov.gov>
Subject Re: Pig Tuples and Scanner
Date Mon, 12 Dec 2011 14:10:26 GMT
Trevor,

I think there are a few different ways you could implement a LoadFunc on
top of Accumulo. The most basic and universal option might be to use a
single entry (Key/Value pair) as a Pig tuple. This is easier to code, but
it might not correspond to your objects if you split your objects out into
multiple columns, with one row per object. You might be able to use a Pig
operator to group your data after using this type of LoadFunc, or you might
want to create a more customized LoadFunc that understands more about how
you organize your data in Accumulo.

The second option that I've seen is the one that you describe -- namely
iterating over the columns in each row to produce a row tuple. The way to
do this is really just to loop over the elements returned by a Scanner and
pinch off tuples when you see a new row ID. If you use the same splitting
strategy that AccumuloInputFormat uses (i.e. pick split points from
tablets) then rows will never be split across multiple splits. A single
Scanner per RecordReader should work great for you, and I've seen other
people implement LoadFunc successfully in that way.
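
A rough sketch of that loop follows, in plain Java so it runs without a
cluster. The assumptions: each entry is a String[] of {row, column, value}
standing in for an Accumulo Key/Value pair, the entries arrive sorted by
row as a Scanner would return them, and RowGrouper is a made-up name
rather than anything in Accumulo or Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper: groups row-sorted entries into one tuple per row.
// Each entry is {row, column, value}, a stand-in for an Accumulo Key/Value.
class RowGrouper implements Iterator<List<String[]>> {
    private final Iterator<String[]> entries;
    private String[] pending; // first entry of the next row, read ahead

    public RowGrouper(Iterator<String[]> sortedEntries) {
        this.entries = sortedEntries;
        this.pending = sortedEntries.hasNext() ? sortedEntries.next() : null;
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public List<String[]> next() {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        String row = pending[0];
        List<String[]> tuple = new ArrayList<>();
        tuple.add(pending);
        pending = null;
        while (entries.hasNext()) {
            String[] e = entries.next();
            if (!e[0].equals(row)) {
                pending = e; // new row ID seen: pinch off the current tuple
                break;
            }
            tuple.add(e);
        }
        return tuple;
    }

    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
            new String[] {"r1", "name", "alice"},
            new String[] {"r1", "age", "30"},
            new String[] {"r2", "name", "bob"});
        RowGrouper grouper = new RowGrouper(data.iterator());
        while (grouper.hasNext()) {
            List<String[]> tuple = grouper.next();
            System.out.println(tuple.get(0)[0] + ": " + tuple.size() + " column(s)");
        }
    }
}
```

In a real LoadFunc the wrapped iterator would presumably be the Scanner
held by the RecordReader, and each list would be converted into a Pig
tuple before being returned.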

One thing to watch out for is that a single row in Accumulo could
potentially have hundreds of millions or even billions of columns. Using a
row-based tuple for your LoadFunc could result in your application running
out of memory if you try to process any arbitrary table. We commonly see
this when looking at graph structures where edges are represented by
columns. Zipfian distributions can make for some very big rows. This just
means you have to be a bit careful about what you try to pull into a tuple.
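
One possible way to stay careful, offered only as a sketch, is to cap the
number of columns per tuple, so that a very wide row comes out as several
tuples sharing the same row ID. The cap, the name BoundedRowGrouper, and
the String[] {row, column, value} entries are all assumptions made for
illustration. None of them come from Accumulo or Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper: like a per-row grouper, but never holds more than
// maxColumns entries in memory for a single tuple.
class BoundedRowGrouper implements Iterator<List<String[]>> {
    private final Iterator<String[]> entries; // {row, column, value}, sorted by row
    private final int maxColumns;
    private String[] pending; // next unread entry, read ahead

    public BoundedRowGrouper(Iterator<String[]> sortedEntries, int maxColumns) {
        if (maxColumns < 1) {
            throw new IllegalArgumentException("maxColumns must be >= 1");
        }
        this.entries = sortedEntries;
        this.maxColumns = maxColumns;
        this.pending = sortedEntries.hasNext() ? sortedEntries.next() : null;
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public List<String[]> next() {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        String row = pending[0];
        List<String[]> tuple = new ArrayList<>();
        tuple.add(pending);
        pending = null;
        while (tuple.size() < maxColumns && entries.hasNext()) {
            String[] e = entries.next();
            if (!e[0].equals(row)) {
                pending = e; // row changed: emit what we have
                return tuple;
            }
            tuple.add(e);
        }
        // Tuple is full or input is exhausted; read ahead for the next call.
        if (entries.hasNext()) {
            pending = entries.next();
        }
        return tuple;
    }

    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
            new String[] {"hub", "edge1", ""},
            new String[] {"hub", "edge2", ""},
            new String[] {"hub", "edge3", ""},
            new String[] {"leaf", "edge1", ""});
        BoundedRowGrouper grouper = new BoundedRowGrouper(data.iterator(), 2);
        while (grouper.hasNext()) {
            List<String[]> tuple = grouper.next();
            System.out.println(tuple.get(0)[0] + ": " + tuple.size() + " column(s)");
        }
    }
}
```

The trade-off is that downstream Pig code can no longer assume one tuple
per row, so whether this fits depends on what the script does with them.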

Cheers,
Adam


On Fri, Dec 9, 2011 at 4:50 PM, Trevor Adams <trevoradams@gmail.com> wrote:

> So I am looking to create a LoadFunc for Accumulo, and am just wondering
> what would be the "correct" way to do this. Here is my current plan.
>
> Pig tuples are a set of columns for one given row in Accumulo; creating
> the tuples with the Scanner seems possibly a bit odd. Loop over the
> elements that it gives out (column value pairs) and fold/reduce on the
> rowid and create some intermediate element that is used in a pseudo
> InputFormat of <Row, ColVals> that can be used in the LoadFunc.
>
> Since I don't understand some of the stuff in Accumulo, there may be a
> better way to accomplish the above. If there is, great, otherwise I will
> begin on the above.
>
> -Trevor
>
