incubator-accumulo-user mailing list archives

From Trevor Adams <trevorad...@gmail.com>
Subject Re: Pig Tuples and Scanner
Date Mon, 12 Dec 2011 17:09:40 GMT
Adam,

I can see how a large number of columns could be a problem, but I think
this would be an issue for HBase as well, even though it is a different
project. Prior to the most recent version of Pig (I believe that is when
it was added), you had to specify exact column names (cf:qual pairs); only
recently did they add support for grabbing an entire column family from a
row. I will explore my options a bit and see what happens. I will probably
start with the specific loader and then try to generalize from there.
Thanks

-Trevor

On Mon, Dec 12, 2011 at 9:10 AM, Adam Fuchs <adam.p.fuchs@ugov.gov> wrote:

> Trevor,
>
> I think there are a few different ways you could implement a LoadFunc on
> top of Accumulo. The most basic and universal option might be to use a
> single entry (Key/Value pair) as a Pig tuple. This is easier to code, but
> it might not correspond to your objects if you split your objects out into
> multiple columns, with one row per object. You might be able to use a Pig
> operator to group your data after using this type of LoadFunc, or you might
> want to create a more customized LoadFunc that understands more about how
> you organize your data in Accumulo.
>
> The second option that I've seen is the one that you describe -- namely
> iterating over the columns in each row to produce a row tuple. The way to
> do this is really just to loop over the elements returned by a Scanner and
> pinch off tuples when you see a new row ID. If you use the same splitting
> strategy that AccumuloInputFormat uses (i.e. pick split points from
> tablets) then rows will never be split across multiple splits. A single
> Scanner per RecordReader should work great for you, and I've seen other
> people implement LoadFunc successfully in that way.
>
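
The "pinch off a tuple when the row ID changes" loop described above can be
sketched as follows. This is only an illustration under stated assumptions:
entries are modeled as plain (row_id, column, value) tuples arriving in
row-sorted order, and rows_from_entries is a hypothetical helper name, not
part of Accumulo or Pig. A real LoadFunc would be written in Java against
the entries returned by a Scanner.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for one Accumulo entry: (row_id, column, value).
Entry = Tuple[str, str, str]
RowTuple = Tuple[str, List[Tuple[str, str]]]


def rows_from_entries(entries: Iterable[Entry]) -> Iterator[RowTuple]:
    """Group a row-sorted stream of entries into one tuple per row."""
    current_row = None
    columns: List[Tuple[str, str]] = []
    for row_id, column, value in entries:
        # A new row ID means the previous row is complete: emit it.
        if current_row is not None and row_id != current_row:
            yield current_row, columns
            columns = []
        current_row = row_id
        columns.append((column, value))
    # Emit the final row, if the stream was not empty.
    if current_row is not None:
        yield current_row, columns
```

Because the input is consumed as a stream, only the row currently being
assembled is held in memory; this relies on all entries for a row being
adjacent in the input.
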
> One thing to watch out for is that a single row in Accumulo could
> potentially have hundreds of millions or even billions of columns. Using a
> row-based tuple for your LoadFunc could result in your application running
> out of memory if you try to process any arbitrary table. We commonly see
> this when looking at graph structures where edges are represented by
> columns. Zipfian distributions can make for some very big rows. This just
> means you have to be a bit careful about what you try to pull into a tuple.
>
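
One possible way to be "a bit careful" about very wide rows is to cap how
many columns go into a single tuple and emit several tuples for one row.
The sketch below is an assumption for illustration, not something proposed
in this thread: bounded_rows and max_columns are hypothetical names, and
entries are again modeled as row-sorted (row_id, column, value) tuples.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for one Accumulo entry: (row_id, column, value).
Entry = Tuple[str, str, str]
RowTuple = Tuple[str, List[Tuple[str, str]]]


def bounded_rows(entries: Iterable[Entry], max_columns: int) -> Iterator[RowTuple]:
    """Group row-sorted entries by row, splitting rows wider than max_columns."""
    if max_columns < 1:
        raise ValueError("max_columns must be at least 1")
    current_row = None
    columns: List[Tuple[str, str]] = []
    for row_id, column, value in entries:
        # Emit the buffer when the row changes or the cap is reached.
        if columns and (row_id != current_row or len(columns) >= max_columns):
            yield current_row, columns
            columns = []
        current_row = row_id
        columns.append((column, value))
    # Emit whatever is left in the buffer.
    if columns:
        yield current_row, columns
```

With this approach a row with billions of columns never needs more than
max_columns entries in memory at once, at the cost of downstream code
having to handle several tuples that share a row ID.
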
> Cheers,
> Adam
>
>
> On Fri, Dec 9, 2011 at 4:50 PM, Trevor Adams <trevoradams@gmail.com> wrote:
>
>> I am looking to create a LoadFunc for Accumulo and am wondering what the
>> "correct" way to do this would be. Here is my current plan.
>>
>> A Pig tuple would be the set of columns for one given row in Accumulo,
>> but creating the tuples with the Scanner seems a bit odd. I would loop
>> over the elements that it gives out (column/value pairs), fold/reduce on
>> the row ID, and create an intermediate element that is used in a pseudo
>> InputFormat of <Row, ColVals> that the LoadFunc can consume.
>>
>> Since I don't understand some of the stuff in Accumulo, there may be a
>> better way to accomplish the above. If there is, great, otherwise I will
>> begin on the above.
>>
>> -Trevor
>>
>
>
