Mailing-List: contact accumulo-user-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: accumulo-user@incubator.apache.org
Received-SPF: pass (nike.apache.org: domain of trevoradams@gmail.com
 designates 209.85.220.175 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAPMpPc6dUV1_cdpRQLOs30uj=62kbNcPZCVmOycOKQntd1QnHA@mail.gmail.com>
References: 
 <CADgy+JDEdSbrUZht4zu3V_OMRNW-9mo=pY0=yoBSADGoEfR46A@mail.gmail.com>
	<CAPMpPc6dUV1_cdpRQLOs30uj=62kbNcPZCVmOycOKQntd1QnHA@mail.gmail.com>
Date: Mon, 12 Dec 2011 12:09:40 -0500
Message-ID: 
 <CADgy+JBZ4hO7WqMZZs8-OYEaaCEoHrSnQ0PEYxwUqfm-C-nm6A@mail.gmail.com>
Subject: Re: Pig Tuples and Scanner
From: Trevor Adams <trevoradams@gmail.com>
To: accumulo-user@incubator.apache.org
Content-Type: multipart/alternative; boundary=bcaec547c9ed483ee204b3e833dd

--bcaec547c9ed483ee204b3e833dd
Content-Type: text/plain; charset=ISO-8859-1

Adam,

I can see how a large number of columns can be a problem, but I think it
would be an issue for HBase as well, even though that is a different
project. Prior to the most recent version of Pig (I believe this is when
it was added) you had to specify exact column names (cf:qual pairs); only
recently did they add support for grabbing an entire column family from a
row. I will explore my options a bit and see what happens. I will probably
start with the specific loader and then try to generalize from there.
Thanks

-Trevor
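
P.S. A rough sketch of the row-grouping loop discussed below, in case it
helps anyone searching the archives. It only assumes the entries arrive
sorted by row ID, as a Scanner returns them. Cell and RowTupleReader are
hypothetical stand-ins (not Accumulo or Pig classes), and maxColumns is an
assumed guard against the very wide rows Adam warns about.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for one Accumulo entry (Key/Value pair).
final class Cell {
    final String row;
    final String column;
    final String value;

    Cell(String row, String column, String value) {
        this.row = row;
        this.column = column;
        this.value = value;
    }
}

// Turns a row-sorted stream of cells into one list of cells per row.
final class RowTupleReader {
    private final Iterator<Cell> scan;
    private final int maxColumns;
    private Cell pending; // read-ahead: first cell not yet emitted

    RowTupleReader(Iterator<Cell> scan, int maxColumns) {
        this.scan = scan;
        this.maxColumns = maxColumns;
        this.pending = scan.hasNext() ? scan.next() : null;
    }

    // Returns the cells of the next row, or null when the scan is exhausted.
    List<Cell> nextTuple() {
        if (pending == null) {
            return null;
        }
        String row = pending.row;
        List<Cell> tuple = new ArrayList<>();
        // Collect cells until the row ID changes, then "pinch off" the tuple.
        while (pending != null && pending.row.equals(row)) {
            if (tuple.size() >= maxColumns) {
                // Fail fast rather than exhaust memory on a huge row.
                throw new IllegalStateException(
                    "row " + row + " exceeds " + maxColumns + " columns");
            }
            tuple.add(pending);
            pending = scan.hasNext() ? scan.next() : null;
        }
        return tuple;
    }
}
```

A LoadFunc could call nextTuple() once per getNext() and convert the list
into a Pig tuple. Only one row is held in memory at a time, and the cap
turns a runaway row into an explicit error instead of an OutOfMemoryError.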

On Mon, Dec 12, 2011 at 9:10 AM, Adam Fuchs <adam.p.fuchs@ugov.gov> wrote:

> Trevor,
>
> I think there are a few different ways you could implement a LoadFunc on
> top of Accumulo. The most basic and universal option might be to use a
> single entry (Key/Value pair) as a Pig tuple. This is easier to code, but
> it might not correspond to your objects if you split your objects out into
> multiple columns, with one row per object. You might be able to use a Pig
> operator to group your data after using this type of LoadFunc, or you might
> want to create a more customized LoadFunc that understands more about how
> you organize your data in Accumulo.
>
> The second option that I've seen is the one that you describe -- namely
> iterating over the columns in each row to produce a row tuple. The way to
> do this is really just to loop over the elements returned by a Scanner and
> pinch off tuples when you see a new row ID. If you use the same splitting
> strategy that AccumuloInputFormat uses (i.e. pick split points from
> tablets) then rows will never be split across multiple splits. A single
> Scanner per RecordReader should work great for you, and I've seen other
> people implement LoadFunc successfully in that way.
>
> One thing to watch out for is that a single row in Accumulo could
> potentially have hundreds of millions or even billions of columns. Using a
> row-based tuple for your LoadFunc could result in your application running
> out of memory if you try to process any arbitrary table. We commonly see
> this when looking at graph structures where edges are represented by
> columns. Zipfian distributions can make for some very big rows. This just
> means you have to be a bit careful about what you try to pull into a tuple.
>
> Cheers,
> Adam
>
>
> On Fri, Dec 9, 2011 at 4:50 PM, Trevor Adams <trevoradams@gmail.com> wrote:
>
>> I am looking to create a LoadFunc for Accumulo and am wondering what the
>> "correct" way to do this would be. Here is my current plan.
>>
>> Pig tuples are a set of columns for one given row in Accumulo; creating
>> the tuples with the Scanner seems possibly a bit odd. I would loop over
>> the elements that it gives out (column/value pairs), fold/reduce on the
>> row ID, and create some intermediate element that is used in a pseudo
>> InputFormat of <Row, ColVals> that can be used in the LoadFunc.
>>
>> Since I don't understand some of the stuff in Accumulo, there may be a
>> better way to accomplish the above. If there is, great, otherwise I will
>> begin on the above.
>>
>> -Trevor
>>
>
>

--bcaec547c9ed483ee204b3e833dd--