From: Trevor Adams
To: accumulo-user@incubator.apache.org
Date: Mon, 12 Dec 2011 12:09:40 -0500
Subject: Re: Pig Tuples and Scanner

Adam,

I can see how a large number of columns could be a problem, but I think this would be an issue for HBase as well, even though it is a different project. Prior to the most recent version of Pig (I believe that is when it was added), you had to specify exact column names (cf:qual pairs); only recently did they add support for grabbing an entire column family from a row. I will explore my options a bit and see what happens. I will probably start with the specific loader and then try to generalize from there.

Thanks

-Trevor

On Mon, Dec 12, 2011 at 9:10 AM, Adam Fuchs <adam.p.fuchs@ugov.gov> wrote:
Trevor,

I think there are a few different ways you could implement a LoadFunc on top of Accumulo. The most basic and universal option might be to use a single entry (Key/Value pair) as a Pig tuple. This is easier to code, but it might not correspond to your objects if you split your objects out into multiple columns, with one row per object. You might be able to use a Pig operator to group your data after using this type of LoadFunc, or you might want to create a more customized LoadFunc that understands more about how you organize your data in Accumulo.
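For illustration, the entry-per-tuple option can be sketched in a few lines. This is Python rather than Java, and the ((row, family, qualifier), value) pairs are a hypothetical stand-in for the entries a Scanner returns, not the Accumulo API. A real LoadFunc would build Pig tuples instead of Python tuples.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical stand-in for one Accumulo entry: ((row, family, qualifier), value).
Entry = Tuple[Tuple[str, str, str], str]


def entries_to_flat_tuples(
    entries: Iterable[Entry],
) -> Iterator[Tuple[str, str, str, str]]:
    """Emit one flat tuple per Key/Value entry, with no grouping by row."""
    for (row, family, qualifier), value in entries:
        yield (row, family, qualifier, value)


# Made-up sample data, sorted by key the way scan results would be.
sample = [
    (("row1", "cf", "a"), "1"),
    (("row1", "cf", "b"), "2"),
    (("row2", "cf", "a"), "3"),
]
flat = list(entries_to_flat_tuples(sample))
```

Each tuple carries its own row ID, so any regrouping into objects would have to happen later, for example with a grouping step in the Pig script.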

The second option that I've seen is the one that you describe -- namely iterating over the columns in each row to produce a row tuple. The way to do this is really just to loop over the elements returned by a Scanner and pinch off tuples when you see a new row ID. If you use the same splitting strategy that AccumuloInputFormat uses (i.e. pick split points from tablets) then rows will never be split across multiple splits. A single Scanner per RecordReader should work great for you, and I've seen other people implement LoadFunc successfully in that way.
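A minimal sketch of that row-boundary loop, assuming the entries arrive sorted by row as they would from a scan. It is written in Python for brevity, and the ((row, family, qualifier), value) pairs and the function name are hypothetical stand-ins, not the Accumulo or Pig API.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical stand-in for one Accumulo entry: ((row, family, qualifier), value).
Entry = Tuple[Tuple[str, str, str], str]
Column = Tuple[str, str, str]  # (family, qualifier, value)


def rows_from_entries(entries: Iterable[Entry]) -> Iterator[Tuple[str, List[Column]]]:
    """Group row-sorted entries into (row, [columns...]) tuples."""
    current_row: Optional[str] = None
    columns: List[Column] = []
    for (row, family, qualifier), value in entries:
        if current_row is not None and row != current_row:
            # A new row ID appeared: pinch off the finished row.
            yield (current_row, columns)
            columns = []
        current_row = row
        columns.append((family, qualifier, value))
    if current_row is not None:
        # Flush the last row once the entries run out.
        yield (current_row, columns)


# Made-up sample data, sorted by key.
sample = [
    (("row1", "cf", "a"), "1"),
    (("row1", "cf", "b"), "2"),
    (("row2", "cf", "a"), "3"),
]
grouped = list(rows_from_entries(sample))
```

The loop only ever compares against the previous row ID, so it depends on the input being sorted by row and on a row never spanning two input splits.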

One thing to watch out for is that a single row in Accumulo could potentially have hundreds of millions or even billions of columns. Using a row-based tuple for your LoadFunc could result in your application running out of memory if you try to process any arbitrary table. We commonly see this when looking at graph structures where edges are represented by columns. Zipfian distributions can make for some very big rows. This just means you have to be a bit careful about what you try to pull into a tuple.
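One possible way to be careful, sketched here as an assumption rather than anything prescribed in this thread, is to cap how many columns go into a single tuple and emit a very wide row as several smaller tuples. The max_columns parameter, the function name, and the ((row, family, qualifier), value) entry shape are all hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical stand-in for one Accumulo entry: ((row, family, qualifier), value).
Entry = Tuple[Tuple[str, str, str], str]
Column = Tuple[str, str, str]  # (family, qualifier, value)


def bounded_rows(
    entries: Iterable[Entry], max_columns: int
) -> Iterator[Tuple[str, List[Column]]]:
    """Group row-sorted entries by row, holding at most max_columns at once."""
    if max_columns < 1:
        raise ValueError("max_columns must be at least 1")
    current_row: Optional[str] = None
    columns: List[Column] = []
    for (row, family, qualifier), value in entries:
        # Emit the buffer when the row changes or the cap is reached.
        if columns and (row != current_row or len(columns) >= max_columns):
            yield (current_row, columns)
            columns = []
        current_row = row
        columns.append((family, qualifier, value))
    if columns:
        yield (current_row, columns)


# Made-up graph-like data: one "hub" row with many edge columns, one small row.
sample = [(("hub", "edge", q), "1") for q in "abcde"]
sample.append((("leaf", "edge", "a"), "1"))
chunks = [(row, len(cols)) for row, cols in bounded_rows(sample, max_columns=2)]
```

The trade-off is that one row may now arrive as several tuples, so downstream code has to tolerate partial rows.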

Cheers,
Adam


On Fri, Dec 9, 2011 at 4:50 PM, Trevor Adams <trevoradams@gmail.com> wrote:
So I am looking to create a LoadFunc for Accumulo, and am just wondering what would be the "correct" way to do this. Here is my current plan.

A Pig tuple would be the set of columns for one given row in Accumulo, so creating the tuples with the Scanner seems possibly a bit odd. I would loop over the elements that it gives out (column/value pairs), fold/reduce on the row ID, and create some intermediate element for a pseudo InputFormat of <Row, ColVals> that can be used in the LoadFunc.

Since I don't understand some of the stuff in Accumulo, there may be a better way to accomplish the above. If there is, great; otherwise I will begin on the above.

-Trevor

