hadoop-pig-dev mailing list archives

From "Utkarsh Srivastava" <utka...@yahoo-inc.com>
Subject FW: How Grouping works for multiple groups
Date Mon, 19 May 2008 18:06:06 GMT
The following is an email that showed up on the user list. I am sure
most people have already seen it.

The guy wants to scan the data once and do multiple things with it.
This kind of need arises often, but we don't have a very good answer
for it.

We have SPLIT, but that is only half the solution (and probably not a
very good one).
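
With SPLIT you can fork a relation, but each branch still has to be
stored separately, and (as things stand) each STORE runs as its own
plan, so the input gets scanned once per branch. A rough sketch, with
the paths and conditions made up for illustration:

A = LOAD 'input' AS (x, y);
SPLIT A INTO B IF x > 0, C IF x <= 0;
STORE B INTO 'b_output';   -- one Map-Red job
STORE C INTO 'c_output';   -- a second Map-Red job that re-scans 'input'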

What is needed is more like a multi-store command (I think someone has
proposed it on one of these lists before).

So you would be able to do things like:

A = LOAD ...
B = FILTER A BY ..
C = FILTER A BY ..
-- do something with B
-- do something else with C
STORE B, C   <===== the new multi-store command


Sawzall does better than us in this regard because they have collectors
to which you can output data, and you can set up as many collectors as
you want.

Utkarsh

-----Original Message-----
From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com] 
Sent: Monday, May 19, 2008 1:24 AM
To: pig-user@incubator.apache.org
Cc: Holsman, Ian
Subject: How Grouping works for multiple groups

Hi folks,
             I am new to PIG, having a little bit of Hadoop Map-Reduce
experience. I recently had a chance to use PIG for a data analysis task
for which I had earlier written a Map-Red program.
A few questions came up that I thought would be better asked on this
forum. Here's a brief description of my analysis task, to give you an
idea of what I am doing.
 
- For each tuple I need to classify the data into 3 groups - A, B, C.

- For groups A and B, I need to aggregate the number of distinct items
  in each group and have them sorted in reverse order in the output.

- For group C, I only need to output the distinct items.

- The output for each of these goes to its own output file, e.g.
A_file.txt, B_file.txt (a rough sketch of the script is below).
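
In Pig Latin terms, my current script looks roughly like the following
(the schema is simplified, and Classify stands in for the UDF/logic
that assigns a tuple to a group):

raw    = LOAD 'events.txt' AS (item);
tagged = FOREACH raw GENERATE Classify(item) AS grp, item;

a      = FILTER tagged BY grp eq 'A';
a_grp  = GROUP a BY item;
a_cnt  = FOREACH a_grp GENERATE group AS item, COUNT(a) AS cnt;
a_ord  = ORDER a_cnt BY cnt DESC;   -- reverse order by count
STORE a_ord INTO 'A_file.txt';

-- ... the same GROUP/COUNT/ORDER/STORE sequence for group B ...

c      = FILTER tagged BY grp eq 'C';
c_item = FOREACH c GENERATE item;
c_dist = DISTINCT c_item;
STORE c_dist INTO 'C_file.txt';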
 
 
Now, it seems that in PIG's execution plan each GROUP operation becomes
a separate Map-Reduce job, even though it is operating on the same set
of tuples. Writing a Map-Red job by hand, in contrast, allows me to
prefix a group identifier of my choice to the key and produce the
relevant value data, which I then use in the combiner and reducer to
perform the other operations and write to different output files.
 
If my understanding of PIG is correct, its execution plan spawns
multiple Map-Red jobs that scan the same data set again for each group,
which is costlier than writing a custom Map-Red job that packs all the
work into a single pass, the way I mentioned.
 
I can always reduce the number of groups in my PIG scripts to one by
having a user-defined function generate those group prefixes before the
group call, and then applying multiple filters on the group key, again
using a user-defined function that does the group identification. But
this is less than intuitive and requires more user-defined functions
than one would like.
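
Roughly, that workaround would look like this (MakeKey and IsInGroup
being the extra user-defined functions; MakeKey prefixes the group
identifier to the item, and IsInGroup checks that prefix):

raw    = LOAD 'events.txt' AS (item);
keyed  = FOREACH raw GENERATE MakeKey(item) AS gkey;  -- e.g. 'A:someitem'
grpd   = GROUP keyed BY gkey;                         -- a single GROUP, so one Map-Red job
counts = FOREACH grpd GENERATE group, COUNT(keyed) AS cnt;
a_cnt  = FILTER counts BY IsInGroup(group, 'A');
b_cnt  = FILTER counts BY IsInGroup(group, 'B');
-- sorting and storing each group would then follow as before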
 
My question is: do the current optimization techniques take care of
such a scenario? My observation is that they don't, but I could be
wrong here. If they do, how can I peek into the execution plan to make
sure it is not spawning more Map-Red jobs than necessary?
 
If they don't, is this something planned for the future?
 
Also, I don't see the 'Pig Pen' debugging environment anywhere. Is it
still a part of PIG, and if so, how can I use it?
 
I know this has been a rather long mail, but any help here is deeply
appreciated, as going forward we plan to use PIG heavily to avoid
writing a custom Map-Red job for every different kind of analysis that
we intend to do.
 
Thanks and Regards
-Ankur
