hadoop-pig-dev mailing list archives

From "Utkarsh Srivastava" <utka...@yahoo-inc.com>
Subject RE: How Grouping works for multiple groups
Date Mon, 19 May 2008 18:38:03 GMT
Yes, I agree; not introducing new syntax is preferable.

Doing this optimization automatically for the batch mode is a good idea.
For the interactive mode, we would need something like a COMMIT
statement, which will force execution (with execution not automatically
starting on a STORE command as it currently does).

As regards failure, we could start with our current model: one failure
fails everything.

Utkarsh
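
A minimal Python sketch of the deferred-execution idea discussed above, in which STORE only registers an output and a COMMIT-like call runs everything in one scan. This is not Pig code: `DeferredSession`, `store`, and `commit` are hypothetical names invented for illustration, and the failure behavior simply mirrors the "one failure fails everything" model.

```python
from typing import Callable, Dict, Iterable, List

Row = dict
Predicate = Callable[[Row], bool]


class DeferredSession:
    """Hypothetical interactive session: store() registers, commit() executes."""

    def __init__(self, load: Callable[[], Iterable[Row]]) -> None:
        self._load = load
        self._pending: Dict[str, Predicate] = {}
        self.scans = 0  # number of passes made over the input

    def store(self, name: str, predicate: Predicate) -> None:
        # Only record the requested output; nothing is executed yet.
        self._pending[name] = predicate

    def commit(self) -> Dict[str, List[Row]]:
        # A single pass over the input feeds every registered store.
        self.scans += 1
        outputs: Dict[str, List[Row]] = {name: [] for name in self._pending}
        for row in self._load():
            for name, predicate in self._pending.items():
                # An exception raised here aborts the whole commit, so no
                # partial outputs are returned (one failure fails everything).
                if predicate(row):
                    outputs[name].append(row)
        self._pending.clear()
        return outputs
```

With two registered stores, `commit()` reads the input once and returns both outputs; in batch mode the same effect would come from planning the whole script before running it.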

> -----Original Message-----
> From: Olga Natkovich [mailto:olgan@yahoo-inc.com]
> Sent: Monday, May 19, 2008 11:23 AM
> To: pig-dev@incubator.apache.org
> Subject: RE: How Grouping works for multiple groups
> 
> Utkarsh,
> 
> I agree that this issue has been brought up a number of times and needs
> to be addressed. I think it would be nice if we could address this
> without introducing new syntax for store. In batch mode, this would be
> quite easy since we can build execution plan for the entire script
> rather than one store at a time. I realize that for interactive and
> embedded case it is a bit trickier. Also we need to clarify what are the
> semantics of this kind of operation in the presence of failure. If one
> store fails, what happens with the rest of the computation?
> 
> Olga
> 
> > -----Original Message-----
> > From: Utkarsh Srivastava [mailto:utkarsh@yahoo-inc.com]
> > Sent: Monday, May 19, 2008 11:06 AM
> > To: pig-dev@incubator.apache.org
> > Subject: FW: How Grouping works for multiple groups
> >
> > Following is an email that showed up on the user-list. I am
> > sure most people must have seen it.
> >
> > The guy wants to scan the data once and do multiple things
> > with it. This kind of a need arises often but we don't have a
> > very good answer to it.
> >
> > We have SPLIT, but that is only half the solution (and
> > probably not a very good one).
> >
> > What is needed is more like a multi-store command (I think
> > someone has proposed it on one of these lists before).
> >
> > So you would be able to do things like
> >
> > A = LOAD ...
> > B = FILTER A by ..
> > C = FILTER A by ..
> > -- do something with B
> > -- do something else with C
> > store B,C   <===== The new multi-store command
> >
> >
> > Sawzall does better than us in this regard because they have
> > collectors to which you can output data, and you can set up
> > as many collectors as you want.
> >
> > Utkarsh
> >
> > -----Original Message-----
> > From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> > Sent: Monday, May 19, 2008 1:24 AM
> > To: pig-user@incubator.apache.org
> > Cc: Holsman, Ian
> > Subject: How Grouping works for multiple groups
> >
> > Hi folks,
> > I am new to PIG and have a little Hadoop
> > Map-reduce experience. I recently had a chance to use PIG for
> > a data analysis task for which I had written a Map-Red
> > program earlier.
> > A few questions came up in my mind that I thought would be
> > better asked in this forum. Here's a brief description of my
> > analysis task to give you an idea of what I am doing.
> >
> > - For each tuple I need to classify the data into 3 groups - A, B, C.
> >
> > - For group A and B, I need to aggregate the number of distinct items
> >   in each group and have them sorted in reverse order in the output.
> >
> > - For group C, I only need to output those distinct items.
> >
> > - The output for each of these goes to its respective output
> >   file, e.g. A_file.txt, B_file.txt
> >
> >
> > Now, it seems that in PIG's execution plan each 'Group'
> > operation is a separate Map-Reduce job even though it is
> > happening on the same set of tuples. Writing a Map-Red job
> > for the same task, by contrast, allows me to prefix a "Group
> > identifier" of my choice to the 'key' and produce the
> > relevant 'value' data, which I then use in the
> > combiner and reducer to perform the other operations and
> > write to different files.
> >
> > If my understanding of PIG is correct, then its execution plan
> > spawns multiple Map-Red jobs that scan the same data-set
> > again for each group, which is costlier than writing a
> > custom Map-Red job and packing more work into a single
> > job the way I mentioned.
> >
> > I can always reduce the number of groups in my PIG script to
> > 1 by having a user-defined function generate those group
> > prefixes before a group call, and then apply multiple filters on
> > the group 'key', again using a user-defined function that does
> > group identification. But this is less than intuitive and requires
> > more user-defined functions than one would like.
> >
> > My question is: do current optimization techniques take care
> > of such a scenario? My observation is that they don't, but I
> > could be wrong here. If they do, then how can I peek
> > into the execution plan to make sure that it's not spawning
> > more Map-Red jobs than necessary?
> >
> > If they don't, then is it something planned for the future?
> >
> > Also, I don't see the 'Pig Pen' debugging environment anywhere.
> > Is it still a part of PIG? If yes, how can I use it?
> >
> > I know it's been a rather long mail, but any help here is
> > deeply appreciated as going forward we plan to use PIG
> > heavily to avoid writing custom Map-Red jobs for every
> > different kind of analysis that we intend to do.
> >
> > Thanks and Regards
> > -Ankur
> >
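
The single-job approach described in the original question (prefix a group identifier to the key, aggregate once, then route by that identifier) can be sketched in plain Python as below. This is an illustrative sketch, not Hadoop or Pig code: `single_pass` and `classify` are hypothetical names, and it assumes that "aggregate the number of distinct items" means counting occurrences of each distinct item and sorting by descending count.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def single_pass(records: Iterable[str], classify: Callable[[str], str]) -> dict:
    # "Map" step: build a composite key (group identifier, item) per record.
    # "Reduce" step: count occurrences of each composite key.
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for item in records:  # the input is scanned exactly once
        counts[(classify(item), item)] += 1

    # Route each aggregated key to the output of its group identifier.
    per_group: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for (group, item), n in counts.items():
        per_group[group].append((item, n))

    outputs: dict = {}
    # Groups A and B: distinct items with counts, highest count first
    # (ties broken by item name so the order is deterministic).
    for group in ("A", "B"):
        outputs[group] = sorted(per_group[group], key=lambda p: (-p[1], p[0]))
    # Group C: only the distinct items themselves.
    outputs["C"] = sorted(item for item, _ in per_group["C"])
    return outputs
```

Each entry of the returned dictionary corresponds to one output file (for example A_file.txt), so three outputs are produced from a single scan instead of one job per group.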
