hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: How Grouping works for multiple groups
Date Mon, 19 May 2008 18:53:44 GMT
Paolo had already suggested that we add an EXECUTE command for exactly 
this purpose in interactive mode.


Utkarsh Srivastava wrote:
> Yes, I agree; avoiding new syntax is much preferable.
> Doing this optimization automatically for the batch mode is a good idea.
> For the interactive mode, we would need something like a COMMIT
> statement to force execution (instead of execution starting
> automatically on a STORE command, as it currently does).
> As regards failure, we could start with our current model: one failure
> fails everything.
> Utkarsh
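As a sketch, the deferred interactive flow under discussion might look like the following in the Grunt shell (the EXECUTE command and all file and field names here are hypothetical; today each STORE triggers execution immediately):

```pig
grunt> A = LOAD 'input' AS (f1, f2);
grunt> B = FILTER A BY f1 > 0;
grunt> C = FILTER A BY f1 <= 0;
grunt> STORE B INTO 'b_out';   -- under the proposal: queued, not yet run
grunt> STORE C INTO 'c_out';   -- queued as well
grunt> EXECUTE;                -- hypothetical: plan and run both stores over one scan
```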
>> -----Original Message-----
>> From: Olga Natkovich [mailto:olgan@yahoo-inc.com]
>> Sent: Monday, May 19, 2008 11:23 AM
>> To: pig-dev@incubator.apache.org
>> Subject: RE: How Grouping works for multiple groups
>> Utkarsh,
>> I agree that this issue has been brought up a number of times and
>> needs
>> to be addressed. I think it would be nice if we could address this
>> without introducing new syntax for STORE. In batch mode, this would be
>> quite easy, since we can build an execution plan for the entire script
>> rather than one STORE at a time. I realize that the interactive and
>> embedded cases are a bit trickier. We also need to clarify the
>> semantics of this kind of operation in the presence of failure: if one
>> STORE fails, what happens to the rest of the computation?
>> Olga
>>> -----Original Message-----
>>> From: Utkarsh Srivastava [mailto:utkarsh@yahoo-inc.com]
>>> Sent: Monday, May 19, 2008 11:06 AM
>>> To: pig-dev@incubator.apache.org
>>> Subject: FW: How Grouping works for multiple groups
>>> Following is an email that showed up on the user-list. I am
>>> sure most people must have seen it.
>>> The guy wants to scan the data once and do multiple things
>>> with it. This kind of need arises often, but we don't have a
>>> very good answer to it.
>>> We have SPLIT, but that is only half the solution (and
>>> probably not a very good one).
>>> What is needed is more like a multi-store command (I think
>>> someone has proposed it on one of these lists before).
>>> So you would be able to do things like:
>>> A = LOAD ...
>>> B = FILTER A BY ...
>>> C = FILTER A BY ...
>>> -- do something with B
>>> -- do something else with C
>>> STORE B, C;   <===== the new multi-store command
>>> Sawzall does better than us in this regard because they have
>>> collectors to which you can output data, and you can set up
>>> as many collectors as you want.
>>> Utkarsh
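For comparison, the SPLIT route available today might be sketched as follows (file names, fields, and predicates are hypothetical). It is only half a solution because each STORE still triggers its own execution, so the shared upstream work is repeated per store:

```pig
A = LOAD 'input' AS (f1, f2);
SPLIT A INTO B IF f1 > 0, C IF f1 <= 0;
-- do something with B and C ...
STORE B INTO 'b_out';   -- each STORE launches its own plan today,
STORE C INTO 'c_out';   -- so the LOAD/SPLIT work is done once per store
```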
>>> -----Original Message-----
>>> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
>>> Sent: Monday, May 19, 2008 1:24 AM
>>> To: pig-user@incubator.apache.org
>>> Cc: Holsman, Ian
>>> Subject: How Grouping works for multiple groups
>>> Hi folks,
>>>              I am new to Pig, with a little Hadoop
>>> Map-Reduce experience. I recently had a chance to use Pig for
>>> my data analysis task, for which I had written a Map-Reduce
>>> program earlier.
>>> A few questions came up in my mind that I thought would be
>>> better asked in this forum. Here's a brief description of my
>>> analysis task to give you an idea of what I am doing.
>>> - For each tuple I need to classify the data into 3 groups: A, B, C.
>>> - For groups A and B, I need to aggregate the number of distinct items
>>>   in each group and have them sorted in reverse order in the output.
>>> - For group C, I only need to output the distinct items.
>>> - The output for each group goes to its own output
>>> file, e.g. A_file.txt, B_file.txt.
>>> Now, it seems that in Pig's execution plan each GROUP
>>> operation is a separate Map-Reduce job, even though it is
>>> happening on the same set of tuples. Writing a
>>> Map-Reduce job by hand, in contrast, allows me to prefix a "group
>>> identifier" of my choice to the key and produce the
>>> relevant value data, which I then use in the
>>> combiner and reducer to perform the other operations and
>>> output to different files.
>>> If my understanding of Pig is correct, its execution plan
>>> spawns multiple Map-Reduce jobs that scan the same data set
>>> again for different groups, which is costlier than writing a
>>> custom Map-Reduce job and packing more work into a single
>>> job the way I described.
>>> I can always reduce the number of groups in my Pig script to
>>> one by having a user-defined function generate the group
>>> prefixes before the GROUP call, and then doing multiple filters on
>>> the group key,
>>> again using a user-defined function that does group
>>> identification. But this is less than intuitive and requires
>>> more user-defined functions than one would like.
>>> My question is: do current optimization techniques take care
>>> of such a scenario? My observation is that they don't, but I
>>> could be wrong here. If they do, how can I peek
>>> into the execution plan to make sure it is not spawning
>>> more Map-Reduce jobs than necessary?
>>> If they don't, is this something planned for the future?
>>> Also, I don't see the 'Pig Pen' debugging environment anywhere.
>>> Is it still a part of Pig, and if so, how can I use it?
>>> I know it's been a rather long mail, but any help here is
>>> deeply appreciated, as going forward we plan to use Pig
>>> heavily to avoid writing custom Map-Reduce jobs for every
>>> different kind of analysis that we intend to do.
>>> Thanks and Regards
>>> -Ankur
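The single-GROUP workaround Ankur describes might look roughly like this in Pig Latin (the UDF name `TagGroup` and all file and field names are hypothetical):

```pig
raw    = LOAD 'input' AS (item);
-- TagGroup is a hypothetical UDF mapping each tuple to 'A', 'B', or 'C'
tagged = FOREACH raw GENERATE myudfs.TagGroup(item) AS grp, item;
-- a single GROUP over the tagged key replaces three separate GROUPs
grpd   = GROUP tagged BY (grp, item);
-- then pull the groups back apart with filters on the tag
a_out  = FILTER grpd BY group.grp == 'A';
b_out  = FILTER grpd BY group.grp == 'B';
c_out  = FILTER grpd BY group.grp == 'C';
STORE a_out INTO 'A_file.txt';  -- note: each STORE would still trigger
STORE b_out INTO 'B_file.txt';  -- its own execution today, which is
STORE c_out INTO 'C_file.txt';  -- exactly the problem raised in this thread
```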
