hadoop-pig-dev mailing list archives

From "Olga Natkovich" <ol...@yahoo-inc.com>
Subject RE: How Grouping works for multiple groups
Date Tue, 20 May 2008 17:01:51 GMT
Thanks, Pi. Yes, I totally agree that this should be optional.

Olga 

> -----Original Message-----
> From: pi song [mailto:pi.songs@gmail.com] 
> Sent: Tuesday, May 20, 2008 3:33 AM
> To: pig-dev@incubator.apache.org
> Subject: Re: How Grouping works for multiple groups
> 
> Conceptually, the more we can capture of what users want to do 
> as a whole, the smarter the query optimizer can be. It is good 
> if users can construct the whole processing graph and process 
> it all at once, but changing STORE from "do it right now" to 
> "do it later" seems a bit dodgy to me. Introducing transaction-like 
> syntax is OK, but please make it optional: if we don't use it, 
> things work the way they do now. Some people might still want 
> to just write a few lines and go!
> 
> On the backend side:
> 
> 1) The new execution engine design allows us to wire the plan 
> as a DAG, but I'm not sure whether it executes by looking at 
> the DAG or just by extracting a tree from it.
> 
> 2) We already have a disjoint union operator called POPackage 
> for tagging purposes.
> 
> I view this suggestion as "another pattern" for the query 
> optimizer. We shouldn't enforce it, but we do have to make it 
> "possible to do". (There is a common issue in optimization: 
> sometimes different techniques just cannot work together!)
> Pi
> 
> 
> 
> On 5/20/08, Olga Natkovich <olgan@yahoo-inc.com> wrote:
> >
> > I think we should introduce BEGIN ... EXECUTE {ALL}, where:
> >
> > BEGIN can be omitted, in which case it is assumed to be at the 
> > beginning of the script/program/session.
> > EXECUTE would mean "best-effort execute": we try to execute 
> > everything and let the user know what succeeded and what failed.
> > EXECUTE ALL would mean execute as a transaction, aborting 
> > everything on failure.
> >
> > Olga
> >
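As a purely illustrative sketch (BEGIN and EXECUTE {ALL} are only
proposed in this thread and are not existing Pig Latin syntax; the
file names and fields are made up), a script under this scheme might
look like:

    BEGIN
    A = LOAD 'input' USING PigStorage() AS (f1, f2);
    B = FILTER A BY f1 > 0;
    C = GROUP A BY f2;
    STORE B INTO 'filtered';  -- deferred: not executed yet
    STORE C INTO 'grouped';   -- deferred: not executed yet
    EXECUTE ALL               -- run everything as one unit, aborting all on failure

With plain EXECUTE instead of EXECUTE ALL, both stores would still be
attempted, but the failure of one would not abort the other.
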
> > > -----Original Message-----
> > > From: Alan Gates [mailto:gates@yahoo-inc.com]
> > > Sent: Monday, May 19, 2008 11:54 AM
> > > To: pig-dev@incubator.apache.org
> > > Subject: Re: How Grouping works for multiple groups
> > >
> > > Paolo had already suggested that we add an EXECUTE command for 
> > > exactly this purpose in interactive mode.
> > >
> > > Alan.
> > >
> > > Utkarsh Srivastava wrote:
> > > > Yes, I agree, not introducing new syntax is much preferable.
> > > >
> > > > Doing this optimization automatically for the batch mode is 
> > > > a good idea. For the interactive mode, we would need something 
> > > > like a COMMIT statement, which would force execution (with 
> > > > execution not automatically starting on a STORE command as it 
> > > > currently does).
> > > >
> > > > As regards failure, we could start with our current model: one 
> > > > failure fails everything.
> > > >
> > > > Utkarsh
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Olga Natkovich [mailto:olgan@yahoo-inc.com]
> > > >> Sent: Monday, May 19, 2008 11:23 AM
> > > >> To: pig-dev@incubator.apache.org
> > > >> Subject: RE: How Grouping works for multiple groups
> > > >>
> > > >> Utkarsh,
> > > >>
> > > >> I agree that this issue has been brought up a number of times 
> > > >> and needs to be addressed. I think it would be nice if we 
> > > >> could address this without introducing new syntax for STORE. 
> > > >> In batch mode, this would be quite easy since we can build an 
> > > >> execution plan for the entire script rather than one store at 
> > > >> a time. I realize that for the interactive and embedded cases 
> > > >> it is a bit trickier. Also, we need to clarify the semantics 
> > > >> of this kind of operation in the presence of failure: if one 
> > > >> store fails, what happens to the rest of the computation?
> > > >>
> > > >> Olga
> > > >>
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: Utkarsh Srivastava [mailto:utkarsh@yahoo-inc.com]
> > > >>> Sent: Monday, May 19, 2008 11:06 AM
> > > >>> To: pig-dev@incubator.apache.org
> > > >>> Subject: FW: How Grouping works for multiple groups
> > > >>>
> > > >>> Following is an email that showed up on the user list. I am 
> > > >>> sure most people have seen it already.
> > > >>>
> > > >>> The guy wants to scan the data once and do multiple things 
> > > >>> with it. This kind of need arises often, but we don't have a 
> > > >>> very good answer for it.
> > > >>>
> > > >>> We have SPLIT, but that is only half the solution (and 
> > > >>> probably not a very good one).
> > > >>>
> > > >>> What is needed is more like a multi-store command (I think 
> > > >>> someone has proposed it on one of these lists before).
> > > >>>
> > > >>> So you would be able to do things like
> > > >>>
> > > >>> A = LOAD ...
> > > >>> B = FILTER A BY ..
> > > >>> C = FILTER A BY ..
> > > >>> -- do something with B
> > > >>> -- do something else with C
> > > >>> STORE B, C   <===== The new multi-store command
> > > >>>
> > > >>>
> > > >>> Sawzall does better than us in this regard because it has 
> > > >>> collectors to which you can output data, and you can set up 
> > > >>> as many collectors as you want.
> > > >>>
> > > >>> Utkarsh
> > > >>>
> > > >>> -----Original Message-----
> > > >>> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> > > >>> Sent: Monday, May 19, 2008 1:24 AM
> > > >>> To: pig-user@incubator.apache.org
> > > >>> Cc: Holsman, Ian
> > > >>> Subject: How Grouping works for multiple groups
> > > >>>
> > > >>> Hi folks,
> > > >>>              I am new to Pig, with a little bit of Hadoop 
> > > >>> Map-Reduce experience. I recently had a chance to use Pig for 
> > > >>> my data analysis task, for which I had written a Map-Red 
> > > >>> program earlier. A few questions came up that I thought would 
> > > >>> be better asked in this forum. Here's a brief description of 
> > > >>> my analysis task to give you an idea of what I am doing.
> > > >>>
> > > >>> - For each tuple, I need to classify the data into 3 
> > > >>>   groups: A, B, C.
> > > >>>
> > > >>> - For groups A and B, I need to aggregate the number of 
> > > >>>   distinct items in each group and have them sorted in 
> > > >>>   reverse order in the output.
> > > >>>
> > > >>> - For group C, I only need to output those distinct items.
> > > >>>
> > > >>> - The output for each of these goes to its respective output 
> > > >>>   file, e.g. A_file.txt, B_file.txt.
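
For concreteness, a rough Pig Latin sketch of one plausible reading of
this task, written the straightforward way (the field names and the
myudfs.Classify UDF are hypothetical placeholders; only group A is
shown in full, B would be analogous):

    raw    = LOAD 'events' USING PigStorage() AS (item, attrs);
    tagged = FOREACH raw GENERATE myudfs.Classify(attrs) AS grp, item;  -- tag each tuple A, B or C

    a      = FILTER tagged BY grp == 'A';
    a_grp  = GROUP a BY item;
    a_cnt  = FOREACH a_grp GENERATE group AS item, COUNT(a) AS cnt;
    a_ord  = ORDER a_cnt BY cnt DESC;
    STORE a_ord INTO 'A_file';

    -- ...the same pattern again for group B...

    c      = FILTER tagged BY grp == 'C';
    c_item = FOREACH c GENERATE item;
    c_dist = DISTINCT c_item;
    STORE c_dist INTO 'C_file';

Because each STORE currently triggers its own execution, the same
input ends up being scanned once per output, which is exactly the cost
described below.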
> > > >>>
> > > >>>
> > > >>> Now, it seems that in Pig's execution plan each GROUP 
> > > >>> operation becomes a separate Map-Reduce job even though it's 
> > > >>> happening on the same set of tuples. Writing a Map-Red job 
> > > >>> by hand, on the other hand, allows me to prefix a "group 
> > > >>> identifier" of my choice to the 'key' and produce the 
> > > >>> relevant 'value' data, which I then use in the combiner and 
> > > >>> reducer to perform the other operations and output to 
> > > >>> different files.
> > > >>>
> > > >>> If my understanding of Pig is correct, its execution plan 
> > > >>> spawns multiple Map-Red jobs that scan the same data set 
> > > >>> again for different groups, which is costlier than writing a 
> > > >>> custom Map-Red job and packing more work into a single job 
> > > >>> the way I described.
> > > >>>
> > > >>> I can always reduce the number of groups in my Pig scripts 
> > > >>> to 1 by having a user-defined function generate those group 
> > > >>> prefixes before the GROUP call, and then doing multiple 
> > > >>> filters on the group 'key', again using a user-defined 
> > > >>> function that does the group identification. But this is 
> > > >>> less than intuitive and requires more user-defined functions 
> > > >>> than one would like.
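
A hypothetical sketch of that workaround (myudfs.MakeKey and
myudfs.GroupOf are placeholder UDF names, not existing functions):

    raw     = LOAD 'events' USING PigStorage() AS (item, attrs);
    keyed   = FOREACH raw GENERATE myudfs.MakeKey(attrs, item) AS key, item;  -- key carries a group prefix, e.g. 'A|...'
    grouped = GROUP keyed BY key;                                             -- one GROUP over the whole data set
    only_a  = FILTER grouped BY myudfs.GroupOf(group) == 'A';                 -- peel the groups apart again by prefix
    only_b  = FILTER grouped BY myudfs.GroupOf(group) == 'B';
    only_c  = FILTER grouped BY myudfs.GroupOf(group) == 'C';

As the mail says, this works but needs extra UDFs on both sides of the
GROUP and is much less readable than the multi-store style sketched
earlier in the thread.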
> > > >>>
> > > >>> My question is: do the current optimization techniques take 
> > > >>> care of such a scenario? My observation is that they don't, 
> > > >>> but I could be wrong here. If they do, how can I have a peek 
> > > >>> at the execution plan to make sure it is not spawning more 
> > > >>> Map-Red jobs than necessary?
> > > >>>
> > > >>> If they don't, then is this something planned for the future?
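
On the question of peeking at the execution plan: if the Pig build in
use supports it, the Grunt shell's EXPLAIN command prints the plan for
an alias, for example

    EXPLAIN a_ord;

(a_ord being whatever alias is about to be stored), which makes it
possible to see how many Map-Red jobs a script will spawn.
Availability and output format vary across Pig versions, so treat this
as a hint rather than a guarantee.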
> > > >>>
> > > >>> Also, I don't see the 'Pig Pen' debugging environment 
> > > >>> anywhere. Is it still a part of Pig, and if so, how can I 
> > > >>> use it?
> > > >>>
> > > >>> I know it's been a rather long mail, but any help here is 
> > > >>> deeply appreciated, as going forward we plan to use Pig 
> > > >>> heavily to avoid writing custom Map-Red jobs for every 
> > > >>> different kind of analysis that we intend to do.
> > > >>>
> > > >>> Thanks and Regards
> > > >>> -Ankur
> > > >>>
> > > >>>
> > >
> >
> 
