hadoop-pig-dev mailing list archives

From "pi song" <pi.so...@gmail.com>
Subject Re: How Grouping works for multiple groups
Date Tue, 20 May 2008 10:32:41 GMT
Conceptually, the more we can capture of what users want to do as a whole,
the more clever a query optimizer we can have. It is good if users can
construct the whole processing graph and process it all at once, but
changing STORE from "do it right now" to "do it later" seems a bit dodgy
to me. Introducing transaction-like syntax is OK, but please make it
optional, meaning that if we don't use it, Pig just behaves the way it
does now. Some people might still want to write just a few lines and go!
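
To make that concrete, here is a rough sketch of what the optional
transaction-like syntax could look like. This is hypothetical syntax --
nothing like it exists in Pig today -- following the BEGIN/EXECUTE idea
in Olga's proposal quoted below; paths and field positions are invented:

```pig
-- Hypothetical sketch only: BEGIN/EXECUTE are proposed, not implemented.
BEGIN                               -- optional; starts a deferred block
A = LOAD 'logs' USING PigStorage();
B = FILTER A BY $0 == 'A';
C = FILTER A BY $0 == 'B';
STORE B INTO 'A_out';               -- deferred: nothing runs yet
STORE C INTO 'B_out';               -- deferred: nothing runs yet
EXECUTE;                            -- run all pending stores, best effort
```

Without BEGIN ... EXECUTE, each STORE would run immediately, as today,
which preserves the "few lines and go" style.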

On the backend side:

1) The new execution engine design allows us to wire the plan as a DAG,
but I'm not sure whether it executes by looking at the DAG or by just
extracting a tree from it.

2) We already have a disjoint union operator called POPackage for tagging
purposes.

I view this suggestion as "another pattern" for the query optimizer. We
shouldn't enforce it, but we do have to make it "possible to do". (There
is a common issue in optimization: sometimes different techniques just
cannot work together!)
Pi



On 5/20/08, Olga Natkovich <olgan@yahoo-inc.com> wrote:
>
> I think we should introduce BEGIN ... EXECUTE {ALL} where
>
> BEGIN can be omitted, in which case it is assumed to be at the beginning
> of the script/program/session.
> EXECUTE would mean "best-effort execute": we try to execute everything
> and let the user know what succeeded and what failed.
> EXECUTE ALL would mean execute as a transaction, aborting everything on
> failure.
>
> Olga
>
> > -----Original Message-----
> > From: Alan Gates [mailto:gates@yahoo-inc.com]
> > Sent: Monday, May 19, 2008 11:54 AM
> > To: pig-dev@incubator.apache.org
> > Subject: Re: How Grouping works for multiple groups
> >
> > Paolo had already suggested that we add an EXECUTE command
> > for exactly this purpose in interactive mode.
> >
> > Alan.
> >
> > Utkarsh Srivastava wrote:
> > > Yes, I agree, not introducing new syntax is much preferable.
> > >
> > > Doing this optimization automatically for the batch mode is
> > a good idea.
> > > For the interactive mode, we would need something like a COMMIT
> > > statement, which will force execution (with execution not
> > > automatically starting on a STORE command as it currently does).
> > >
> > > As regards failure, we could start with our current model,
> > one failure
> > > fails everything.
> > >
> > > Utkarsh
> > >
> > >
> > >> -----Original Message-----
> > >> From: Olga Natkovich [mailto:olgan@yahoo-inc.com]
> > >> Sent: Monday, May 19, 2008 11:23 AM
> > >> To: pig-dev@incubator.apache.org
> > >> Subject: RE: How Grouping works for multiple groups
> > >>
> > >> Utkarsh,
> > >>
> > >> I agree that this issue has been brought up a number of times and
> > >> needs to be addressed. I think it would be nice if we could address
> > >> this without introducing new syntax for store. In batch mode, this
> > >> would be quite easy, since we can build the execution plan for the
> > >> entire script rather than one store at a time. I realize that it is
> > >> a bit trickier for the interactive and embedded cases. We also need
> > >> to clarify the semantics of this kind of operation in the presence
> > >> of failure: if one store fails, what happens to the rest of the
> > >> computation?
> > >>
> > >> Olga
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: Utkarsh Srivastava [mailto:utkarsh@yahoo-inc.com]
> > >>> Sent: Monday, May 19, 2008 11:06 AM
> > >>> To: pig-dev@incubator.apache.org
> > >>> Subject: FW: How Grouping works for multiple groups
> > >>>
> > >>> The following is an email that showed up on the user list; I am
> > >>> sure most people have seen it.
> > >>>
> > >>> The guy wants to scan the data once and do multiple things with
> > >>> it. This kind of need arises often, but we don't have a very good
> > >>> answer to it.
> > >>>
> > >>> We have SPLIT, but that is only half the solution (and
> > probably not
> > >>> a very good one).
> > >>>
> > >>> What is needed is more like a multi-store command (I
> > think someone
> > >>> has proposed it on one of these lists before).
> > >>>
> > >>> So you would be able to do things like
> > >>>
> > >>> A = LOAD ...
> > >>> B = FILTER A BY ...
> > >>> C = FILTER A BY ...
> > >>> -- do something with B
> > >>> -- do something else with C
> > >>> STORE B, C;   <===== the new multi-store command
> > >>>
> > >>>
> > >>> Sawzall does better than us in this regard because it has
> > >>> collectors to which you can output data, and you can set up as
> > >>> many collectors as you want.
> > >>>
> > >>> Utkarsh
> > >>>
> > >>> -----Original Message-----
> > >>> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> > >>> Sent: Monday, May 19, 2008 1:24 AM
> > >>> To: pig-user@incubator.apache.org
> > >>> Cc: Holsman, Ian
> > >>> Subject: How Grouping works for multiple groups
> > >>>
> > >>> Hi folks,
> > >>>              I am new to Pig, with a little Hadoop Map-Reduce
> > >>> experience. I recently had a chance to use Pig for a data analysis
> > >>> task for which I had earlier written a Map-Reduce program. A few
> > >>> questions came up that I thought would be better asked in this
> > >>> forum. Here's a brief description of my analysis task, to give you
> > >>> an idea of what I am doing.
> > >>>
> > >>> - For each tuple, I need to classify the data into 3 groups: A,
> > >>>   B, C.
> > >>>
> > >>> - For groups A and B, I need to aggregate the number of distinct
> > >>>   items in each group and have them sorted in reverse order in
> > >>>   the output.
> > >>>
> > >>> - For group C, I only need to output the distinct items.
> > >>>
> > >>> - The output for each of these goes to its respective output
> > >>>   file, e.g. A_file.txt, B_file.txt.
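> > >>>
> > >>> Roughly, in today's Pig this task needs a separate pipeline (and
> > >>> GROUP) per output. A sketch -- the input path, field names, and
> > >>> the Classify UDF are all invented for illustration, and the exact
> > >>> aggregation details are elided:

```pig
-- Sketch only: path, field names, and myudfs.Classify are invented.
raw    = LOAD 'input.txt' AS (item);
tagged = FOREACH raw GENERATE myudfs.Classify(item) AS grp, item;

a_only = FILTER tagged BY grp == 'A';   -- one pipeline per group...
a_dist = DISTINCT a_only;
a_all  = GROUP a_dist ALL;
a_cnt  = FOREACH a_all GENERATE COUNT(a_dist);
STORE a_cnt INTO 'A_file.txt';
-- ...likewise for B; for group C, STORE the DISTINCT items directly.
```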
> > >>>
> > >>>
> > >>> Now, it seems that in Pig's execution plan each 'GROUP'
> > >>> operation is a separate Map-Reduce job, even though it is
> > >>> happening on the same set of tuples. Writing a Map-Reduce job for
> > >>> the same task, by contrast, lets me prefix a "group identifier"
> > >>> of my choice to the 'key' and produce the relevant 'value' data,
> > >>> which I then use in the combiner and reducer to perform the other
> > >>> operations and output to different files.
> > >>>
> > >>> If my understanding of Pig is correct, its execution plan spawns
> > >>> multiple Map-Reduce jobs that scan the same data set again for
> > >>> the different groups, which is costlier than writing a custom
> > >>> Map-Reduce job and packing more work into a single job the way I
> > >>> described.
> > >>>
> > >>> I can always reduce the number of groups in my Pig scripts to 1
> > >>> by having a user-defined function generate the group prefixes
> > >>> before the group call, and then doing multiple filters on the
> > >>> group 'key' using another user-defined function that does group
> > >>> identification. But this is less than intuitive and requires more
> > >>> user-defined functions than one would like.
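> > >>>
> > >>> A sketch of that workaround -- MakeKey and GroupOf are
> > >>> hypothetical UDFs I would have to write, and the path and field
> > >>> names are invented:

```pig
-- Workaround sketch: tag each tuple, group once, filter per group.
-- myudfs.MakeKey and myudfs.GroupOf are hypothetical user-defined
-- functions; nothing here is built into Pig.
raw    = LOAD 'input.txt' AS (item);
tagged = FOREACH raw GENERATE myudfs.MakeKey(item) AS key, item;  -- key like 'A:...'
grpd   = GROUP tagged BY key;      -- a single GROUP => a single Map-Reduce job
a_part = FILTER grpd BY myudfs.GroupOf(group) == 'A';
b_part = FILTER grpd BY myudfs.GroupOf(group) == 'B';
c_part = FILTER grpd BY myudfs.GroupOf(group) == 'C';
-- then aggregate and store each part separately
```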
> > >>>
> > >>> My question is: do current optimization techniques take care of
> > >>> such a scenario? My observation is that they don't, but I could
> > >>> be wrong here. If they do, how can I peek at the execution plan
> > >>> to make sure it is not spawning more Map-Reduce jobs than
> > >>> necessary?
> > >>>
> > >>> If they don't, is this something planned for the future?
> > >>>
> > >>> Also, I don't see the 'Pig Pen' debugging environment anywhere.
> > >>> Is it still a part of Pig, and if so, how can I use it?
> > >>>
> > >>> I know this has been a rather long mail, but any help is deeply
> > >>> appreciated, as going forward we plan to use Pig heavily to avoid
> > >>> writing custom Map-Reduce jobs for every different kind of
> > >>> analysis we intend to do.
> > >>>
> > >>> Thanks and Regards
> > >>> -Ankur
> > >>>
> > >>>
> >
>
