hadoop-pig-dev mailing list archives

From Thejas Nair <te...@yahoo-inc.com>
Subject Re: [Pig Wiki] Update of "ProposedProjects" by AlanGates
Date Thu, 16 Apr 2009 17:14:16 GMT
This paper seems very relevant to the proposal -
"Compiled Query Execution Engine using JVM"
http://www2.computer.org/portal/web/csdl/doi/10.1109/ICDE.2006.40

From the abstract -
"Our experimental results on the TPC-H data set show that, despite both
engines benefiting from JIT, the compiled engine runs on average about twice
as fast as the interpreted one, and significantly faster than an in-memory"

(I don't have access to the full paper though).
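
To make the interpreted vs. compiled distinction concrete, here is a minimal Java sketch. It is only an illustration under my own assumptions: the class, interface and method names are hypothetical, it is not Pig's operator API, and it is not taken from the paper. `runInterpreted` pushes each row through a configured list of generic operators, while `runGenerated` is the kind of job-specific loop a code generator might emit for "filter where field 1 > 10, then project field 0". It shows the structural difference only and measures nothing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not Pig's operator API.
final class PipelineSketch {

    /** A pre-built operator: returns the transformed row, or null to drop it. */
    interface Operator {
        Object[] apply(Object[] row);
    }

    /** Generic filter operator: keeps rows whose integer column exceeds the threshold. */
    static Operator filterGreaterThan(int col, int threshold) {
        return row -> ((Integer) row[col]) > threshold ? row : null;
    }

    /** Generic projection operator: keeps a single column. */
    static Operator project(int col) {
        return row -> new Object[] { row[col] };
    }

    /** Interpreted style: every row walks a chain of operators chosen at runtime. */
    static List<Object[]> runInterpreted(List<Object[]> rows, List<Operator> ops) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] row : rows) {
            Object[] cur = row;
            for (Operator op : ops) {
                cur = op.apply(cur);
                if (cur == null) {
                    break; // row was filtered out
                }
            }
            if (cur != null) {
                out.add(cur);
            }
        }
        return out;
    }

    /** Generated style: filter and projection fused into one job-specific loop. */
    static List<Object[]> runGenerated(List<Object[]> rows) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] row : rows) {
            int value = (Integer) row[1];
            if (value > 10) {
                out.add(new Object[] { row[0] });
            }
        }
        return out;
    }
}
```

Both methods return the same rows for the same input; the generated version simply has no per-row operator dispatch.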

-Thejas


On 4/16/09 9:26 AM, "Alan Gates" <gates@yahoo-inc.com> wrote:

> Your understanding of the proposal is correct.  The goal would be to
> produce Java code rather than a pipeline configuration.  But the
> reasoning is not so that users can then take that code and modify it
> themselves.  There's nothing preventing them from doing so, but it has
> a couple of major drawbacks.
> 
> 1) Code generators generally generate horrific-looking code, because
> they are going for speed and compactness, not human maintainability.
> Trying to work in that code would be very difficult.
> 
> 2) If you start adding code to generated code, you can no longer use
> the original Pig Latin.  You are from that point forward stuck in
> Java, since you can't backport your Java into the Pig Latin.
> 
> The proposal is designed to test the performance of Pig based on
> generated Java (or, for that matter, any other language; it need not be
> Java).  For the idea you suggest, the NATIVE keyword (proposed in
> https://issues.apache.org/jira/browse/PIG-506) is a better solution.
> 
> Alan.
> 
> On Apr 16, 2009, at 12:54 AM, nitesh bhatia wrote:
> 
>> Hi
>> Can you briefly explain what is required in the first project? After
>> reading the description, my impression is that currently, when we execute
>> commands in the Pig shell, Pig first converts them to map-reduce jobs and
>> then feeds them to Hadoop. In this project, are we proposing that the
>> execution plan made by Pig will first be converted to a Java file for the
>> map-reduce procedure and then fed to Hadoop?
>> 
>> If this is the case, then I am sure it will be a great help to users, as
>> this functionality can be used to write complicated map-reduce jobs very
>> easily. Initially the user can write the Pig scripts / commands required
>> for the job and get the map-reduce Java files, then edit those files to
>> extend the functionality and add extra procedures that are not provided by
>> Pig but can be executed on Hadoop.
>> 
>> --nitesh
>> 
>> On Wed, Apr 15, 2009 at 9:57 PM, Apache Wiki <wikidiffs@apache.org> wrote:
>> 
>>> Dear Wiki user,
>>> 
>>> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
>>> change notification.
>>> 
>>> The following page has been changed by AlanGates:
>>> http://wiki.apache.org/pig/ProposedProjects
>>> 
>>> New page:
>>> = Proposed Pig Projects =
>>> This page describes projects that we (the committers) would like to see
>>> added to Pig.  The scale of these projects varies, but they are larger
>>> projects, usually on the weeks or months scale.  We have not yet filed
>>> [https://issues.apache.org/jira/browse/PIG JIRAs] for some of these
>>> because they are still in the vague idea stage.  As they become more
>>> concrete, [https://issues.apache.org/jira/browse/PIG JIRAs] will be
>>> filed for them.
>>> 
>>> We welcome contributors to take on one of these projects.  If you would
>>> like to do so, please file a JIRA (if one does not already exist for the
>>> project) with a proposed solution.  Pig's committers will work with you
>>> from there to help refine your solution.  Once a solution is agreed upon,
>>> you can begin implementation.
>>> 
>>> If you see a project here that you would like to see Pig implement but
>>> you are not in a position to implement the solution right now, feel free
>>> to vote for the project.  Add your name to the list of supporters.  This
>>> will help contributors looking for a project to select one that will
>>> benefit many users.
>>> 
>>> If you would like to propose a project for Pig, feel free to add to this
>>> list.  If it is a smaller project, or something you plan to begin work on
>>> immediately, filing a [https://issues.apache.org/jira/browse/PIG JIRA] is
>>> a better route.
>>> 
>>> || Category || Project || JIRA || Proposed By || Votes For ||
>>> || Execution || Pig currently executes scripts by building a pipeline of pre-built operators and running data through those operators in map reduce jobs.  We need to investigate instead having Pig generate Java code specific to a job, then compiling that code and using it to run the map reduce jobs. || || Many conference attendees || gates ||
>>> || Language || Currently only DISTINCT, ORDER BY, and FILTER are allowed inside FOREACH.  All operators should be allowed in FOREACH.  (LIMIT is being worked on in [https://issues.apache.org/jira/browse/PIG-741 741].) || || gates || ||
>>> || Optimization || Speed up comparison of tuples during shuffle for ORDER BY || [https://issues.apache.org/jira/browse/PIG-659 659] || olgan || ||
>>> || Optimization || Order by should be changed to not use POPackage to put all of the tuples in a bag on the reduce side, as the bag is just immediately flattened.  It can instead work like join does for the last input in the join. || || gates || ||
>>> || Optimization || Often in a Pig script that produces a chain of MR jobs, the map phases of the 2nd and subsequent jobs do very little.  What little they do should be pushed into the preceding reduce and the map replaced by the identity mapper.  Initial tests showed that the identity mapper was 50% faster than using a Pig mapper (because Pig uses the loader to parse out tuples even if the map itself is empty). || [https://issues.apache.org/jira/browse/PIG-480 480] || olgan || gates ||
>>> || Optimization || Use hand crafted calls to do string to integer or float conversions.  Initial tests showed these could be done about 8x faster than String.toIntger() and String.toFloat(). || [https://issues.apache.org/jira/browse/PIG-482 482] || olgan || gates ||
>>> || Optimization || Currently Pig always samples for an ORDER BY to determine how to partition, and then runs another job to do the sort.  For small enough inputs, it should just sort with a single reducer. || [https://issues.apache.org/jira/browse/PIG-483 483] || olgan || ||
>>> || Optimization || In many cases data to be joined is already sorted and partitioned on the same key.  Pig needs to be able to take advantage of this and do these joins in the map.  The join could be done by sampling one input to determine the value of the join key at the beginning of every HDFS block.  This would form an index.  Then a second MR job can be run with the other input.  Based on the key seen in the second input, the appropriate blocks of the first input can also be loaded into the map and the join done. || || gates || ||
>>> || Optimization || The combiner is not currently used if FILTER is in the FOREACH.  In some cases it could still be used. || [https://issues.apache.org/jira/browse/PIG-479 479] || olgan || ||
>>> || Optimization || Currently, when types of data are declared, Pig inserts a FOREACH immediately after the LOAD that does the conversions.  These conversions should be delayed until the field is actually used. || [https://issues.apache.org/jira/browse/PIG-410 410] || olgan || gates ||
>>> || Optimization || When an order by is not the only operation in a pig script, it is done in two additional MR jobs.  The first job samples using a sampling loader, the second does the sort.  The sample is used to construct a partitioner that equally balances the data in the sort.  The sampler needs to be changed to be a !EvalFunc instead of a loader.  This way a split can be put in the preceding MR job, with the main data being written out and the other part flowing to the sampler func, which can then write out the sample.  The final MR job can then be the sort. || || gates || ||
>>> || Optimization || When an order by is the only operation in a pig script it is currently done in 3 MR jobs.  The first converts it to BinStorage format (because the sample loader reads that format), the second samples, and the third sorts.  Once the changes mentioned above to make the sampler an !EvalFunc are done it should be changed to be done in 2 MR jobs instead of 3. || [https://issues.apache.org/jira/browse/PIG-460 460] || gates || ||
>>> || Optimization || The Pig optimizer should be used to determine when fields in a record are no longer needed and put in FOREACH statements to project out the unnecessary data as early as possible. || [https://issues.apache.org/jira/browse/PIG-466 466] || olgan || ||
>>> || Optimization || The Pig optimizer needs to call fieldsToRead so that Load functions that can do column skipping do it. || || gates || ||
>>> || Scalability || Pig's default join (symmetric hash) currently depends on being able to fit all of the values for a given join key for one of the inputs into memory.  (It does try to spill to disk in the case where it cannot fit them all into memory.  In practice this often fails as it is not good at understanding when memory is low enough that it should spill.  Even in the case where it does not fail, spilling to disk and rereading from disk is very slow.)  Instances of keys with a large number of values could instead be broken up so that each row set fits in memory and then shipped to multiple reducers.  A sampling pass would need to be done first to determine which keys to break up. || || chris olston || gates ||
>>> 
>> 
>> 
>> 
>> -- 
>> Nitesh Bhatia
>> Dhirubhai Ambani Institute of Information & Communication Technology
>> Gandhinagar
>> Gujarat
>> 
>> "Life is never perfect. It just depends where you draw the line."
>> 
>> visit:
>> http://www.awaaaz.com - connecting through music
>> http://www.volstreet.com - lets volunteer for better tomorrow
>> http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
> 
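
The wiki table quoted above proposes hand-crafted string-to-integer conversion (PIG-482). As a rough illustration of what such a routine might look like, here is a minimal Java sketch that parses a signed decimal integer straight from ASCII bytes, without first building a `String`. The class name `FastIntParser` and its signature are assumptions made for this example, not code from Pig or from the JIRA, and no speedup is claimed for it.

```java
// Hypothetical helper; not taken from Pig or from PIG-482.
final class FastIntParser {

    /**
     * Parses an optional sign followed by decimal digits from buf[off .. off+len).
     * Throws NumberFormatException on empty input, non-digit bytes, or values
     * outside the int range.
     */
    static int parseInt(byte[] buf, int off, int len) {
        if (len <= 0) {
            throw new NumberFormatException("empty input");
        }
        int i = off;
        int end = off + len;
        boolean negative = false;
        if (buf[i] == '-' || buf[i] == '+') {
            negative = buf[i] == '-';
            i++;
            if (i == end) {
                throw new NumberFormatException("sign without digits");
            }
        }
        // Accumulate in a long so overflow past the int range can be detected.
        long acc = 0;
        final long limit = (long) Integer.MAX_VALUE + 1; // magnitude of Integer.MIN_VALUE
        for (; i < end; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("non-digit byte at index " + i);
            }
            acc = acc * 10 + digit;
            if (acc > limit) {
                throw new NumberFormatException("value out of int range");
            }
        }
        if (negative) {
            return (int) -acc;
        }
        if (acc > Integer.MAX_VALUE) {
            throw new NumberFormatException("value out of int range");
        }
        return (int) acc;
    }
}
```

A caller holding a field as raw bytes would invoke `FastIntParser.parseInt(bytes, 0, bytes.length)` where it would otherwise decode a `String` and call `Integer.parseInt`.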

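
The ORDER BY rows in the quoted table describe using a sample to construct a partitioner that balances the data across reducers. The idea can be sketched in Java as follows, assuming integer sort keys for simplicity. `SampleRangePartitioner` is a hypothetical class written for this message, not Pig's or Hadoop's partitioner: it sorts the sample, takes evenly spaced quantiles as split points, and routes each key by binary search.

```java
import java.util.Arrays;

// Hypothetical illustration; not Pig's or Hadoop's partitioner.
final class SampleRangePartitioner {

    // Sorted boundaries between partitions; length is numPartitions - 1.
    private final int[] splitPoints;

    SampleRangePartitioner(int[] sample, int numPartitions) {
        if (numPartitions < 1 || sample.length == 0) {
            throw new IllegalArgumentException("need a non-empty sample and at least one partition");
        }
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        splitPoints = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            // Use the i-th quantile of the sorted sample as a boundary.
            int idx = (int) ((long) i * sorted.length / numPartitions);
            splitPoints[i - 1] = sorted[Math.min(idx, sorted.length - 1)];
        }
    }

    /** Returns a partition in [0, numPartitions); larger keys never map to smaller partitions. */
    int partitionFor(int key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // A key equal to a boundary goes to the partition above that boundary;
        // otherwise binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

Because partition numbers increase with the key, concatenating the sorted output of reducer 0, reducer 1, and so on yields a globally sorted result. How well the load is balanced depends on how representative the sample is.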
