pig-dev mailing list archives

From "jiraposter@reviews.apache.org (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2632) Create a SchemaTuple which generates efficient Tuples via code gen
Date Sun, 08 Apr 2012 07:20:43 GMT

    [ https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249490#comment-13249490 ]

jiraposter@reviews.apache.org commented on PIG-2632:
----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/
-----------------------------------------------------------

(Updated 2012-04-08 07:19:58.096859)


Review request for pig and Julien Le Dem.


Changes
-------

Julien, this incorporates many of your comments, but not all. Mainly, it has the refactoring
of the code. A couple of existing issues:
- The classloading is still janky. I'm not quite sure what the best approach is.
- I need to figure out how to register the classes I generate in the jar manifest.
- Because of the way the code is generated, protected fields don't quite work. The generated
code doesn't have a package, so only public methods are available to it. I marked the classes
it depended on as private, but I don't know if that is enough. If it's a big issue, I guess the
next thing to do is to figure out how to generate code in a specific package of my choice, and
ideally, how to generate the class in memory and add it to the jar.
- And of course some finer points: I need to implement a raw comparator, etc.
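
For the "specific package" and "in memory" points above, here is a minimal sketch of one possible
approach using the standard javax.tools compiler API. It is not what the patch does. The class
name InMemoryCodegenSketch, its helper types, and the package org.example.generated are all
hypothetical, and it assumes a JDK (not a bare JRE) is available at runtime.

```java
import javax.tools.*;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compile generated source in memory and load it by its declared package.
class InMemoryCodegenSketch {
    /** Holds generated source text in memory instead of in a file on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;
        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension),
                  Kind.SOURCE);
            this.code = code;
        }
        @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) { return code; }
    }

    /** Captures the compiler's bytecode output in a byte array. */
    static final class ByteCode extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ByteCode(String className) {
            super(URI.create("bytes:///" + className.replace('.', '/') + Kind.CLASS.extension),
                  Kind.CLASS);
        }
        @Override public OutputStream openOutputStream() { return bytes; }
    }

    /** Compiles the source and returns the class with the given fully qualified name. */
    static Class<?> compile(String className, String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) throw new IllegalStateException("no system Java compiler (JRE only?)");
        Map<String, ByteCode> out = new HashMap<>();
        JavaFileManager fm = new ForwardingJavaFileManager<StandardJavaFileManager>(
                compiler.getStandardFileManager(null, null, null)) {
            @Override
            public JavaFileObject getJavaFileForOutput(Location location, String name,
                                                       JavaFileObject.Kind kind, FileObject sibling) {
                ByteCode bc = new ByteCode(name);   // redirect .class output into memory
                out.put(name, bc);
                return bc;
            }
        };
        boolean ok = compiler.getTask(null, fm, null, null, null,
                List.of(new StringSource(className, source))).call();
        if (!ok) throw new IllegalStateException("compilation failed for " + className);
        ClassLoader loader = new ClassLoader(InMemoryCodegenSketch.class.getClassLoader()) {
            @Override protected Class<?> findClass(String name) throws ClassNotFoundException {
                ByteCode bc = out.get(name);
                if (bc == null) throw new ClassNotFoundException(name);
                byte[] b = bc.bytes.toByteArray();
                return defineClass(name, b, 0, b.length);
            }
        };
        try {
            return loader.loadClass(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Convenience for trying the result: instantiate and call a public no-arg method. */
    static Object call(Class<?> c, String method) {
        try {
            return c.getMethod(method).invoke(c.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The captured bytes could also be written into a jar instead of being loaded directly. One caveat:
a class defined by a separate class loader is in a different runtime package from an existing
class with the same package name, so declaring a package this way does not by itself grant
package-private access to classes loaded elsewhere.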

But I'd like to know if the general new structure works. Of course it's still very much a
work in progress, but the comments really help.

Lastly, I'd like to know how this should interact with PrimitiveTuples. I still think there
is a place for them (since SchemaTuples have to be generated on the front end but PrimitiveTuples
do not), but the whole TupleFactory.newTupleForSchema thing is weird... I went with a
TupleFactory.getInstanceForSchema(Schema) approach and liked it a lot more. Another question is
what to do when a SchemaTuple can't be generated for the Schema... one option is to just return a
regular Tuple, and another is to fail out. IMHO we should fail out, and require people to ensure
it's generatable, but I can see the argument otherwise. In general, for things like this, I think
it's better to fail early and explicitly than to let people think they have a special Tuple when
they don't. Philosophies may differ.
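
To make the "fail early" option concrete, here is a minimal sketch. Only the method name
getInstanceForSchema comes from the text above. The class SchemaFactorySketch, the register
method, and the use of a list of type names as a stand-in for Schema are assumptions made for
illustration, not the patch's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of a per-schema factory lookup that fails early.
class SchemaFactorySketch {
    // Stand-in for Schema: an ordered list of type names. Object[] stands in for a tuple.
    private final Map<List<String>, Supplier<Object[]>> generated = new HashMap<>();

    /** Records that a specialized tuple exists for this schema. */
    void register(List<String> schema, Supplier<Object[]> constructor) {
        generated.put(List.copyOf(schema), constructor);
    }

    /**
     * Returns the factory for the schema, or throws if none was generated.
     * The alternative discussed above would be to fall back to a generic tuple here.
     */
    Supplier<Object[]> getInstanceForSchema(List<String> schema) {
        Supplier<Object[]> factory = generated.get(schema);
        if (factory == null) {
            throw new IllegalArgumentException("no generated tuple for schema " + schema);
        }
        return factory;
    }
}
```

With this shape, a caller that asks for an unregistered schema finds out at lookup time rather
than silently receiving a generic tuple.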


Summary
-------

This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema on
the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary
tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (this depends
heavily on the data, though). I still need to do get/set tests, but assuming that it's on par
with (or even faster than) Tuple, the memory gain is huge.
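
As an illustration of the idea, here is a hand-written sketch of what a generated class for a
schema of (int, long) might look like. The class name IntLongTupleSketch, the method names, and
the null-tracking bitmask are assumptions; the real generated code may differ, and
IllegalStateException stands in for whatever exception the patch uses. The assumed point is that
primitives live in plain fields rather than as boxed objects in a list.

```java
// Hypothetical hand-written example of a schema-specific tuple for (int, long).
class IntLongTupleSketch {
    private int f0;
    private long f1;
    private byte nulls = 0b11;   // bit i set means field i is null; both start out null

    void setInt0(int v)   { f0 = v; nulls &= ~0b01; }
    void setLong1(long v) { f1 = v; nulls &= ~0b10; }

    boolean isNull(int i) { return (nulls & (1 << i)) != 0; }

    /** Typed getter: no boxing, but fails if the field is null. */
    int getInt0() {
        if (isNull(0)) throw new IllegalStateException("field 0 is null");
        return f0;
    }

    long getLong1() {
        if (isNull(1)) throw new IllegalStateException("field 1 is null");
        return f1;
    }

    /** Generic accessor in the style of a regular tuple: boxes on demand. */
    Object get(int i) {
        if (i < 0 || i > 1) throw new IndexOutOfBoundsException("field " + i);
        if (isNull(i)) return null;
        if (i == 0) return Integer.valueOf(f0);
        return Long.valueOf(f1);
    }
}
```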

Need to clean up the code and add tests.

Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to UDFs.
The next step is to make a SchemaBag, where I think the serialization savings will be really
huge.

Needs tests and comments, but I want the code to settle a bit.


This addresses bug PIG-2632.
    https://issues.apache.org/jira/browse/PIG-2632


Diffs (updated)
-----

  trunk/bin/pig 1310666 
  trunk/build.xml 1310666 
  trunk/ivy.xml 1310666 
  trunk/ivy/libraries.properties 1310666 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1310666 
  trunk/src/org/apache/pig/data/BinInterSedes.java 1310666 
  trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION 
  trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666 
  trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION 
  trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION 
  trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION 
  trunk/src/org/apache/pig/data/Tuple.java 1310666 
  trunk/src/org/apache/pig/data/TupleFactory.java 1310666 
  trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666 
  trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION 
  trunk/src/org/apache/pig/impl/PigContext.java 1310666 
  trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666 

Diff: https://reviews.apache.org/r/4651/diff


Testing
-------


Thanks,

Jonathan


                
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch

