pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Coveney" <jcove...@gmail.com>
Subject Re: Review Request: SchemaTuple in Pig
Date Mon, 09 Apr 2012 16:25:27 GMT


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> >

Thanks again for taking a look Julien! I think we are close to having the structure where
we want it, there are just a couple of outstanding higher level points:
- How should PigContext be referenced?
- How should the intermediate compiled data be handle?
- What should the semantics of getting a SchemaTupleFactory be?
- Generating the SchemaTupleFactory so it can directly instantiate the given SchemaTuple
- What should and shouldn't be in the generated code

And probably some more. I responded below, and hopefully we can keep honing in on something
that looks good :)


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java,
lines 135-136
> > <https://reviews.apache.org/r/4651/diff/1-3/?file=100075#file100075line135>
> >
> >     instead of checking here if the tuple is generatable, the factory could default
to the regular TupleFactory if the generation fails.

I don't know that I like that semantic though. In this case, in the client code it's clear
that they know that they either are or are not getting a SchemaTupleFactory... if we default
to a regular TupleFactory, then we might think we are getting a SchemaTuple (and all the speed
and memory benefits) when we actually aren't. There are two cases where we would ask for a
SchemaTupleFactory and it wouldn't be available: a Pig bug in our implementation, or user
error. Shouldn't pig fail out intelligently instead of giving them a Factory that doesn't
do what they think?


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, line 31
> > <https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line31>
> >
> >     How complex is this class ? Not sure it is worth pulling the whole mahout just
for this.

It's very not complicated. I'll put the logic into SedesHelper.


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, line 84
> > <https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line84>
> >
> >     append...() methods should not be null
> >     (part of your TODO list?)

Not sure what you mean here


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, line 550
> > <https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line550>
> >
> >     handle the case where other == null
> >     Here you get NullPointerException

good call, will fix everywhere


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTuple.java, line 662
> > <https://reviews.apache.org/r/4651/diff/1-3/?file=100078#file100078line662>
> >
> >     If you override equals(), you should override hashCode(). 2 object that are
equal must return the same hashcode

whoops, I thought I had. will do


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 64
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line64>
> >
> >     try avoiding accessing the PigContext statically. Add it as a parameter and
see if it can be passed from somewhere it is actually known.

In the pig code that I looked at, this seemed by far the more common idiom (not that I love
it, but it's pretty hard to get the PigContext in there). I can't really find a single example
of code in this sort of path that doesn't get the PigContext statically. I could rewire Pig
to try and make this possible, but do you think that it is worth it/do you see any examples
in the code base that I missed where PigContext isn't called statically in this sort of case?
There are some examples of more purely front end things, and that's how I'd pipe it in I suppose,
but it seems like a big change for something that barely touches the codebase. I do think
a JIRA that tackles this issue in pig is reasonable though... trying to decrease the amount
of statically referenced jank. Await your thoughts.


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 71
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line71>
> >
> >     possibly we want to rename this and/or add something else for this file.

Question about class files: does a class file have to have the same name as the class it contains?
I know this is true for .java files with the java compiler, but not sure if it is true of
class files. If it is true of class files, then we can't muck with the name. If it isn't,
then ok.


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 86-87
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line86>
> >
> >     ?
> >     either get rid of the empty catch block or document why it is Ok.
> >     
> >     Here I don't see why it would be Ok that we can not instantiate that class

Hmm, the idea was that if you couldn't instantiate you'd regenerate, but looking at it now
you're right; it is odd that if we think we have the class, that we cannot instantiate it.
I'll throw a runtime error.


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 127
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line127>
> >
> >     "org.apache.pig.generated." + classname ?

Java question: if we're managing the compilation, does the compiled class have to be in org/apache/pig/generated
(or whatever) in the file directory?


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 138
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line138>
> >
> >     "org/apache/pig/generated/" + classname + ".class" ?

What's the win here? Or just general cleanliness?


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 140
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line140>
> >
> >     we should probably not compile in the current dir.
> >     Put it in the temp dir intead

I will see if this is possible. Do you know if Pig has an official local temp directory? I
found an hdfs temp directory flag, but couldn't find a local one.


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 322
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line322>
> >
> >     this is wasteful. Maybe DataByteArray could provide a static compare(byte[],
byte[]) ?

Lol, I checked and it exists. Will use.


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 265-269
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line265>
> >
> >     stuff which is not calling a generated field directly should be pulled up.

I don't know that I agree with this. Then we're going to have a bunch of really random methods
that have no context getting called for no real reason (I guess so that the compiler will
run it's checks instead of getting an error at compile time)? Some of the logic makes more
sense to do this with (the comparison of appends above, for example), but do you think the
win for highly specific pieces of code like this is worth it?


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, lines 392-396
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line392>
> >
> >     the exception handling can be pulled up

Same as above. Should it really be our goal to move everything humanly possible into SchemaTuple?


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 518
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line518>
> >
> >     same with a mapping for the java object for the type

Not sure what you want the code to look like here. If you want to explicitly do all the casts,
it will make the generated code bigger, and would seem to contradict your suggestion to call
as much from SchemaTuple as possible?


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java, line 559
> > <https://reviews.apache.org/r/4651/diff/3/?file=100952#file100952line559>
> >
> >     you can just remove the call to box, it should work

Good call


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleFactory.java, line 42
> > <https://reviews.apache.org/r/4651/diff/3/?file=100953#file100953line42>
> >
> >     do we need this. Let's just fall back to regular tuple when generation fails,
so that we don't need to keep those in sync

See above.


> On 2012-04-09 05:34:39, Julien Le Dem wrote:
> > trunk/src/org/apache/pig/data/SchemaTupleFactory.java, line 59
> > <https://reviews.apache.org/r/4651/diff/3/?file=100953#file100953line59>
> >
> >     The factory could be generated as well so that we don't need to use reflection
here.
> >     Reflection is slower to create new instances.

I didn't realize that doing Class.newInstance was slower than direct creation. I'll generate
it.


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4651/#review6784
-----------------------------------------------------------


On 2012-04-08 22:26:29, Jonathan Coveney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4651/
> -----------------------------------------------------------
> 
> (Updated 2012-04-08 22:26:29)
> 
> 
> Review request for pig and Julien Le Dem.
> 
> 
> Summary
> -------
> 
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing the Schema
on the frontend, we can code generate Tuples which can be used for fun and profit. In rudimentary
tests, the memory efficiency is 2-4x better, and it's ~15% smaller serialized (heavily heavily
depends on the data, though). Need to do get/set tests, but assuming that it's on par (or
even faster) than Tuple, the memory gain is huge.
> 
> Need to clean up the code and add tests.
> 
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema given to
UDF's. The next step is to make a SchemaBag, where I think the serialization savings will
be really huge.
> 
> Needs tests and comments, but I want the code to settle a bit.
> 
> 
> This addresses bug PIG-2632.
>     https://issues.apache.org/jira/browse/PIG-2632
> 
> 
> Diffs
> -----
> 
>   trunk/bin/pig 1310666 
>   trunk/build.xml 1310666 
>   trunk/ivy.xml 1310666 
>   trunk/ivy/libraries.properties 1310666 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java
1310666 
>   trunk/src/org/apache/pig/data/BinInterSedes.java 1310666 
>   trunk/src/org/apache/pig/data/FieldIsNullException.java PRE-CREATION 
>   trunk/src/org/apache/pig/data/PrimitiveTuple.java 1310666 
>   trunk/src/org/apache/pig/data/SchemaTuple.java PRE-CREATION 
>   trunk/src/org/apache/pig/data/SchemaTupleClassGenerator.java PRE-CREATION 
>   trunk/src/org/apache/pig/data/SchemaTupleFactory.java PRE-CREATION 
>   trunk/src/org/apache/pig/data/Tuple.java 1310666 
>   trunk/src/org/apache/pig/data/TupleFactory.java 1310666 
>   trunk/src/org/apache/pig/data/TypeAwareTuple.java 1310666 
>   trunk/src/org/apache/pig/data/utils/SedesHelper.java PRE-CREATION 
>   trunk/src/org/apache/pig/impl/PigContext.java 1310666 
>   trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1310666

> 
> Diff: https://reviews.apache.org/r/4651/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jonathan
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message