pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dmitriy Ryaboy" <dvrya...@gmail.com>
Subject Re: Review Request: PIG-2831: MR-Cube implementation (Distributed cubing for holistic measures)
Date Fri, 17 Aug 2012 20:43:58 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6651/#review10473
-----------------------------------------------------------


Publishing the first set of comments (got through to MapReduceOper). More later.


src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
<https://reviews.apache.org/r/6651/#comment22332>

    does this break if we have 2 cube operators? should we add some identifier to the filename?



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
<https://reviews.apache.org/r/6651/#comment22333>

    no, let's rely on the general estimator framework rather than special-casing. Having separate
chunks of code dealing with this through the codebase will lead to confusion.  Now that we
allow passing a full plan into the reducer estimator, it will be possible to write an estimator
that walks the plan and uses factors based on knowledge of inputs and operators that work
on the input to adjust the naive estimation; we should encode any cube operator-specific logic
into such an estimator.
    
    I think you can go ahead and remove this comment block.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
<https://reviews.apache.org/r/6651/#comment22334>

    kill dead code.
    
    let's note that the minimal number of reducers should be the max partition value, and
how to get it, in POCube documentation. That way, whoever implements an improved estimator
can follow this advice when they encounter POCube.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22351>

    you are adding a lot of code to this class. At 2800 lines, it's already too big a piece
of code.
    
    perhaps some of it can be factored out into CubeCompiler or something similar? 



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22335>

    break this code out into its own function.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22337>

    can you add a comment describing what the plan looks like before and after this is applied?



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22336>

    COUNT.class.getName(), COUNT_STAR.class.getName() 
    
    



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22338>

    The trick of turning COUNTs into SUMs of COUNTS is already implicitly encoded in COUNT's
implementation of Algebraic. Could we use the algebraic interface to do this generically,
rather than hardcoding individual functions in the compiler?



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22339>

    why 0? "input" might be projecting something else, right? Can you reuse "input"?



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22340>

    @Override



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22341>

    drop "== true", it's already a boolean.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22342>

    as discussed, getting the total number of records is problematic.
    
    What if we write a sampler that does something along the lines of opening all input splits
and reading from each until 2M are reached or input is exhausted?
    
    This doesn't let us use the 100K for 2m-2b heuristic, but it's something.
    
    this is also where we can look at LoadFunc implementations and see if we can just get
an estimated total # of records from it -- if we can, we don't have to do the above compromise,
and can apply all the heuristics described by Arnab.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22344>

    go ahead and write a test.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22346>

    don't compare booleans to true. 
    
    just do
    
    if (className.equals(...)) {
    
    }



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22347>

    same



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
<https://reviews.apache.org/r/6651/#comment22349>

    How come? That sounds a little scary. Let's understand this.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceOper.java
<https://reviews.apache.org/r/6651/#comment22353>

    can these be added to POCube and extracted from within the contained plans instead?


- Dmitriy Ryaboy


On Aug. 16, 2012, 10:19 p.m., Prasanth_J wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6651/
> -----------------------------------------------------------
> 
> (Updated Aug. 16, 2012, 10:19 p.m.)
> 
> 
> Review request for pig and Dmitriy Ryaboy.
> 
> 
> Description
> -------
> 
> This is a review board request for https://issues.apache.org/jira/browse/PIG-2831
> 
> 
> This addresses bug PIG-2831.
>     https://issues.apache.org/jira/browse/PIG-2831
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
8029dec 
>   src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java 1d05a20

>   src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceOper.java
b87c209 
>   src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRPrinter.java
157caad 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java
cde340c 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java
ff65146 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCube.java
PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 0502917 
>   src/org/apache/pig/builtin/CubeDimensions.java 5652029 
>   src/org/apache/pig/builtin/PigStorage.java 21e835f 
>   src/org/apache/pig/builtin/RollupDimensions.java f6c26e4 
>   src/org/apache/pig/impl/builtin/HolisticCube.java PRE-CREATION 
>   src/org/apache/pig/impl/builtin/HolisticCubeCompoundKey.java PRE-CREATION 
>   src/org/apache/pig/impl/builtin/PartitionMaxGroup.java PRE-CREATION 
>   src/org/apache/pig/impl/builtin/PostProcessCube.java PRE-CREATION 
>   src/org/apache/pig/impl/io/ReadSingleLoader.java PRE-CREATION 
>   src/org/apache/pig/impl/util/Utils.java 270cb6a 
>   src/org/apache/pig/newplan/logical/optimizer/LogicalPlanPrinter.java 13439c6 
>   src/org/apache/pig/newplan/logical/relational/LOCube.java b262efb 
>   src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 127ab7a

>   src/org/apache/pig/newplan/logical/rules/ColumnPruneHelper.java 369f5c2 
>   src/org/apache/pig/parser/LogicalPlanBuilder.java 289a76f 
>   src/org/apache/pig/pen/EquivalenceClasses.java 194f8cb 
>   src/org/apache/pig/pen/LineageTrimmingVisitor.java 917073c 
>   src/org/apache/pig/pen/util/DisplayExamples.java 265f8f7 
>   test/org/apache/pig/impl/builtin/TestHolisticCubeCompundKey.java PRE-CREATION 
>   test/org/apache/pig/impl/builtin/TestPartitionMaxGroup.java PRE-CREATION 
>   test/org/apache/pig/impl/builtin/TestPostProcessCube.java PRE-CREATION 
>   test/org/apache/pig/test/TestCubeOperator.java 65d56a6 
> 
> Diff: https://reviews.apache.org/r/6651/diff/
> 
> 
> Testing
> -------
> 
> Unit tests: All passed
> 
> Pre-commit tests: All passed
> ant clean test-commit
> 
> 
> Thanks,
> 
> Prasanth_J
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message