crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gabriel Reid" <gr...@apache.org>
Subject Re: Review Request: CRUNCH-128: Enable pipeline stages to depend on files being created on the filesystem.
Date Tue, 11 Dec 2012 22:10:32 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8463/#review14317
-----------------------------------------------------------


In general it seems pretty good to me. It would be nice if there were some more tests on this
functionality specifically (outside of being dependent on the MapsideJoinIT), but I'm also
having a hard time thinking of some good test cases. One idea of an additional test that *is*
related to MapsideJoins could be to ensure that multiple mapside joins run in parallel do
their setup in the same job (instead of each being run in separate jobs like in the past).

I did run into one issue by accident that put the planner in an infinite loop, but I'll put
the details of that on JIRA.


crunch/src/it/java/org/apache/crunch/lib/join/MapsideJoinIT.java
<https://reviews.apache.org/r/8463/#comment30490>

    This test should probably be called testMapsideJoin_MemPipeline instead now. Previously
the point of it was to point out that anything that wasn't a MRPipeline wouldn't work, whereas
now the point is to show that it specifically does work with a MemPipeline.



crunch/src/main/java/org/apache/crunch/impl/mr/collect/PCollectionImpl.java
<https://reviews.apache.org/r/8463/#comment30492>

    Is there a reason that a non-immutable Set isn't just used from in the start? When you
see this statement (instantiating a new set if the existing one is empty), it looks like a
mistake at first.


- Gabriel Reid


On Dec. 11, 2012, 7:26 a.m., Josh Wills wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8463/
> -----------------------------------------------------------
> 
> (Updated Dec. 11, 2012, 7:26 a.m.)
> 
> 
> Review request for crunch and Gabriel Reid.
> 
> 
> Description
> -------
> 
> This involves updating the PCollectionImpl class to be able to track any SourceTarget
instances that it needs to exist before any Target that depends on this PCollectionImpl can
be created, and optimizing the MSCRPlanner to check for this information and build the jobs
to incorporate these dependencies.
> 
> This isn't the prettiest implementation of this idea, but I think it'll turn out to be
a useful thing to have.
> 
> 
> This addresses bug CRUNCH-128.
>     https://issues.apache.org/jira/browse/CRUNCH-128
> 
> 
> Diffs
> -----
> 
>   crunch/src/it/java/org/apache/crunch/lib/join/MapsideJoinIT.java 297680e 
>   crunch/src/main/java/org/apache/crunch/Pipeline.java bcf8727 
>   crunch/src/main/java/org/apache/crunch/impl/mem/MemPipeline.java 77c41ce 
>   crunch/src/main/java/org/apache/crunch/impl/mr/MRPipeline.java 60950f3 
>   crunch/src/main/java/org/apache/crunch/impl/mr/collect/PCollectionImpl.java f0d8187

>   crunch/src/main/java/org/apache/crunch/impl/mr/plan/MSCRPlanner.java 7fe2809 
>   crunch/src/main/java/org/apache/crunch/io/ReadableSourceTarget.java 95c90aa 
>   crunch/src/main/java/org/apache/crunch/lib/join/MapsideJoin.java 0ca1ab3 
>   crunch/src/main/java/org/apache/crunch/materialize/MaterializableIterable.java 3830616

> 
> Diff: https://reviews.apache.org/r/8463/diff/
> 
> 
> Testing
> -------
> 
> Updated the mapside join IT to use the new code and fixed the in-memory impl to work
properly.
> 
> 
> Thanks,
> 
> Josh Wills
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message