incubator-crunch-dev mailing list archives

From "Josh Wills" <jwi...@cloudera.com>
Subject Re: Review Request: CRUNCH-128: Enable pipeline stages to depend on files being created on the filesystem.
Date Thu, 13 Dec 2012 01:46:09 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8463/
-----------------------------------------------------------

(Updated Dec. 13, 2012, 1:46 a.m.)


Review request for crunch and Gabriel Reid.


Changes
-------

Came up with a much better way to do this: we pass the SourceTargets that the DoFn needs in
order to run to the parallelDo call itself, which eliminates the possibility of cyclic
dependencies. Also updated the mapside join IT to verify that we run the right number of
MapReduce jobs, even if the mapside joins are applied out of order.
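
To illustrate why declaring the dependencies on the creating call rules out cycles, here is a
minimal sketch in plain Java. It is a toy model and not the Crunch API: DependencySketch,
Stage, FileTarget, and dependsOn are hypothetical names, and the parallelDo signature shown is
an assumption made for illustration only. The point is that a stage's required targets are
fixed when the stage is created, so they can only refer to stages that already exist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only; none of these names are the Crunch API.
final class DependencySketch {

    /** Stand-in for a SourceTarget: a file plus the stage that produces it. */
    static final class FileTarget {
        final String path;
        final Stage producer;

        FileTarget(String path, Stage producer) {
            this.path = path;
            this.producer = producer;
        }
    }

    /** Stand-in for a PCollection; every field is fixed at construction. */
    static final class Stage {
        final String name;
        final Stage parent; // null for an input stage
        final List<FileTarget> requires;

        private Stage(String name, Stage parent, List<FileTarget> requires) {
            this.name = name;
            this.parent = parent;
            this.requires = Collections.unmodifiableList(new ArrayList<>(requires));
        }

        static Stage read(String name) {
            return new Stage(name, null, Collections.<FileTarget>emptyList());
        }

        /** The required targets are passed with the call that creates the stage. */
        Stage parallelDo(String name, FileTarget... requires) {
            return new Stage(name, this, Arrays.asList(requires));
        }

        FileTarget write(String path) {
            return new FileTarget(path, this);
        }

        /** True if this stage needs 'other' to have run, directly or transitively. */
        boolean dependsOn(Stage other) {
            if (parent != null && (parent == other || parent.dependsOn(other))) {
                return true;
            }
            for (FileTarget t : requires) {
                if (t.producer == other || t.producer.dependsOn(other)) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Stage small = Stage.read("small").parallelDo("filter");
        FileTarget smallFile = small.write("/tmp/small");
        Stage joined = Stage.read("big").parallelDo("join", smallFile);
        System.out.println(joined.dependsOn(small)); // true
        System.out.println(small.dependsOn(joined)); // false
    }
}
```

Because nothing can be added to a Stage after it is constructed, there is no way in this model
to make "small" depend on "joined" after the fact, so the dependency graph stays acyclic.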


Description
-------

This involves updating the PCollectionImpl class to track any SourceTarget instances that must
exist before any Target that depends on that PCollectionImpl can be created, and updating the
MSCRPlanner to check for this information and build the jobs so that they respect these
dependencies.
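
As a rough illustration of the planning side, the sketch below orders jobs so that each one
runs after the jobs that produce the files it requires. It is not MSCRPlanner code:
JobOrderSketch, Job, and order are hypothetical names, and treating a required path with no
producer as already present on the filesystem is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy scheduler, not MSCRPlanner; all names are hypothetical.
final class JobOrderSketch {

    static final class Job {
        final String name;
        final Set<String> produces = new LinkedHashSet<>();
        final Set<String> requires = new LinkedHashSet<>();

        Job(String name) {
            this.name = name;
        }

        Job produces(String path) {
            produces.add(path);
            return this;
        }

        Job requires(String path) {
            requires.add(path);
            return this;
        }
    }

    /** Orders jobs so each one comes after the producers of the paths it requires. */
    static List<String> order(List<Job> jobs) {
        Map<String, Job> producerOf = new LinkedHashMap<>();
        for (Job job : jobs) {
            for (String path : job.produces) {
                producerOf.put(path, job);
            }
        }
        List<String> ordered = new ArrayList<>();
        Set<Job> done = new LinkedHashSet<>();
        for (Job job : jobs) {
            visit(job, producerOf, done, new LinkedHashSet<Job>(), ordered);
        }
        return ordered;
    }

    // Depth-first visit: schedule the producers of required paths first.
    private static void visit(Job job, Map<String, Job> producerOf, Set<Job> done,
                              Set<Job> inProgress, List<String> ordered) {
        if (done.contains(job)) {
            return;
        }
        if (!inProgress.add(job)) {
            throw new IllegalStateException("cyclic dependency at job " + job.name);
        }
        for (String path : job.requires) {
            Job producer = producerOf.get(path);
            // Assumption: a path with no producer already exists on the filesystem.
            if (producer != null) {
                visit(producer, producerOf, done, inProgress, ordered);
            }
        }
        inProgress.remove(job);
        done.add(job);
        ordered.add(job.name);
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>();
        jobs.add(new Job("join").requires("/tmp/small")); // listed first on purpose
        jobs.add(new Job("filter").produces("/tmp/small"));
        System.out.println(order(jobs)); // [filter, join]
    }
}
```

The jobs are deliberately listed out of order to show that the resulting order follows the
declared file dependencies rather than the order in which the jobs were added.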

This isn't the prettiest implementation of this idea, but I think it'll turn out to be a useful
thing to have.


This addresses bug CRUNCH-128.
    https://issues.apache.org/jira/browse/CRUNCH-128


Diffs (updated)
-----

  crunch/src/it/java/org/apache/crunch/lib/join/MapsideJoinIT.java 297680e 
  crunch/src/main/java/org/apache/crunch/PCollection.java f5a3465 
  crunch/src/main/java/org/apache/crunch/Pipeline.java bcf8727 
  crunch/src/main/java/org/apache/crunch/impl/mem/MemPipeline.java 77c41ce 
  crunch/src/main/java/org/apache/crunch/impl/mem/collect/MemCollection.java 61bb1e7 
  crunch/src/main/java/org/apache/crunch/impl/mr/MRPipeline.java 60950f3 
  crunch/src/main/java/org/apache/crunch/impl/mr/collect/DoCollectionImpl.java 1f4fea2 
  crunch/src/main/java/org/apache/crunch/impl/mr/collect/DoTableImpl.java 1d19580 
  crunch/src/main/java/org/apache/crunch/impl/mr/collect/PCollectionImpl.java f0d8187 
  crunch/src/main/java/org/apache/crunch/impl/mr/collect/PTableBase.java 9183784 
  crunch/src/main/java/org/apache/crunch/impl/mr/plan/MSCRPlanner.java 7fe2809 
  crunch/src/main/java/org/apache/crunch/io/ReadableSourceTarget.java 95c90aa 
  crunch/src/main/java/org/apache/crunch/lib/join/MapsideJoin.java 0ca1ab3 
  crunch/src/main/java/org/apache/crunch/materialize/MaterializableIterable.java 3830616 

Diff: https://reviews.apache.org/r/8463/diff/


Testing
-------

Updated the mapside join IT to use the new code and fixed the in-memory impl to work properly.


Thanks,

Josh Wills

