crunch-user mailing list archives

From Everett Anderson
Subject Optimizations for repeated operations
Date Wed, 24 Jun 2015 20:27:02 GMT

I'm curious if Crunch attempts to perform any optimizations to avoid
repeated operations, and, if so, how it figures out what's being repeated.

For example, let's say I have a PCollection called xCollection and a utility
method joinAndProcess that extracts keys for two collections with MapFns,
joins them, and does a parallelDo on the result, like this:

public PCollection<String> joinAndProcess(
    PCollection<String> left,
    PCollection<Double> right) {
  // Highlighted step: key the left collection with someMapFn1.
  PTable<Integer, String> keyedLeftTable = left.by(someMapFn1, ...);
  PTable<Integer, Double> keyedRightTable = right.by(...);
  PTable<Integer, Pair<String, Double>> joinedTable = ... join ...
  return joinedTable.parallelDo(...);
}

If I call joinAndProcess(xCollection, some other collection) multiple
times, will Crunch be able to notice that the highlighted keyedLeftTable
step (keyed by someMapFn1) is the same each time and reuse the result
rather than recompute it?

Would it be able to do so if the .by step were given the same name or same
MapFn instance each time?
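
To make the "same MapFn instance" case concrete, here is a rough plain-Java
sketch of the kind of reuse I have in mind. It is not Crunch code and says
nothing about what the Crunch planner actually does: ReuseSketch, LengthFn,
keyBy, and keyingRuns are all made-up names, and a List stands in for a
PCollection. The keyed result is cached per (input object, key function
object), so only a call with the very same instances is treated as repeated.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration only; none of these names are Crunch APIs.
class ReuseSketch {
    // Hypothetical stand-in for someMapFn1: keys a string by its length.
    static class LengthFn implements Function<String, Integer> {
        @Override
        public Integer apply(String s) {
            return s.length();
        }
    }

    // Counts how many times the keying pass actually ran.
    static int keyingRuns = 0;

    // Cache keyed by object identity: input list -> (key function -> keyed result).
    private static final Map<List<String>,
            Map<Function<String, Integer>, List<Map.Entry<Integer, String>>>> CACHE =
            new IdentityHashMap<>();

    static List<Map.Entry<Integer, String>> keyBy(
            List<String> input, Function<String, Integer> keyFn) {
        return CACHE
                .computeIfAbsent(input, in -> new IdentityHashMap<>())
                .computeIfAbsent(keyFn, fn -> {
                    keyingRuns++; // only reached on a cache miss
                    List<Map.Entry<Integer, String>> out = new ArrayList<>();
                    for (String s : input) {
                        out.add(new AbstractMap.SimpleImmutableEntry<>(fn.apply(s), s));
                    }
                    return out;
                });
    }

    public static void main(String[] args) {
        List<String> xCollection = Arrays.asList("a", "bb", "ccc");
        LengthFn sameFn = new LengthFn();

        keyBy(xCollection, sameFn);          // first call: computed
        keyBy(xCollection, sameFn);          // same instances: reused
        keyBy(xCollection, new LengthFn());  // equal logic, new instance: recomputed

        System.out.println("keying passes run: " + keyingRuns);
    }
}
```

The third call is the case I'm least sure about: the logic is identical, but
without comparing instances (or names) there is no obvious way to tell that
the two functions do the same thing.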


