crunch-user mailing list archives

From Gabriel Reid
Subject Re: Optimizations for repeated operations
Date Thu, 25 Jun 2015 07:28:49 GMT
Hi Everett,

No, there aren't currently any optimizations (or at least none that I'm
aware of) in Crunch that would skip a repeated operation like this. Any
call to parallelDo() and friends will always result in additional
operations being performed in the pipeline.

That being said, adding functionality like that might be as simple as
implementing equals and hashCode in one or more of the underlying
PCollection impls, so this might be an interesting thing to look into further
if there's a need for it.
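
To make the equals/hashCode idea concrete, here is a minimal sketch of how
such deduplication could work in principle. It is a toy model only: Node,
Planner, and intern are hypothetical names invented for illustration and are
not Crunch classes or Crunch APIs. It also assumes that "the same function"
means the same function instance, which is just one possible definition.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

/** Toy model only; none of these types are Crunch classes. */
final class PlanDedupSketch {

    /** A plan node: either a named source, or a function applied to a parent node. */
    static final class Node {
        final Node parent;        // null for a source node
        final String sourceName;  // null for a derived node
        final Function<?, ?> fn;  // null for a source node

        private Node(Node parent, String sourceName, Function<?, ?> fn) {
            this.parent = parent;
            this.sourceName = sourceName;
            this.fn = fn;
        }

        static Node source(String name) {
            return new Node(null, name, null);
        }

        Node apply(Function<?, ?> fn) {
            return new Node(this, null, fn);
        }

        // Two nodes are equal when they apply the same function instance
        // (compared by identity, an assumption of this sketch) to equal parents.
        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Node)) return false;
            Node other = (Node) o;
            return fn == other.fn
                && Objects.equals(sourceName, other.sourceName)
                && Objects.equals(parent, other.parent);
        }

        @Override
        public int hashCode() {
            return Objects.hash(sourceName, System.identityHashCode(fn), parent);
        }
    }

    /** Keeps one canonical node per distinct (parent, function) combination. */
    static final class Planner {
        private final Map<Node, Node> canonical = new HashMap<>();

        Node intern(Node node) {
            Node existing = canonical.putIfAbsent(node, node);
            return existing != null ? existing : node;
        }

        int distinctNodes() {
            return canonical.size();
        }
    }

    public static void main(String[] args) {
        Function<String, Integer> someMapFn1 = String::length;
        Planner planner = new Planner();
        Node x = planner.intern(Node.source("xCollection"));
        Node first = planner.intern(x.apply(someMapFn1));
        Node second = planner.intern(x.apply(someMapFn1));
        System.out.println(first == second);         // true: the repeated step is reused
        System.out.println(planner.distinctNodes()); // 2: the source plus one derived node
    }
}
```

In this toy model a second function instance with identical behaviour would
still count as a different step, so a real implementation would have to
decide how function equality should be defined.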

- Gabriel

On Wed, Jun 24, 2015 at 10:28 PM Everett Anderson wrote:

> Hi,
> I'm curious if Crunch attempts to perform any optimizations to avoid
> repeated operations, and, if so, how it figures out what's being repeated.
> For example, let's say I have a PCollection called xCollection and a
> utility method joinAndProcess that extracts keys for two collections by
> MapFns, joins, and does a parallelDo on the result like this:
> public PCollection<String> joinAndProcess(
>     PCollection<String> left,
>     PCollection<Double> right) {
>   *PTable<Integer, String> keyedLeftTable =
> left.by(someMapFn1);*
>   PTable<Integer, Double> keyedRightTable = right.by(...);
>   PTable<Integer, Pair<String, Double>> joinedTable = ... join ...
>   return joinedTable.parallelDo(...);
> }
> If I call joinAndProcess(xCollection, some other collection) multiple
> times, will Crunch be able to notice that the highlighted
> left.by(someMapFn1) is the same and reuse the result rather than recompute it?
> Would it be able to do so if the .by step were given the same name or
> same MapFn instance each time?
> Thanks,
> Everett
