incubator-crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gabriel Reid <>
Subject Checkpointing in pipelines
Date Thu, 06 Sep 2012 04:16:03 GMT
Hi guys,  

In some instances, we want to do some kind of iterative processing in Crunch, and run the
same (or a similar) DoFn on the same PCollection multiple times.

For example, let's say we've got a PCollection of "grid" objects, and we want to iteratively
divide each of these grids into four sub-grids, leading to exponential growth of the data.
The naive way to do this would be to do the following:

PCollection<Grid> grids = …;
for (…){
   grids = grids.parallelDo(new SubdivideFn());

However, the above code would be optimized into a single string of DoFns, and not increasing
the number of mappers we've got per iteration, which of course wouldn't work well with the
exponential growth of data.

The current way of getting around this is to add a call to materialize().iterator() on the
PCollection in each iteration (this is also done in the PageRankIT integration test).

What I propose is adding a "checkpoint" method to PCollection to signify that this should
be an actual step in processing. This could work as follows:

PCollection<Grid> grids = …;
for (…){
   grids = grids.parallelDo(new SubdivideFn()).checkpoint();

In the short term this could even be implemented as just a call to materialize().iterator(),
but putting encapsulating it in a method like this would allow us to work more efficiently
with it in the future, especially once CRUNCH-34 is merged.

Any thoughts on this? The actual name of the method is my biggest concern, I'm not sure if
"checkpoint" is the best name for it, but I can't think of anything better at the moment.

- Gabriel  

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message