crunch-user mailing list archives

From Clément MATHIEU <clem...@unportant.info>
Subject Re: PipelineResult VS materialize()
Date Wed, 06 Jan 2016 17:04:55 GMT
On 2016-01-06 17:19, Josh Wills wrote:

Hi Josh,

> I added a getPipelineResult() method to the MaterializableIterable in
> CRUNCH-400: does it not do what you want?
> https://github.com/apache/crunch/commit/ded504eb133fa0814e2d90ff2a662e72a67e04bb

It indeed gives access to the PipelineResult, but I find it error-prone:

  - It is hidden in an Iterable, which has to be cast to reach it

  - The code that consumes the iterable is most likely business code, 
which should not care at all about infrastructure concerns

  - The result is only available once iterator() has been called, and 
there is no way to be notified when that happens
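
To make the three points above concrete, here is a minimal Java sketch. 
It does not use Crunch at all: LazyIterable, FakeResult and the RECORDS 
counter are hypothetical stand-ins that only mimic the behaviour 
described in this thread (a result that exists only after iterator() has 
been called). The real classes may well behave differently in the 
details.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a pipeline result: just a bag of counters.
final class FakeResult {
    final Map<String, Long> counters;

    FakeResult(Map<String, Long> counters) {
        this.counters = counters;
    }
}

// Hypothetical stand-in for a lazily materialized collection. The "run"
// happens, and the result exists, only once iterator() has been called.
final class LazyIterable implements Iterable<String> {
    private final List<String> data;
    private FakeResult result; // stays null until the first iterator() call

    LazyIterable(List<String> data) {
        this.data = data;
    }

    @Override
    public Iterator<String> iterator() {
        if (result == null) {
            Map<String, Long> counters = new TreeMap<>();
            counters.put("RECORDS", (long) data.size());
            result = new FakeResult(counters);
        }
        return data.iterator();
    }

    FakeResult getResult() {
        return result;
    }
}

final class CounterCollectionSketch {
    // Business code: it only sees a plain Iterable.
    static int totalLength(Iterable<String> lines) {
        int total = 0;
        for (String line : lines) {
            total += line.length();
        }
        return total;
    }

    // Infrastructure code: it has to downcast to reach the result, and it
    // gets nothing if it runs before the iterable has been consumed.
    static Long recordsCounter(Iterable<String> lines) {
        if (!(lines instanceof LazyIterable)) {
            return null;
        }
        FakeResult result = ((LazyIterable) lines).getResult();
        return result == null ? null : result.counters.get("RECORDS");
    }

    public static void main(String[] args) {
        Iterable<String> lines = new LazyIterable(Arrays.asList("a", "bb", "ccc"));
        System.out.println(recordsCounter(lines)); // null: asked too early
        System.out.println(totalLength(lines));    // 6
        System.out.println(recordsCounter(lines)); // 3: only after iteration
    }
}
```

The same recordsCounter() call returns nothing or the counter depending 
only on whether it runs before or after the business code, which is 
exactly the kind of ordering mistake that goes unnoticed.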


I might be wrong, but I believe that collecting all the counters of a 
pipeline is a common pattern.

My team has been burned several times by "missing counters": a developer 
not knowing the MaterializableIterable trick, a plain oversight, 
getPipelineResult() being called before iterator() is actually called, 
code that only "worked" until someone moved a call to run() after the 
materialize(), and so on.

I am wondering how other Crunch users deal with counter collection. Do 
they always carefully extract the PipelineResult from each iterable 
after using it? Are they happy with this pattern? Did they hack 
together something like my HyperthymesticMRPipeline, or something else?


Regards,

Clément



