crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: MRPipeline.cache()
Date Fri, 13 Nov 2015 17:44:17 GMT
To absolutely guarantee it only runs once, you should make reading/copying
the data from S3 into HDFS its own job by inserting a Pipeline.run() after
the call to cache() and before any subsequent processing on the data.
cache() will write the data locally, but if you have N processes that want
to do something with the data, there is no guarantee that the caching
finishes before those processes start trying to read the data unless you
put in a blocking call to run().
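
As a rough sketch of that shape (class name, S3/output paths, and the
trivial downstream write are just placeholders; readTextFile(), cache(),
run(), writeTextFile(), and done() are the actual Crunch calls):

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class CacheThenRun {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(CacheThenRun.class, new Configuration());

    // Expensive read from S3 (path is hypothetical).
    PCollection<String> input = pipeline.readTextFile("s3n://some-bucket/huge-dataset/");

    // Mark the collection for caching, then force a blocking run so the
    // S3 -> local copy completes before any downstream jobs kick off.
    PCollection<String> cached = input.cache();
    pipeline.run();

    // Downstream processing now works off the cached copy instead of
    // re-reading S3 (real DoFns omitted; this is just the pattern).
    pipeline.writeTextFile(cached, "/tmp/output");
    pipeline.done();
  }
}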

J

On Fri, Nov 13, 2015 at 7:34 AM, David Ortiz <dpo5003@gmail.com> wrote:

> Hey,
>
>      If I have an input data set that is super expensive to read (think
> hundreds of GB of data on S3, for example), would I be able to use cache()
> to make sure I only do the read once and then hand it out to the jobs that
> need it, as opposed to what Crunch does by default, which is to read it
> once for each parallel thread that needs the data?
>
> Thanks,
>      Dave
>
