crunch-user mailing list archives

From David Ortiz <dor...@videologygroup.com>
Subject RE: MRPipeline.cache()
Date Fri, 13 Nov 2015 17:46:26 GMT
So I would want something like…

PCollection<String> rawInput = pipeline.read(From.textFile(s3Path));
rawInput.cache();
pipeline.run();

//Do other processing followed by another pipeline.run()
?

Thanks,
     Dave

From: Josh Wills [mailto:josh.wills@gmail.com]
Sent: Friday, November 13, 2015 12:44 PM
To: user@crunch.apache.org
Subject: Re: MRPipeline.cache()

To absolutely guarantee it only runs once, you should make reading/copying the data from S3
into HDFS its own job by inserting a Pipeline.run() after the call to cache() and before any
subsequent processing on the data. cache() will write the data locally, but if you have N
processes that want to do something with the data, there is no guarantee that the caching
happens before those processes start trying to read it unless you make a blocking call to run().
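
For reference, here is a minimal sketch of that pattern. The S3 path, the output path, and the
UpperCaseFn standing in for the downstream processing are hypothetical placeholders, not part of
the original question.

import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;
import org.apache.crunch.types.writable.Writables;

public class CacheOnceExample {

  // Hypothetical stand-in for whatever "other processing" runs on the cached data.
  static class UpperCaseFn extends MapFn<String, String> {
    @Override
    public String map(String line) {
      return line.toUpperCase();
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(CacheOnceExample.class);

    // Read the expensive S3 input once.
    PCollection<String> rawInput = pipeline.read(From.textFile("s3://bucket/input"));

    // Mark it for caching, then force a blocking run so the cached copy
    // is materialized before any downstream job tries to read it.
    rawInput.cache();
    pipeline.run();

    // Downstream processing now reads the cached copy instead of going back to S3.
    PCollection<String> processed = rawInput.parallelDo(new UpperCaseFn(), Writables.strings());
    processed.write(To.textFile("/tmp/crunch-output"));
    pipeline.run();

    pipeline.done();
  }
}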

J

On Fri, Nov 13, 2015 at 7:34 AM, David Ortiz <dpo5003@gmail.com> wrote:
Hey,

     If I have an input data set that is super expensive to read (think hundreds of GB of data
on S3, for example), would I be able to use cache() to make sure I only do the read once and then
hand the data out to the jobs that need it, as opposed to what Crunch does by default, which is
to read it once for each parallel thread that needs the data?

Thanks,
     Dave
