crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Planning Optimization for Sort
Date Tue, 13 Jan 2015 21:36:56 GMT
I counted two reads of the first job instead of three. Are you writing out
the "data" PCollection as part of the job as well?

I'm trying to think of how I would want to tell the planner that the s3
read is slow/expensive; maybe a flag (a bit) on Source that marks it as an
expensive source that should only ever be read once?
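
For anyone following along, here is a rough toy model of why the read count
changes. It is plain Java, not Crunch and not the Crunch planner; every name in
it (CountingSource, mapped, cached, twoPassSort, countReads) is made up for
illustration. It assumes only that a lazy stage re-scans its input once per
consumer and that the sort makes two passes over its input.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

// Toy model only: none of these types exist in Crunch.
class ReadCountSketch {

    /** Hypothetical stand-in for a slow source; counts how often it is scanned. */
    static final class CountingSource implements Supplier<List<String>> {
        final AtomicInteger reads = new AtomicInteger();
        private final List<String> rows;

        CountingSource(List<String> rows) {
            this.rows = rows;
        }

        @Override
        public List<String> get() {
            reads.incrementAndGet();          // one full scan of the source
            return new ArrayList<>(rows);
        }
    }

    /** Lazy map stage: every get() re-scans the upstream supplier. */
    static Supplier<List<String>> mapped(Supplier<List<String>> upstream,
                                         Function<String, String> fn) {
        return () -> {
            List<String> out = new ArrayList<>();
            for (String s : upstream.get()) {
                out.add(fn.apply(s));
            }
            return out;
        };
    }

    /** Materializes the upstream once, then serves the stored copy. */
    static Supplier<List<String>> cached(Supplier<List<String>> upstream) {
        List<String> stored = upstream.get();
        return () -> stored;
    }

    /** Two-pass sort: one pass standing in for the presort, one for the sort. */
    static List<String> twoPassSort(Supplier<List<String>> input) {
        int sampled = input.get().size();     // pass 1: presort stand-in
        List<String> sorted = new ArrayList<>(input.get()); // pass 2
        Collections.sort(sorted);
        return sorted;
    }

    /** Runs the toy pipeline and returns how many times the source was scanned. */
    static int countReads(boolean cacheFirst) {
        CountingSource source = new CountingSource(List.of("b", "c", "a"));
        Supplier<List<String>> data = mapped(source, String::toUpperCase);
        if (cacheFirst) {
            data = cached(data);
        }
        twoPassSort(data);
        return source.reads.get();
    }

    public static void main(String[] args) {
        System.out.println("uncached reads: " + countReads(false)); // prints 2
        System.out.println("cached reads: " + countReads(true));    // prints 1
    }
}
```

In this model the uncached run scans the source once per sort pass, and the
cached run scans it once in total. A flag on the source would, in effect, ask
the planner to insert the cached() step itself.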

On Tue, Jan 13, 2015 at 1:11 PM, Danny Morgan <unluckyboy@hotmail.com>
wrote:

> Hi Everyone,
>
> I have a crunch job that reads some data from s3 and applies a simple
> MapFn and then does a total order sort.
>
> PCollection<String> rawdata = readTextFile("s3n://data");
> PCollection<String> data = rawdata.parallelDo(new myMapFn());
> Sort.sort(data);
>
> I noticed that Sort from the sort library works in two phases, the first
> being the presort phase. When I execute this pipeline as is, the data is
> read and transformed three times: first to generate the PCollections, a
> second time for the presort phase, and a third time for the final sort.
>
> The snippet below ends up only reading the data from s3 once.
>
> PCollection<String> rawdata = readTextFile("s3n://data");
> PCollection<String> data = rawdata.parallelDo(new myMapFn());
> data.cache();
> pipeline.run();
> Sort.sort(data);
>
> Might be a crunch planner optimization opportunity?
>
> Thanks!
>
> Danny
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
