crunch-user mailing list archives

From Stephen Durfey <sjdur...@gmail.com>
Subject Re: Processing many map only collections in single pipeline with spark
Date Sat, 16 Jul 2016 02:09:32 GMT
Instead of calling readTextFile on the pipeline, try the read method with a TextFileSource,
which can accept a collection of paths.

https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
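
For example, something like this (a minimal, untested sketch; it assumes the
TextFileSource(List<Path>, PType<T>) constructor from the linked source, a
crunch-spark SparkPipeline, and hypothetical input/output paths):

import java.util.Arrays;
import java.util.List;

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.Target;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.io.To;
import org.apache.crunch.io.text.TextFileSource;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.fs.Path;

public class MultiPathRead {
  public static void main(String[] args) {
    Pipeline pipeline = new SparkPipeline("local", "multi-path-read");

    // Gather every input path up front instead of calling
    // pipeline.readTextFile(path) once per path.
    List<Path> inputs = Arrays.asList(
        new Path("in/a.txt"),   // hypothetical input paths
        new Path("in/b.txt"));

    // A single TextFileSource over the whole list means one read()
    // instead of a chain of independent per-path reads.
    PCollection<String> lines =
        pipeline.read(new TextFileSource<>(inputs, Writables.strings()));

    lines.write(To.textFile("out"), Target.WriteMode.APPEND);
    pipeline.done();
  }
}

One trade-off: with a combined source, a DoFn no longer knows which path a
record came from, so per-path logic like your MyDoFn(path) would need to be
handled some other way.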

On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <benjijuhn@gmail.com> wrote:

Hello,

I have a job configured the following way:

for (String path : paths) {
    PCollection<String> col = pipeline.readTextFile(path);
    col.parallelDo(new MyDoFn(path), Writables.strings())
       .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
}
pipeline.done();

It results in one Spark job for each path, and the jobs run in sequence even
though there are no dependencies.  Is it possible to have the jobs run in parallel?

Thanks,
Ben