crunch-user mailing list archives

From: David Ortiz <dor...@videologygroup.com>
Subject: RE: Processing many map only collections in single pipeline with spark
Date: Mon, 18 Jul 2016 18:08:06 GMT
Sorry.  Meant with MR.  It may be more helpful to try to fix the issue there, then see whether
it carries over to Spark, since we are not sure we expect that to work at all.

From: Ben Juhn [mailto:benjijuhn@gmail.com]
Sent: Monday, July 18, 2016 2:05 PM
To: user@crunch.apache.org
Subject: Re: Processing many map only collections in single pipeline with spark

It’s doing the same thing.  One job shows up in the Spark UI at a time.

Thanks,
Ben
On Jul 16, 2016, at 7:29 PM, David Ortiz <dpo5003@gmail.com> wrote:

Hmm.  Just out of curiosity, what if you do Pipeline.read in place of readTextFile?
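
A minimal sketch of that suggestion, assuming Crunch's From factory class in
org.apache.crunch.io (the helper name readViaSource is illustrative):

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.io.From;

// Read a single path through Pipeline.read with an explicit source,
// instead of the readTextFile convenience method.
static PCollection<String> readViaSource(Pipeline pipeline, String path) {
    return pipeline.read(From.textFile(path));
}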

On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <benjijuhn@gmail.com> wrote:
Nope, it queues up the jobs in series there too.

On Sat, Jul 16, 2016, 5:36 PM David Ortiz <dpo5003@gmail.com> wrote:
Just out of curiosity, if you use MRPipeline does it run in parallel?  If so, the issue may be
in Spark, since I believe Crunch leaves it to Spark to handle the best method of execution.
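
A minimal sketch of that experiment, swapping in MRPipeline; Driver is a placeholder name for
the job's main class:

import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class Driver {
    public static void main(String[] args) {
        // Same loop as the original job, but planned and executed on
        // MapReduce instead of Spark.
        Pipeline pipeline = new MRPipeline(Driver.class, new Configuration());
        // ... readTextFile / parallelDo / write loop unchanged ...
        pipeline.done();
    }
}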

On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <benjijuhn@gmail.com> wrote:
Hey David,

I have 100 active executors; each job typically uses only a few.  It’s running on YARN.

Thanks,
Ben

On Jul 16, 2016, at 12:53 PM, David Ortiz <dpo5003@gmail.com> wrote:

What are the cluster resources available vs what a single map uses?

On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <benjijuhn@gmail.com> wrote:
I enabled FAIR scheduling hoping that would help, but only one job shows up at a time.

Thanks,
Ben
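
For reference, FAIR scheduling is set on the SparkConf, but in Spark it only interleaves jobs
that are submitted concurrently from separate threads; a driver that submits jobs one after
another still runs them in sequence.  A minimal sketch:

import org.apache.spark.SparkConf;

// Enable the FAIR scheduler. This alone does not parallelize anything:
// jobs must also be submitted from separate threads on the driver.
SparkConf conf = new SparkConf().set("spark.scheduler.mode", "FAIR");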

On Jul 15, 2016, at 8:17 PM, Ben Juhn <benjijuhn@gmail.com> wrote:

Each input has a different format, and the DoFn implementation handles each one according to
its instantiation parameters.

Thanks,
Ben

On Jul 15, 2016, at 7:09 PM, Stephen Durfey <sjdurfey@gmail.com> wrote:

Instead of using readTextFile on the pipeline, try using the read method with a TextFileSource,
which can accept a collection of paths.

https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
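
A hedged sketch of that approach, assuming the List<Path> constructor in the linked class;
note that it reads every path into one PCollection, so the per-path MyDoFn parameters from the
original loop would no longer apply:

import java.util.ArrayList;
import java.util.List;
import org.apache.crunch.PCollection;
import org.apache.crunch.io.text.TextFileSource;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.fs.Path;

// Collect all input paths into a single source so the planner sees one
// read rather than one job per path.
List<Path> inputPaths = new ArrayList<>();
for (String p : paths) {
    inputPaths.add(new Path(p));
}
PCollection<String> all = pipeline.read(
    new TextFileSource<String>(inputPaths, Writables.strings()));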



On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <benjijuhn@gmail.com> wrote:
Hello,

I have a job configured the following way:

for (String path : paths) {
    PCollection<String> col = pipeline.readTextFile(path);
    col.parallelDo(new MyDoFn(path), Writables.strings())
       .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
}
pipeline.done();

It results in one Spark job for each path, and the jobs run in sequence even though there
are no dependencies between them.  Is it possible to have the jobs run in parallel?

Thanks,

Ben