crunch-user mailing list archives

From Ben Juhn <benjij...@gmail.com>
Subject Re: Processing many map only collections in single pipeline with spark
Date Mon, 18 Jul 2016 19:10:58 GMT
Thanks David,

I bumped crunch.max.running.jobs to 10 and am seeing job parallelism with MR.  I tried the same with Spark and am still only seeing one job show up at a time.
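
For reference, a minimal sketch of how I'm applying the setting (MyApp below is just a placeholder driver class):

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Let Crunch schedule up to 10 independent jobs concurrently.
    conf.setInt("crunch.max.running.jobs", 10);
    Pipeline pipeline = new MRPipeline(MyApp.class, conf);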

Thanks,
Ben

> On Jul 18, 2016, at 11:08 AM, David Ortiz <dortiz@videologygroup.com> wrote:
> 
> Sorry, I meant with MR.  It may be more helpful to try and fix the issue there, then see whether it carries over to Spark or not, since we are not sure if we expect that to work at all.
>  
> From: Ben Juhn [mailto:benjijuhn@gmail.com] 
> Sent: Monday, July 18, 2016 2:05 PM
> To: user@crunch.apache.org
> Subject: Re: Processing many map only collections in single pipeline with spark
>  
> It’s doing the same thing.  One job shows up in the Spark UI at a time.
>  
> Thanks,
> Ben
> On Jul 16, 2016, at 7:29 PM, David Ortiz <dpo5003@gmail.com> wrote:
>  
> Hmm.  Just out of curiosity, what if you do Pipeline.read in place of readTextFile?
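>  
> Something like this, e.g. (untested sketch; From.textFile is the standard helper):
>  
>     PCollection<String> col = pipeline.read(From.textFile(path));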
>  
> On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <benjijuhn@gmail.com> wrote:
> Nope, it queues up the jobs in series there too.
>  
> On Sat, Jul 16, 2016, 5:36 PM David Ortiz <dpo5003@gmail.com> wrote:
> Just out of curiosity, if you use MRPipeline does it run in parallel?  If so, the issue may be in Spark, since I believe Crunch leaves it to Spark to handle the best method of execution.
>  
> On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <benjijuhn@gmail.com> wrote:
> Hey David,
>  
> I have 100 active executors; each job typically only uses a few.  It’s running on YARN.
>  
> Thanks,
> Ben
>  
> On Jul 16, 2016, at 12:53 PM, David Ortiz <dpo5003@gmail.com> wrote:
>  
> What are the cluster resources available vs what a single map uses?
>  
> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <benjijuhn@gmail.com> wrote:
> I enabled FAIR scheduling hoping that would help, but only one job is showing up at a time.
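>  
> For reference, I set it roughly like this (note that FAIR mode only interleaves jobs that are actually submitted concurrently from separate threads, so it may not help if the pipeline submits them one at a time):
>  
>     SparkConf sconf = new SparkConf().set("spark.scheduler.mode", "FAIR");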
>  
> Thanks,
> Ben
>  
> On Jul 15, 2016, at 8:17 PM, Ben Juhn <benjijuhn@gmail.com> wrote:
>  
> Each input is of a different format, and the DoFn implementation handles them depending on instantiation parameters.
>  
> Thanks,
> Ben
>  
> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <sjdurfey@gmail.com> wrote:
>  
> Instead of using readTextFile on the pipeline, try using the read method with a TextFileSource, which can accept a collection of paths.
> 
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
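>  
> A rough sketch of that approach (untested; assumes TextFileSource's List<Path> constructor):
>  
>     List<Path> inputPaths = new ArrayList<>();
>     for (String p : paths) {
>         inputPaths.add(new Path(p));
>     }
>     PCollection<String> lines = pipeline.read(new TextFileSource<>(inputPaths, Writables.strings()));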
>  
> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <benjijuhn@gmail.com> wrote:
> 
> Hello,
>  
> I have a job configured the following way:
> for (String path : paths) {
>     PCollection<String> col = pipeline.readTextFile(path);
>     col.parallelDo(new MyDoFn(path), Writables.strings()).write(To.textFile("out/" + path), Target.WriteMode.APPEND);
> }
> pipeline.done();
> It results in one Spark job for each path, and the jobs run in sequence even though there are no dependencies.  Is it possible to have the jobs run in parallel?
> Thanks,
> Ben

