crunch-user mailing list archives

From Ben Juhn <benjij...@gmail.com>
Subject Re: Processing many map only collections in single pipeline with spark
Date Mon, 18 Jul 2016 18:04:52 GMT
It’s doing the same thing.  One job shows up in the Spark UI at a time.

Thanks,
Ben
> On Jul 16, 2016, at 7:29 PM, David Ortiz <dpo5003@gmail.com> wrote:
> 
> Hmm.  Just out of curiosity, what if you do Pipeline.read in place of readTextFile?
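> 
> For example, something roughly like this (untested sketch, using org.apache.crunch.io.From):
> 
>     PCollection<String> col = pipeline.read(From.textFile(path));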
> 
> 
> On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <benjijuhn@gmail.com> wrote:
> Nope, it queues up the jobs in series there too.
> 
>> On Jul 16, 2016, at 6:01 PM, David Ortiz <dpo5003@gmail.com> wrote:
>> 
>> *run in parallel
>> 
>> 
>> On Sat, Jul 16, 2016, 5:36 PM David Ortiz <dpo5003@gmail.com> wrote:
>> Just out of curiosity, if you use MRPipeline does it run in parallel?  If so, the issue may be in Spark, since I believe Crunch leaves it to Spark to handle the best method of execution.
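>> 
>> Roughly (sketch from memory; MyJob standing in for your driver class):
>> 
>>     Pipeline pipeline = new MRPipeline(MyJob.class, new Configuration());
>>     // instead of: new SparkPipeline(master, appName)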
>> 
>> 
>> On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <benjijuhn@gmail.com> wrote:
>> Hey David,
>> 
>> I have 100 active executors, each job typically only uses a few.  It’s running on YARN.
>> 
>> Thanks,
>> Ben
>> 
>>> On Jul 16, 2016, at 12:53 PM, David Ortiz <dpo5003@gmail.com> wrote:
>>> 
>>> What are the cluster resources available vs what a single map uses?
>>> 
>>> 
>>> On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <benjijuhn@gmail.com> wrote:
>>> I enabled FAIR scheduling hoping that would help, but only one job is showing up at a time.
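>>> 
>>> For reference, that was just the standard setting, roughly (from memory; the fair-scheduler.xml pool definitions omitted):
>>> 
>>>     SparkConf conf = new SparkConf().set("spark.scheduler.mode", "FAIR");
>>>     Pipeline pipeline = new SparkPipeline(new JavaSparkContext(conf), "my-app");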
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jul 15, 2016, at 8:17 PM, Ben Juhn <benjijuhn@gmail.com> wrote:
>>>> 
>>>> Each input is of a different format, and the DoFn implementation handles them depending on instantiation parameters.
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>>> On Jul 15, 2016, at 7:09 PM, Stephen Durfey <sjdurfey@gmail.com> wrote:
>>>>> 
>>>>> Instead of using readTextFile on the pipeline, try using the read method and use the TextFileSource, which can accept a collection of paths.
>>>>> 
>>>>> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
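>>>>> 
>>>>> Something along these lines (rough sketch, untested; see the linked source for the exact constructors):
>>>>> 
>>>>>     List<Path> inputs = new ArrayList<>();
>>>>>     for (String p : paths) {
>>>>>         inputs.add(new Path(p));
>>>>>     }
>>>>>     PCollection<String> all = pipeline.read(new TextFileSource<String>(inputs, Writables.strings()));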
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <benjijuhn@gmail.com> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I have a job configured the following way:
>>>>> 
>>>>> for (String path : paths) {
>>>>>     PCollection<String> col = pipeline.readTextFile(path);
>>>>>     col.parallelDo(new MyDoFn(path), Writables.strings())
>>>>>        .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
>>>>> }
>>>>> pipeline.done();
>>>>> 
>>>>> It results in one Spark job for each path, and the jobs run in sequence even though there are no dependencies.  Is it possible to have the jobs run in parallel?
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>> 
>>> 
>> 
> 

