crunch-user mailing list archives

From Danny Morgan <>
Subject RE: Multiple Reduces in a Single Crunch Job
Date Wed, 26 Nov 2014 00:24:03 GMT
Hello Again Josh,
The link to the Jira issue you sent out seems to be cut off; could you please resend it?
I deleted the line where I write the collection to a text file and retried, but that didn't
work either. I also tried writing the collection out as Avro instead of Parquet, but got the
same error.
Here's the rest of the stack trace:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs:///tmp/crunch-2008950085/p1
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(
    at org.apache.hadoop.mapreduce.Job$
    at org.apache.hadoop.mapreduce.Job$
    at Method)
    at
    at org.apache.hadoop.mapreduce.Job.submit(
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(
    at$000(
    at$
Thanks Josh!
Date: Tue, 25 Nov 2014 16:10:33 -0800
Subject: Re: Multiple Reduces in a Single Crunch Job

Hey Danny,
I'm wondering if this is caused by I think
we use different output committers for text files vs. Parquet files, so at least one of the
outputs won't be written properly. Does that make sense?
On Tue, Nov 25, 2014 at 4:07 PM, Danny Morgan <> wrote:

Hi Crunchers,
I've attached a pdf of what my plan looks like. I've run into this problem before where I
have multiple reduce steps chained together in a single pipeline and always get the same error.
In the case of the attached pdf the error is "org.apache.hadoop.mapreduce.lib.input.InvalidInputException:
Input path does not exist: hdfs:///tmp/crunch-1279941375/p1"
That's the temp directory the crunch planner set up for the first reduce phase.
Can I run multiple chained reduces within the same pipeline? Do I have to manually write out
the output from the first reduce?
Here's what the code looks like:

    // Simple mapper
    PTable<String, Pair<Long, Log>> first = Danny.filterForDanny(logs);
    // Secondary sort happens here
    PTable<Danny, Long> second = Danny.extractDannys(first);
    // Regular group-by
    PTable<Danny, Long> third = second.groupByKey().combineValues(Aggregators.SUM_LONGS());
    // Simple function that populates some fields in the Danny object with the aggregate results
    PCollection<Pair<Danny, String>> done = Danny.finalize(third);
    Pair<PCollection<Danny>, PCollection<String>> splits = Channels.split(done);
    splits.second().write(To.textFile(mypath), WriteMode.OVERWRITE);
    Target pq_danny = new AvroParquetFileTarget(pqPath);
    splits.first().write(pq_danny, WriteMode.OVERWRITE);
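If the mixed output committers are indeed the culprit, one diagnostic worth trying is forcing each write into its own pipeline execution, so the text-file job and the Parquet job never share a single MapReduce run. A rough sketch, not a verified fix; it assumes a `Pipeline` handle named `pipeline` is in scope, and reuses the placeholder names (`splits`, `mypath`, `pqPath`) from the snippet above:

```java
// Hypothetical workaround sketch: run the pipeline after the first write so
// the intermediate results are fully committed before the second write is
// planned, rather than letting both targets share one job's committers.
splits.second().write(To.textFile(mypath), Target.WriteMode.OVERWRITE);;  // launch and wait for the text-file job

splits.first().write(new AvroParquetFileTarget(pqPath), Target.WriteMode.OVERWRITE);
pipeline.done();  // run the Parquet job and clean up temp dirs
```

If this succeeds where the single-run plan fails, that would support the committer theory.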

Director of Data Science
Twitter: @josh_wills