avro-dev mailing list archives

From "Harsh J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1215) AvroMultipleOutputs not working when specifying baseOutputPath
Date Sat, 08 Dec 2012 00:41:22 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526932#comment-13526932 ]

Harsh J commented on AVRO-1215:
-------------------------------

Since we're providing a custom implementation of MultipleOutputs here, we need not be
overly concerned about extending its API.

For instance, although the Hadoop MR MultipleOutputs has these write(K, V, P) APIs, it has no
notion of schemas. The Avro MR MultipleOutputs can provide, on top of these default-pulling
APIs, a schema-providing API such as write(K, V, Schema, P). I see no harm in doing that, since
Avro is entirely schema-dependent and this may be more useful than relying automatically on
the default job output schema.
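
As a rough illustration of the overload pattern suggested above, here is a minimal sketch. It is not the real AvroMultipleOutputs: the class name SchemaWriteSketch, the log field, and the use of plain strings in place of org.apache.avro.Schema are all assumptions made so the example runs without Avro or Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not the real AvroMultipleOutputs: schemas are plain
// strings here so the sketch runs without Avro or Hadoop on the classpath.
class SchemaWriteSketch<K, V> {
    private final String jobOutputSchema;        // default taken from the job
    final List<String> log = new ArrayList<>();  // records what was "written"

    SchemaWriteSketch(String jobOutputSchema) {
        this.jobOutputSchema = jobOutputSchema;
    }

    // Existing style, write(K, V, P): falls back to the job's output schema.
    void write(K key, V value, String baseOutputPath) {
        write(key, value, jobOutputSchema, baseOutputPath);
    }

    // Proposed style, write(K, V, Schema, P): the caller names the schema.
    void write(K key, V value, String schema, String baseOutputPath) {
        log.add(baseOutputPath + " [" + schema + "] " + key + "=" + value);
    }

    public static void main(String[] args) {
        SchemaWriteSketch<String, Integer> out = new SchemaWriteSketch<>("DefaultRecord");
        out.write("a", 1, "part/default");                // uses the job default
        out.write("b", 2, "CustomRecord", "part/custom"); // explicit schema
        out.log.forEach(System.out::println);
    }
}
```

The three-argument form simply delegates to the four-argument form, so existing callers would keep their behaviour while new callers can pass a schema per write.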
                
> AvroMultipleOutputs not working when specifying baseOutputPath
> --------------------------------------------------------------
>
>                 Key: AVRO-1215
>                 URL: https://issues.apache.org/jira/browse/AVRO-1215
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.2
>            Reporter: Matthew Hayes
>            Assignee: Ashish Nagavaram
>              Labels: avro, mapreduce
>         Attachments: avro-1215.patch
>
>
> I'm calling the write() method of AvroMultipleOutputs which takes the baseOutputPath. The reducer appears to begin hanging once it tries writing to a baseOutputPath value not already encountered. It then fails with:
> org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file ... because current leaseholder is trying to recreate file.
> I think the problem has to do with this line in AvroMultipleOutputs:
> {code}
> // get the record writer from context output format
> //FileOutputFormat.setOutputName(taskContext, baseFileName);
> {code}
> This line is not commented out in the similar code from Hadoop. So I think the baseOutputPath is ignored. As a result, when each record writer is created it uses the same path, leading to the exception.
> Uncommenting this line does not work because of the visibility of the method. However, what this method does is set "mapreduce.output.basename". But setting this doesn't work either.
> After digging through the Avro code I found that AvroOutputFormatBase is using "avro.mo.config.namedOutput" to create the path. If I replace the commented-out line with this, it seems to work:
> {code}
> taskContext.getConfiguration().set("avro.mo.config.namedOutput", baseFileName);  
> {code}
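
The failure mode in the quoted report can be modelled with a toy example. This is only a sketch under stated assumptions: a Map stands in for the Hadoop Configuration, a Set stands in for files currently open for writing, and the class name BasePathSketch, the default name "part", and the "-r-00000.avro" suffix are hypothetical. Only the property key "avro.mo.config.namedOutput" comes from the report.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: a Map stands in for the Hadoop Configuration and a Set for
// the files the file system currently holds open for writing.
class BasePathSketch {
    static final String NAMED_OUTPUT_KEY = "avro.mo.config.namedOutput"; // key quoted in the report
    private final Map<String, String> conf = new HashMap<>();
    private final Set<String> openFiles = new HashSet<>();
    private final Map<String, String> writers = new HashMap<>(); // baseOutputPath -> file
    private final boolean setNamedOutput;

    BasePathSketch(boolean setNamedOutput) {
        this.setNamedOutput = setNamedOutput;
        conf.put(NAMED_OUTPUT_KEY, "part"); // hypothetical default name
    }

    // Returns the file that a record for baseOutputPath would be written to.
    String write(String baseOutputPath) {
        String existing = writers.get(baseOutputPath);
        if (existing != null) return existing; // writer already cached for this path
        if (setNamedOutput) conf.put(NAMED_OUTPUT_KEY, baseOutputPath); // the reported workaround
        String file = conf.get(NAMED_OUTPUT_KEY) + "-r-00000.avro";    // hypothetical file naming
        if (!openFiles.add(file)) {
            // Models the "already being created" failure for a file that is still open.
            throw new IllegalStateException("already being created: " + file);
        }
        writers.put(baseOutputPath, file);
        return file;
    }

    public static void main(String[] args) {
        BasePathSketch fixed = new BasePathSketch(true);
        System.out.println(fixed.write("typeA"));
        System.out.println(fixed.write("typeB"));
    }
}
```

With setNamedOutput set to false, the second distinct baseOutputPath resolves to the same file as the first and the model throws, which mirrors the reported symptom. With it set to true, each baseOutputPath gets its own file.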

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
