hadoop-common-issues mailing list archives

From "Matthew Willson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11678) AvroSerializer buffers output in violation of contract for Serializer
Date Thu, 05 Mar 2015 15:57:42 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348978#comment-14348978 ]

Matthew Willson commented on HADOOP-11678:

Also note that this is not the same as the Avro serialization in the separate avro-mapred
library (org.apache.avro.mapred.AvroSerialization), which is implemented separately for AvroWrappers
and is probably more commonly used.

In fact I'm not entirely clear why this code (org.apache.hadoop.io.serializer.avro.AvroSerialization)
is in the hadoop project / why hadoop-common needs to have an avro dependency at all, given
that there's a separate artifact in the avro project (avro-mapred) to get avro working in
hadoop. Perhaps avro is used internally somewhere?

> AvroSerializer buffers output in violation of contract for Serializer
> ---------------------------------------------------------------------
>                 Key: HADOOP-11678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11678
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Matthew Willson
> We've had issues with the deserializer running into EOFException when using Cascading's
TupleSerialization (which delegates to other hadoop serializers to serialize entries within
its tuples) in combination with AvroSerialization.
> Eventually tracked it down to the fact that AvroSerialization#AvroSerializer is buffering
output (since it uses a buffering EncoderFactory#binaryEncoder rather than a non-buffering
EncoderFactory#directBinaryEncoder):
> https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro/AvroSerialization.java#L105
> The contract for Serializer explicitly states "Serializers ... must not buffer the output
since other producers may write to the output between calls to #serialize(Object)." TupleSerialization
does exactly that (write to the output between calls to #serialize), hence our problem.
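
To illustrate why buffering breaks that contract, here is a minimal sketch using only the JDK. It is not the Hadoop or Avro code: BufferedOutputStream stands in for a buffering encoder, and the class name BufferedSerializerDemo and the method interleave are hypothetical names chosen for this example. One producer writes through a (possibly buffered) wrapper while another writes straight to the shared stream between the two calls.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; BufferedOutputStream is a stand-in for a buffering encoder.
class BufferedSerializerDemo {

    /** Writes 'A' via the serializer, '|' directly to the shared stream, then 'B' via the serializer. */
    static String interleave(boolean buffered) {
        ByteArrayOutputStream shared = new ByteArrayOutputStream();
        // The "serializer" either buffers in front of the shared stream or writes straight through.
        OutputStream serializerOut = buffered ? new BufferedOutputStream(shared) : shared;
        try {
            serializerOut.write('A'); // first serialize(...) call
            shared.write('|');        // another producer writes between the calls
            serializerOut.write('B'); // second serialize(...) call
            serializerOut.flush();    // buffered bytes only reach the shared stream here
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(shared.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println("unbuffered: " + interleave(false)); // A|B
        System.out.println("buffered:   " + interleave(true));  // |AB
    }
}
```

With the buffered wrapper the other producer's byte lands ahead of both serialized bytes, so a reader expecting the original order sees a corrupted stream.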
> There's a similar problem with the AvroDeserializer too -- it uses a buffering binaryDecoder,
and this can consume the underlying InputStream beyond the end of the datum it's decoding,
meaning that if a different Deserializer is used to read the next item, it'll start off in
the wrong place and get confused.
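
The read side can be sketched the same way, again with JDK classes only and hypothetical names (BufferedDeserializerDemo, nextByteSeenByOtherReader). BufferedInputStream stands in for a buffering decoder: it reads one datum for its caller but pulls everything available from the underlying stream, so a second reader on that stream finds nothing left.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class; BufferedInputStream is a stand-in for a buffering decoder.
class BufferedDeserializerDemo {

    /**
     * The first reader consumes one byte ('A'). Returns the next byte that a
     * second reader sees when reading the shared stream directly (-1 means EOF).
     */
    static int nextByteSeenByOtherReader(boolean buffered) {
        InputStream shared = new ByteArrayInputStream(new byte[] {'A', 'B'});
        InputStream deserializerIn = buffered ? new BufferedInputStream(shared) : shared;
        try {
            deserializerIn.read(); // decode the first datum
            return shared.read();  // a different deserializer reads the next item
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("unbuffered next byte: " + nextByteSeenByOtherReader(false)); // 66 ('B')
        System.out.println("buffered next byte:   " + nextByteSeenByOtherReader(true));  // -1 (EOF)
    }
}
```

The premature end-of-stream in the buffered case is the same kind of symptom as the EOFException described above.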
> Switching AvroSerializer and AvroDeserializer to use the non-buffering `EncoderFactory#directBinaryEncoder`
and `DecoderFactory#directBinaryDecoder` fixes the issue for us.

This message was sent by Atlassian JIRA
