hadoop-common-dev mailing list archives

From "Matthew Willson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-11678) AvroSerializer buffers output in violation of contract for Serializer
Date Thu, 05 Mar 2015 13:01:38 GMT
Matthew Willson created HADOOP-11678:
----------------------------------------

             Summary: AvroSerializer buffers output in violation of contract for Serializer
                 Key: HADOOP-11678
                 URL: https://issues.apache.org/jira/browse/HADOOP-11678
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 2.6.0
            Reporter: Matthew Willson


We've had issues with the deserializer running into EOFException when using Cascading's TupleSerialization
(which delegates to other Hadoop serializers to serialize the entries within its tuples) in combination
with AvroSerialization.

We eventually tracked it down to the fact that AvroSerialization#AvroSerializer buffers its
output (since it uses an Avro BinaryEncoder):

https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro/AvroSerialization.java#L105

The contract for Serializer explicitly states "Serializers ... must not buffer the output
since other producers may write to the output between calls to #serialize(Object)."

TupleSerialization does exactly that (write to the output between calls to #serialize), hence
our problem.

It's not sufficient just to flush the encoder on close; it needs to be flushed after every
write. Doing this fixes our issue.
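
The ordering problem can be reproduced without Avro. The sketch below is only an illustration,
not the Hadoop or Cascading code: BufferingSerializer and InterleavedWriteDemo are hypothetical
names, and a java.io.BufferedOutputStream stands in for the buffer inside Avro's BinaryEncoder.
A second producer writes a '|' separator directly to the shared stream between calls to
serialize, roughly as TupleSerialization does.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class InterleavedWriteDemo {

    // Hypothetical stand-in for a serializer that buffers its output.
    // The BufferedOutputStream plays the role of the encoder's internal buffer.
    static final class BufferingSerializer {
        private final OutputStream buffered;
        private final boolean flushAfterWrite;

        BufferingSerializer(OutputStream shared, boolean flushAfterWrite) {
            this.buffered = new BufferedOutputStream(shared);
            this.flushAfterWrite = flushAfterWrite;
        }

        void serialize(String record) throws IOException {
            buffered.write(record.getBytes(StandardCharsets.UTF_8));
            if (flushAfterWrite) {
                // Push the bytes to the shared stream before anyone else writes to it.
                buffered.flush();
            }
        }

        void close() throws IOException {
            // Closing flushes whatever is still sitting in the buffer.
            buffered.close();
        }
    }

    // Interleaves serializer output with direct writes from a second producer
    // and returns what ends up on the shared stream.
    static String run(boolean flushAfterWrite) throws IOException {
        ByteArrayOutputStream shared = new ByteArrayOutputStream();
        BufferingSerializer serializer = new BufferingSerializer(shared, flushAfterWrite);

        serializer.serialize("A");
        shared.write('|');          // second producer writes directly to the stream
        serializer.serialize("B");
        shared.write('|');
        serializer.close();

        return new String(shared.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println("flush on close only:    " + run(false));
        System.out.println("flush after each write: " + run(true));
    }
}
```

With a flush only on close, the separators reach the stream before the buffered records, so a
reader expecting record, separator, record, separator sees the data out of order. Flushing
after each write restores the expected order.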



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)