hadoop-common-issues mailing list archives

From "Matthew Willson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-11678) AvroSerializer buffers output in violation of contract for Serializer
Date Thu, 05 Mar 2015 15:37:38 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Willson updated HADOOP-11678:
-------------------------------------
    Description: 
We've had issues with the deserializer running into EOFException when using Cascading's TupleSerialization
(which delegates to other Hadoop serializers to serialize the entries within its tuples) in combination
with AvroSerialization.

We eventually tracked it down to AvroSerialization#AvroSerializer buffering its output
(it uses the buffering EncoderFactory#binaryEncoder rather than the non-buffering
EncoderFactory#directBinaryEncoder):

https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro/AvroSerialization.java#L105

The contract for Serializer explicitly states "Serializers ... must not buffer the output
since other producers may write to the output between calls to #serialize(Object)." TupleSerialization
does exactly that (write to the output between calls to #serialize), hence our problem.
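
To illustrate why buffering breaks this contract, here is a minimal sketch that uses only java.io
classes as a stand-in. It does not use Avro or Hadoop: BufferedOutputStream plays the role of a
buffering encoder, and the class and method names (BufferedWriteDemo, interleave) are made up for
this example. One writer emits 'A' and 'C' through its own wrapper while another producer writes
'B' directly to the shared stream in between.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class BufferedWriteDemo {
    // Hypothetical helper: returns the bytes that end up on the shared stream.
    public static String interleave(boolean buffered) throws IOException {
        ByteArrayOutputStream shared = new ByteArrayOutputStream();
        // Stand-in for the serializer's encoder: buffered or writing straight through.
        OutputStream serializerOut = buffered ? new BufferedOutputStream(shared) : shared;

        serializerOut.write('A'); // first "serialize" call
        shared.write('B');        // another producer writes between calls
        serializerOut.write('C'); // second "serialize" call
        serializerOut.flush();    // buffered bytes only reach the shared stream here

        return new String(shared.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        System.out.println("buffered:   " + interleave(true));
        System.out.println("unbuffered: " + interleave(false));
    }
}
```

With the buffered wrapper the other producer's byte lands ahead of the serializer's bytes, so a
reader expecting the original order sees the records out of sequence.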

There's a similar problem with the AvroDeserializer: it uses a buffering binaryDecoder, which
can consume the underlying InputStream beyond the end of the datum it is decoding. If a
different Deserializer is then used to read the next item, it starts at the wrong position
in the stream and misreads the data.
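
The read side can be sketched the same way, again with plain java.io classes standing in for the
Avro decoder (BufferedReadDemo and nextByteSeenByOtherReader are names invented for this example).
A buffered wrapper decodes a one-byte "datum", and then a different reader continues on the
shared stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadDemo {
    // Hypothetical helper: returns the next byte a second reader sees on the
    // shared stream, or -1 if the stream has already been drained.
    public static int nextByteSeenByOtherReader(boolean buffered) throws IOException {
        InputStream shared = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
        // Stand-in for the deserializer's decoder: buffered or reading straight through.
        InputStream deserializerIn = buffered ? new BufferedInputStream(shared) : shared;

        deserializerIn.read(); // decode a one-byte "datum"
        return shared.read();  // a different reader picks up from the shared stream
    }

    public static void main(String[] args) throws IOException {
        System.out.println("buffered:   " + nextByteSeenByOtherReader(true));
        System.out.println("unbuffered: " + nextByteSeenByOtherReader(false));
    }
}
```

In the buffered case the wrapper has already pulled all remaining bytes into its own buffer, so
the second reader hits end of stream, which is consistent with the EOFException described above.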

Switching AvroSerializer and AvroDeserializer to use the non-buffering `EncoderFactory#directBinaryEncoder`
and `DecoderFactory#directBinaryDecoder` fixes the issue for us.



  was:
We've had issues with the deserializer running into EOFException when using Cascading's TupleSerialization
(which delegates to other hadoop serializers to serialize entries within its tuples) in combination
with AvroSerialization.

Eventually tracked it down to the fact that AvroSerialization#AvroSerializer is buffering
output (since it uses a buffering BinaryEncoder rather than a non-buffering directBinaryEncoder):

https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro/AvroSerialization.java#L105

The contract for Serializer explicitly states "Serializers ... must not buffer the output
since other producers may write to the output between calls to #serialize(Object)."

TupleSerialization does exactly that (write to the output between calls to #serialize), hence
our problem.

Switching it to use the non-buffering `EncoderFactory#directBinaryEncoder` and `DecoderFactory#directBinaryDecoder`
fixes the issue for us.


> AvroSerializer buffers output in violation of contract for Serializer
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-11678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11678
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Matthew Willson



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
