hadoop-common-issues mailing list archives

From "Mikhail Bernadsky (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-10669) Avro serialization does not flush buffered serialized values causing data loss
Date Mon, 09 Jun 2014 00:14:01 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Bernadsky updated HADOOP-10669:
---------------------------------------

    Description: 
Found this debugging Nutch. 

MapTask serializes keys and values to the same stream, in pairs: 

keySerializer.serialize(key);
// ...
valSerializer.serialize(value);
// ...
bb.write(b0, 0, 0);

AvroSerializer does not flush its internal buffer after each serialization. So when it is used as valSerializer, the values are only partially written, or not written at all, to the output stream before the record is marked as complete (the last line above).

<EDIT> Added HADOOP-10669_alt.patch. This is a less intrusive fix, as it does not try to flush the MapTask stream. Instead, serialized values are written directly to the MapTask stream, avoiding the buffer on the Avro side.

  was:
Found this debugging Nutch. 

MapTask serializes keys and values to the same stream, in pairs: 

keySerializer.serialize(key);
// ...
valSerializer.serialize(value);
// ...
bb.write(b0, 0, 0);

AvroSerializer does not flush its internal buffer after each serialization. So when it is used as valSerializer, the values are only partially written, or not written at all, to the output stream before the record is marked as complete (the last line above).


> Avro serialization does not flush buffered serialized values causing data loss
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-10669
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10669
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 2.4.0
>            Reporter: Mikhail Bernadsky
>         Attachments: HADOOP-10669.patch, HADOOP-10669_alt.patch
>
>
> Found this debugging Nutch. 
> MapTask serializes keys and values to the same stream, in pairs: 
> keySerializer.serialize(key);
> // ...
> valSerializer.serialize(value);
> // ...
> bb.write(b0, 0, 0);
> AvroSerializer does not flush its internal buffer after each serialization. So when it is used as valSerializer, the values are only partially written, or not written at all, to the output stream before the record is marked as complete (the last line above).
> <EDIT> Added HADOOP-10669_alt.patch. This is a less intrusive fix, as it does not try to flush the MapTask stream. Instead, serialized values are written directly to the MapTask stream, avoiding the buffer on the Avro side.
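The idea behind the less intrusive fix can be sketched with plain java.io streams. This is illustrative only, under the assumption that the serializer holds a reference to the caller's stream; the class and method names are hypothetical, not the actual HADOOP-10669 patch code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch: a serializer that writes straight to the caller's
// stream, so no bytes can be stranded in a private intermediate buffer.
public class DirectWriteSerializer {
    private final OutputStream out;

    public DirectWriteSerializer(OutputStream out) {
        this.out = out;
    }

    public void serialize(byte[] record) throws IOException {
        out.write(record);  // every record reaches `out` immediately, no flush needed
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DirectWriteSerializer s = new DirectWriteSerializer(sink);
        s.serialize("value".getBytes(java.nio.charset.StandardCharsets.UTF_8));
        System.out.println("bytes in sink: " + sink.size());
    }
}
```

Because nothing is buffered on the serializer's side, the record is already fully in the caller's stream by the time the caller marks it complete.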



--
This message was sent by Atlassian JIRA
(v6.2#6252)
