avro-dev mailing list archives

From "Nandor Kollar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AVRO-2109) Reset buffers in case of IOException
Date Mon, 04 Dec 2017 14:21:01 GMT

     [ https://issues.apache.org/jira/browse/AVRO-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nandor Kollar updated AVRO-2109:
--------------------------------
    Description: 
If an {{IOException}} is thrown from {{DataFileWriter.writeBlock}}, the {{buffer}}
and {{blockCount}} are not reset, so duplicated data is written out on {{close}}/{{flush}}.
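
To make the failure mode concrete, here is a minimal, self-contained model of the pattern. It is not the actual Avro source; the field and method names only mirror {{DataFileWriter}} for readability, and {{writeBlockWithReset}} sketches the reset-on-failure behaviour this issue proposes:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Simplified stand-in for DataFileWriter's block buffering, for
// illustration only.
class BlockBufferModel {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final OutputStream out;
    private long blockCount;

    BlockBufferModel(OutputStream out) { this.out = out; }

    void append(byte[] datum) throws IOException {
        buffer.write(datum);
        blockCount++;
    }

    // Current behaviour: if out.write() throws, buffer and blockCount keep
    // their values, so the next flush()/close() replays the same block.
    void writeBlock() throws IOException {
        if (blockCount > 0) {
            out.write(buffer.toByteArray()); // may throw, e.g. on interrupt
            buffer.reset();                  // never reached on failure
            blockCount = 0;
        }
    }

    // Proposed behaviour: reset the state even when the write fails, so a
    // later close()/flush() cannot write the same block twice.
    void writeBlockWithReset() throws IOException {
        if (blockCount > 0) {
            try {
                out.write(buffer.toByteArray());
            } finally {
                buffer.reset();
                blockCount = 0;
            }
        }
    }

    void close() throws IOException {
        writeBlock(); // any still-pending block is written again here
        out.close();
    }
}
{code}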

This is actually a conceptual question: should we reset the buffer in case of an
exception or not? If an exception occurs while writing the file, we should expect the
file to be corrupt, so the possible duplication of data should not matter.
On the other hand, if the file is already corrupt, why would we try to write anything
again at file close?

This issue comes from a Flume issue where the HDFS wait thread is interrupted because of a
timeout while writing an Avro file. The block itself has already been written properly, but
because of the {{IOException}} caused by the thread interrupt we invoke {{close()}} on the
writer, which writes the block again together with some other data (possibly a duplicated
sync marker) that corrupts the file.
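
As an illustration of the calling pattern (a sketch only: the schema, record contents, and file name are placeholders, and {{sync()}} merely stands in for whichever write the timeout interrupts), the sequence looks roughly like this with the public API:

{code:java}
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class InterruptedCloseExample {
    public static void main(String[] args) throws IOException {
        // Placeholder schema, not taken from the Flume report.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
                + "[{\"name\":\"msg\",\"type\":\"string\"}]}");
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("events.avro"));
        try {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("msg", "hello");
            writer.append(rec);
            writer.sync(); // suppose the interrupt raises IOException here,
                           // after the block already reached the stream
        } finally {
            // close() flushes the still-buffered block a second time,
            // which is what corrupts the file.
            writer.close();
        }
    }
}
{code}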

[~busbey], [~nkollar], [~zi], any thoughts?



> Reset buffers in case of IOException
> ------------------------------------
>
>                 Key: AVRO-2109
>                 URL: https://issues.apache.org/jira/browse/AVRO-2109
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.8.2
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>



