avro-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1393) SyncInterval logic always causes blocks to be larger than the sync interval
Date Tue, 05 Nov 2013 23:06:17 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814374#comment-13814374 ]

Doug Cutting commented on AVRO-1393:

The sync interval is meant to be a hint, not an upper or lower limit.  It should generally
be much smaller than the HDFS block size.  Rather it should be just big enough that compression
algorithms are effective, that per-buffer overheads are not significant, etc.  64kB is the
default.  It is expected that a typical map task will need to skip half an Avro block at the
beginning of each HDFS block and read half an Avro block past the end of the HDFS block. 
With Avro blocks around 64kB and HDFS blocks around 64MB, this overhead should not be significant.
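
As a rough check of the arithmetic above, the following sketch computes the expected extra I/O per map task, assuming exactly the figures quoted in the comment (64kB Avro blocks, 64MB HDFS blocks) and the "half a block at each end" expectation. The function and constant names are illustrative only and are not part of Avro or Hadoop.

```python
# Back-of-the-envelope estimate using the figures quoted in the comment.
AVRO_BLOCK_BYTES = 64 * 1024          # default sync interval (64kB)
HDFS_BLOCK_BYTES = 64 * 1024 * 1024   # HDFS block size assumed in the comment (64MB)


def expected_overhead_fraction(avro_block: int, hdfs_block: int) -> float:
    """Extra bytes touched per HDFS block, as a fraction of that block.

    Assumes half an Avro block is skipped at the start of the split and
    half an Avro block is read past its end, as described in the comment.
    """
    extra_bytes = avro_block / 2 + avro_block / 2
    return extra_bytes / hdfs_block


fraction = expected_overhead_fraction(AVRO_BLOCK_BYTES, HDFS_BLOCK_BYTES)
print(f"{fraction:.4%}")  # prints 0.0977%
```

Under these assumptions the overhead is about one Avro block per HDFS block, roughly a tenth of a percent.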

> SyncInterval logic always causes blocks to be larger than the sync interval
> ---------------------------------------------------------------------------
>                 Key: AVRO-1393
>                 URL: https://issues.apache.org/jira/browse/AVRO-1393
>             Project: Avro
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
> If sync interval in the container file is set to be exactly block size, then the sync
> marker will be slightly larger than the block as we check the size of the file only after
> writing data to the stream. This means that sync interval is essentially the smallest interval
> between sync markers.
> Since we cannot predict the serialized size of the datum, we can never know how much
> data will overflow the block. Whatever the case, this might be more expensive than expected
> especially on systems like HDFS.
> Fixing this is difficult without breaking a bunch of interfaces, so opening this jira
> for discussion with people with more knowledge of the code.
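
The behaviour described in the issue can be illustrated with a simplified model. This is a sketch only, not Avro's actual writer code: `block_sizes` is a hypothetical helper that assumes the writer appends each serialized datum to a buffer and compares the buffered size to the sync interval only afterwards.

```python
def block_sizes(record_sizes, sync_interval):
    """Model a writer that checks the buffered size only after appending.

    Returns the size in bytes of each block that would be written.
    """
    blocks = []
    buffered = 0
    for size in record_sizes:
        buffered += size                # the datum is serialized first...
        if buffered >= sync_interval:   # ...and only then compared to the interval
            blocks.append(buffered)     # flush the block, then emit a sync marker
            buffered = 0
    if buffered:
        blocks.append(buffered)         # final partial block when the file is closed
    return blocks


sizes = block_sizes([300] * 10, sync_interval=1000)
print(sizes)  # prints [1200, 1200, 600]
```

In this model every block except the last is at least as large as the sync interval, which matches the report that the interval acts as a minimum distance between sync markers.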

This message was sent by Atlassian JIRA
