avro-dev mailing list archives

From "Sean Busbey (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AVRO-1976) Add Input/OutputFormat to read/write encoded objects
Date Fri, 18 Aug 2017 13:01:00 GMT

     [ https://issues.apache.org/jira/browse/AVRO-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated AVRO-1976:
    Labels: beginner  (was: newbie)

> Add Input/OutputFormat to read/write encoded objects
> ----------------------------------------------------
>                 Key: AVRO-1976
>                 URL: https://issues.apache.org/jira/browse/AVRO-1976
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>         Environment: hadoop
>            Reporter: Marius Posta
>            Assignee: Marius Posta
>            Priority: Minor
>              Labels: beginner
>   Original Estimate: 1h
>  Remaining Estimate: 1h
> In certain cases, the performance of some Avro map-reduce jobs can be considerably improved
> by decoupling Avro encoding from the actual Avro container file IO.
> In my case, a complex schema (100+ record fields) and large HDFS blocks resulted in Spark
> jobs where many workers were idling while a couple of them were busy decoding their input
> splits. Furthermore, the objects then needed to be re-encoded in order to be shuffled,
> which crippled performance further.
> I propose the addition of an InputFormat which reads a container file input split as
> key-value pairs in which the key is the file header and the value is the decompressed file
> data block, and an OutputFormat which follows the same logic for writing.
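
The key-value shape proposed above might be sketched roughly as follows. This is a toy
model under stated assumptions, not the patch attached to the issue: it uses only the JDK
and no Avro or Hadoop classes. The names EncodedBlockSketch, Header, readSplit and
writeBlocks are hypothetical, and the "stored blocks" are plain zlib-compressed byte
arrays rather than the real Avro container file layout. The point it illustrates is that
the reader hands out (header, decompressed block) pairs and leaves the records inside
each block encoded, so decoding can happen later, on whichever worker needs the objects.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy model of the proposal; NOT the Avro container format or any Avro API. */
public class EncodedBlockSketch {

    /** Hypothetical stand-in for a container file header (schema and codec name). */
    public static final class Header {
        public final String schemaJson;
        public final String codec;

        public Header(String schemaJson, String codec) {
            this.schemaJson = schemaJson;
            this.codec = codec;
        }
    }

    /** Compress one block payload (zlib framing, chosen only for this sketch). */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompress one stored block; the bytes inside are not decoded into records. */
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Read side: key = file header, value = decompressed, still-encoded block. */
    public static List<Map.Entry<Header, byte[]>> readSplit(Header header, List<byte[]> storedBlocks) {
        List<Map.Entry<Header, byte[]>> pairs = new ArrayList<>();
        for (byte[] stored : storedBlocks) {
            pairs.add(new SimpleImmutableEntry<>(header, inflate(stored)));
        }
        return pairs;
    }

    /** Write side: take already-encoded blocks and compress them for storage. */
    public static List<byte[]> writeBlocks(List<byte[]> encodedBlocks) {
        List<byte[]> stored = new ArrayList<>();
        for (byte[] block : encodedBlocks) {
            stored.add(deflate(block));
        }
        return stored;
    }

    public static void main(String[] args) {
        Header header = new Header("{\"type\":\"string\"}", "deflate");
        List<byte[]> encoded = new ArrayList<>();
        encoded.add("block-one".getBytes(StandardCharsets.UTF_8));
        encoded.add("block-two".getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<Header, byte[]> kv : readSplit(header, writeBlocks(encoded))) {
            System.out.println(kv.getKey().codec + " -> " + kv.getValue().length + " encoded bytes");
        }
    }
}
```

In a real job the pairs would be produced by a Hadoop record reader and the value bytes
decoded with the schema carried in the header; both steps are left out here on purpose.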

This message was sent by Atlassian JIRA
