avro-dev mailing list archives

From "Sean Busbey (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AVRO-1976) Add Input/OutputFormat to read/write encoded objects
Date Fri, 18 Aug 2017 13:01:00 GMT

     [ https://issues.apache.org/jira/browse/AVRO-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated AVRO-1976:
------------------------------
    Labels: beginner  (was: newbie)

> Add Input/OutputFormat to read/write encoded objects
> ----------------------------------------------------
>
>                 Key: AVRO-1976
>                 URL: https://issues.apache.org/jira/browse/AVRO-1976
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>         Environment: hadoop
>            Reporter: Marius Posta
>            Assignee: Marius Posta
>            Priority: Minor
>              Labels: beginner
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In certain cases, the performance of some Avro map-reduce jobs can be considerably improved
> by decoupling Avro encoding from the actual Avro container file IO.
> In my case, a complex schema (100+ record fields) and large HDFS blocks resulted in Spark
> jobs where many workers sat idle while a couple of them were busy decoding their input
> splits. Furthermore, the objects then needed to be re-encoded in order to be shuffled,
> which crippled performance further.
> I propose adding an InputFormat which reads a container file input split as
> key-value pairs in which the key is the file header and the value is the decompressed file
> data block, and an OutputFormat which follows the same logic for writing.
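
The idea in the description can be illustrated with a minimal, stdlib-only Java sketch. It does not use the Avro or Hadoop APIs and is not the proposed implementation: the class EncodedBlockSketch, the EncodedBlock pair, and the toBlocks/encode/decode helpers are all hypothetical names, and the toy length-prefixed string encoding merely stands in for Avro's binary format. The point it shows is that blocks are emitted as (header, still-encoded bytes) pairs, and decoding happens as a separate step wherever the objects are actually needed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EncodedBlockSketch {

    /** Hypothetical key/value pair: header as key, undecoded block bytes as value. */
    static final class EncodedBlock {
        final String header;    // stands in for the container file header
        final int recordCount;  // number of records encoded in this block
        final byte[] data;      // decompressed but still encoded records

        EncodedBlock(String header, int recordCount, byte[] data) {
            this.header = header;
            this.recordCount = recordCount;
            this.data = data;
        }
    }

    /** Toy encoder (assumption: length-prefixed UTF strings, not Avro binary). */
    static byte[] encode(List<String> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String record : records) {
                out.writeUTF(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decoding is its own step, run only where the objects are needed. */
    static List<String> decode(EncodedBlock block) {
        List<String> records = new ArrayList<>(block.recordCount);
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(block.data))) {
            for (int i = 0; i < block.recordCount; i++) {
                records.add(in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    /** Simulates a container file: groups records into encoded blocks. */
    static List<EncodedBlock> toBlocks(String header, List<String> records, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<EncodedBlock> blocks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += blockSize) {
            List<String> chunk = records.subList(i, Math.min(i + blockSize, records.size()));
            blocks.add(new EncodedBlock(header, chunk.size(), encode(chunk)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        List<EncodedBlock> blocks = toBlocks("toy-header", records, 2);
        // Each block could be handed to a different worker as opaque bytes;
        // only the worker that needs the objects pays the decoding cost.
        for (EncodedBlock block : blocks) {
            System.out.println(block.header + ": " + block.data.length
                + " bytes, decoded " + decode(block));
        }
    }
}
```

In a real job the blocks would presumably be redistributed across workers before decode is called, which is where the idle-worker and re-encoding costs described above might be avoided.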



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
