avro-dev mailing list archives

From Piotr Wikieł (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (AVRO-1862) AvroOutputFormat saves compressed avro files without respecting codec's default extension
Date Thu, 21 Jul 2016 16:35:20 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388004#comment-15388004
] 

Piotr Wikieł edited comment on AVRO-1862 at 7/21/16 4:34 PM:
-------------------------------------------------------------

[~mike.hurley] let me explain how I use it. I have a tool that concatenates multiple
small Avro files (produced by a Kafka-HDFS pipeline) into one big file. This speeds up reading
a Hive partition backed by those files by about 40%. The output compression can be set as a parameter.


The Kafka-HDFS ingestion tool we use is Camus. It stores Avro files with the deflate codec, which
is (in one of a few scenarios) also our output compression. We also don't want to run concatenation
on files in already concatenated directories and, at the same time, we want to support late records.
The simplest way is to rely on the extension. Camus stores files with the {{.avro}} extension, so
we must use a different one (we calculate the size of the files with the {{.deflate.avro}} extension
and, if it is > 0, we run concatenation).
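
The size calculation mentioned above could be sketched roughly as follows. This is only an illustration, not code from camus-compressor: it uses the local filesystem through {{java.nio.file}} as a stand-in for HDFS, and the class and method names ({{ExtensionSizeSketch}}, {{totalSizeWithSuffix}}) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, not part of Avro, Camus, or camus-compressor.
final class ExtensionSizeSketch {

    private ExtensionSizeSketch() {
    }

    /**
     * Sums the sizes (in bytes) of the regular files located directly in
     * {@code dir} whose names end with {@code suffix}, e.g. ".deflate.avro".
     * Subdirectories are not traversed.
     */
    static long totalSizeWithSuffix(Path dir, String suffix) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(suffix))
                    .mapToLong(p -> {
                        try {
                            return Files.size(p);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    })
                    .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would compare the result for {{.deflate.avro}} with 0 and decide about concatenation as described above. A file named {{part-0.avro}} is not counted for that suffix, which is why a distinct extension is enough to tell the two kinds of files apart.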

I know there are many ways to achieve this goal, but I also think that a backward-compatible,
disabled-by-default feature that does not change everything around it (in code) could be accepted
without harm to the project, because it could be useful not only to me. But I have no
problem with you not accepting the patch, if you can convince me in some way ;)

If you want to see the code of the tool I mentioned, here it is: https://github.com/allegro/camus-compressor

Cheers! :)


was (Author: wikp):
[~mike.hurley] let me explain how I use it. I have a tool that concatenates multiple
small Avro files (produced by a Kafka-HDFS pipeline) into one big file. This speeds up reading
a Hive partition backed by those files by about 40%. The output compression can be set as a parameter.


The Kafka-HDFS ingestion tool we use is Camus. It stores Avro files with the deflate codec, which
is (in one of a few scenarios) also our output compression. We also don't want to run concatenation
on files in already concatenated directories and, at the same time, we want to support late records.
The simplest way is to rely on the extension. Camus stores files with the {{.avro}} extension, so
we must use a different one (we calculate the size of the files with the {{.deflate.avro}} extension
and, if it is > 0, we run concatenation).

I know there are many ways to achieve this goal, but I also think that a backward-compatible,
disabled-by-default feature that does not change everything around it (in code) could be accepted
without harm to the project, because it could be useful not only to me.

If you want to see the code of the tool I mentioned, here it is: https://github.com/allegro/camus-compressor

Cheers! :)

> AvroOutputFormat saves compressed avro files without respecting codec's default extension
> -----------------------------------------------------------------------------------------
>
>                 Key: AVRO-1862
>                 URL: https://issues.apache.org/jira/browse/AVRO-1862
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Piotr Wikieł
>            Priority: Minor
>         Attachments: AVRO-1862-1.patch, AVRO-1862.patch
>
>
> A common pattern in naming compressed files is to give them an extension derived from the
> compression codec, for example: {{.gz}}, {{.zip}}, {{.bz2}}.
> {{AvroOutputFormat}} currently does not respect this convention.
> I've adapted some code from Hadoop's {{TextOutputFormat}} in a backward-compatible manner,
> adding the following {{JobConf}} property:
> {{avro.mapred.output.extension.from-codec}} ({{boolean}}, default: {{false}}) - when
> set to {{true}}, the extension will be changed according to the above rule.
> EDIT: Please take a look at the first comment for an update. The file will get an extension
> such as {{.gz.avro}} or {{.snappy.avro}} when the above property is set to true.
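
The naming rule from the EDIT note above can be illustrated with a small sketch. It is not the code from the attached patch: the class and method names are hypothetical, and the codec's default extension (for example {{.gz}} or {{.snappy}}) is passed in as a plain string instead of being looked up from a Hadoop or Avro codec class. The boolean parameter stands for the value of {{avro.mapred.output.extension.from-codec}}.

```java
// Hypothetical illustration of the naming rule; not taken from AVRO-1862.patch.
final class CodecExtensionSketch {

    /** Extension used when the flag is off, or when no codec extension is known. */
    static final String AVRO_EXT = ".avro";

    private CodecExtensionSketch() {
    }

    /**
     * Returns the output file extension.
     *
     * @param fromCodec      assumed to mirror avro.mapred.output.extension.from-codec
     * @param codecExtension the codec's default extension including the leading dot,
     *                       e.g. ".gz"; may be null or empty if there is none
     */
    static String outputExtension(boolean fromCodec, String codecExtension) {
        if (!fromCodec || codecExtension == null || codecExtension.isEmpty()) {
            // Backward-compatible default: plain ".avro".
            return AVRO_EXT;
        }
        // Codec extension first, then ".avro", e.g. ".gz.avro".
        return codecExtension + AVRO_EXT;
    }

    public static void main(String[] args) {
        System.out.println(outputExtension(false, ".gz"));
        System.out.println(outputExtension(true, ".gz"));
        System.out.println(outputExtension(true, ".snappy"));
    }
}
```

Keeping {{.avro}} as the final suffix means that tools which select files by the {{.avro}} ending would still match the output, while the codec part in front of it stays visible in the file name.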



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
