avro-dev mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-684) Java tool for altering the codec of an Avro data file stream.
Date Mon, 25 Oct 2010 20:34:22 GMT

    [ https://issues.apache.org/jira/browse/AVRO-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924710#action_12924710 ]

Scott Carey commented on AVRO-684:
----------------------------------

Yes, this would be useful.

Most of the machinery for this is already in the DataFileWriter class.  It is not exposed
in a command-line tool though.

I currently use this machinery to take a large list of small avro files and merge them into
one larger avro file with a set compression type and level.

In addition to the compression level, there is the concept of forcing a re-encode.  By default,
the current code will not re-encode unless required, so it won't re-encode deflate:1 to
deflate:3 unless you pass the flag that forces a re-encode.  It will still decode deflate to
null or encode null to deflate by default, since changing the codec requires it.  If a block
is already compatible, it just copies the raw bytes of the block, which is very fast.
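
To make the rule above concrete, here is a rough sketch of the copy-or-recode decision as
I read it from the description.  It is not the DataFileWriter code; RecodePolicy, Codec,
Action and decide are hypothetical names made up for illustration, and "compatible" is
assumed to mean "same codec name, regardless of level".

```java
import java.util.Objects;

/** Hypothetical model of the per-block decision; not Avro's actual implementation. */
final class RecodePolicy {

    /** What to do with each block of the source file. */
    enum Action { COPY_RAW_BLOCK, RE_ENCODE }

    /** Minimal codec description: a name such as "null" or "deflate", plus a level. */
    static final class Codec {
        final String name;
        final int level; // only meaningful for "deflate"; ignored otherwise

        Codec(String name, int level) {
            this.name = Objects.requireNonNull(name, "name");
            this.level = level;
        }
    }

    /**
     * Decide how to transfer blocks from a source file into the output file.
     *
     * @param forceRecode the "force a re-encode" flag described above
     */
    static Action decide(Codec source, Codec target, boolean forceRecode) {
        if (forceRecode) {
            // The caller asked for it, e.g. to turn deflate:1 into deflate:3.
            return Action.RE_ENCODE;
        }
        if (source.name.equals(target.name)) {
            // Assumption: same codec name means the block is already compatible,
            // so the raw bytes can be copied even if the level differs.
            return Action.COPY_RAW_BLOCK;
        }
        // Different codec (deflate to null, or null to deflate): re-encoding is required.
        return Action.RE_ENCODE;
    }
}
```

With this sketch, deflate:1 to deflate:3 copies raw blocks unless forceRecode is set, while
null to deflate (or the reverse) always re-encodes.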

This tool should also support concatenation of files and creation of one larger file from
a collection of smaller ones (of the same schema) with the requested encoding.  Maybe something
like this:

{noformat}
$ avro-tools append_to -f outfile.avro -c deflate:5 infile.avro [infile2.avro ...]
{noformat}

This would create outfile.avro with codec deflate:5 from multiple source files.


> Java tool for altering the codec of an Avro data file stream.
> -------------------------------------------------------------
>
>                 Key: AVRO-684
>                 URL: https://issues.apache.org/jira/browse/AVRO-684
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Patrick Linehan
>
> An example is worth a thousand words:
>   cat infile.avro | avro-tools recodec deflate - - > outfile.avro
> The above example would create a new file, "outfile.avro", with the same contents as
> "infile.avro".  However, the codec of "outfile.avro" would be "deflate", regardless of the
> codec of "infile.avro".
> Proposed features:
> * The tool should preserve any metadata present in the input file.
> * Supported codecs will be "deflate" and "null".
> * Optionally add support for specifying the deflation level, perhaps with syntax as follows:
>   "deflate:N" where N is the deflation level, e.g. "deflate:4".
> Does this proposal sound reasonable?
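
The "deflate:N" syntax proposed in the issue description could be parsed along these lines.
This is only a sketch: CodecSpec and its fields are hypothetical names, the accepted level
range of 0 to 9 is an assumption borrowed from java.util.zip.Deflater, and -1 is used here
to mean "no level given, use the default".

```java
/** Hypothetical parser for codec arguments such as "null", "deflate" or "deflate:4". */
final class CodecSpec {

    /** Assumption: -1 stands for "no explicit level", as in java.util.zip.Deflater. */
    static final int DEFAULT_LEVEL = -1;

    final String name;
    final int level;

    private CodecSpec(String name, int level) {
        this.name = name;
        this.level = level;
    }

    /** Parse a codec argument, rejecting anything other than "null" and "deflate[:N]". */
    static CodecSpec parse(String spec) {
        // limit -1 keeps trailing empty strings, so "deflate:" is seen as two parts.
        String[] parts = spec.split(":", -1);
        String name = parts[0];

        if (name.equals("null")) {
            if (parts.length != 1) {
                throw new IllegalArgumentException("null codec takes no level: " + spec);
            }
            return new CodecSpec(name, DEFAULT_LEVEL);
        }

        if (name.equals("deflate")) {
            if (parts.length == 1) {
                return new CodecSpec(name, DEFAULT_LEVEL);
            }
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected deflate:N but got: " + spec);
            }
            int level;
            try {
                level = Integer.parseInt(parts[1]);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("bad deflate level in: " + spec, e);
            }
            if (level < 0 || level > 9) {
                throw new IllegalArgumentException("deflate level must be 0-9: " + spec);
            }
            return new CodecSpec(name, level);
        }

        throw new IllegalArgumentException("unsupported codec: " + spec);
    }
}
```

For example, CodecSpec.parse("deflate:4") yields name "deflate" with level 4, and
CodecSpec.parse("null") yields name "null" with no explicit level.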

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

