avro-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1307) Add an avro-tool to extract samples from avro files
Date Fri, 26 Apr 2013 00:14:16 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642410#comment-13642410 ]

Doug Cutting commented on AVRO-1307:

Vincenz, this looks great.  I'll commit this soon unless someone objects.
> Add an avro-tool to extract samples from avro files
> ---------------------------------------------------
>                 Key: AVRO-1307
>                 URL: https://issues.apache.org/jira/browse/AVRO-1307
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>         Environment: java
>            Reporter: Vincenz Priesnitz
>            Priority: Minor
>         Attachments: AVRO-1307-addedUnitTests-fixed.patch, AVRO-1307.patch
> It would be nice to have an avro-tool that picks only some records from Avro files.
> I implemented a new avro-tool, cat, which takes a list of Avro files with identical schemas
> and concatenates them into a single file, with options to discard the first n records,
> limit the output size, and collect records at a given sample rate.
> This tool allows a quicker peek into large avro files, e.g.:
> {code}
> java -jar avro-tools.jar cat input.avro output.avro --offset 50 --limit 10
> # creates output.avro that contains records
> # 51 to 60 from input.avro.
> {code}
> {code}
> java -jar avro-tools.jar cat input.avro output.avro --offset 1000 --limit 100 --samplerate
> # samples every hundredth record from input,
> # beginning at the 1000th record and limiting
> # the output to 100 records.
> {code}
> The tool accepts multiple input files or folders; for a folder, all files inside it
> are used as input.
> {code}
> java -jar avro-tools.jar cat data_folder output.avro --samplerate .01
> # reads all the files from the data folder and
> # writes every 100th record into the output file.
> {code}
> This tool uses the Hadoop FileSystem API to handle files from any supported filesystem.
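
The way the three options in the description interact (--offset, --limit, --samplerate) can be sketched in plain Java. This is not the code from the attached patch: the class CatSketch and the method select are hypothetical names, the records are plain in-memory objects rather than Avro data, and the sketch assumes that a sample rate of r keeps the first record after the offset and then every round(1/r)-th record. The patch may choose which record of each group to keep differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of offset/limit/samplerate semantics.
// Not taken from the AVRO-1307 patch; no Avro classes are used.
final class CatSketch {

    /**
     * Skips the first {@code offset} records, then keeps one record out of
     * every round(1 / samplerate), stopping once {@code limit} records
     * have been collected.
     */
    static <T> List<T> select(Iterable<T> records, long offset, long limit, double samplerate) {
        if (offset < 0 || limit < 0 || samplerate <= 0.0 || samplerate > 1.0) {
            throw new IllegalArgumentException("invalid offset, limit or samplerate");
        }
        // Assumption: the rate is turned into an integer step, e.g. .01 -> 100.
        long step = Math.max(1L, Math.round(1.0 / samplerate));

        List<T> out = new ArrayList<>();
        long skipped = 0;    // records discarded because of the offset
        long afterOffset = 0; // zero-based index among records past the offset
        for (T record : records) {
            if (out.size() >= limit) {
                break; // output limit reached
            }
            if (skipped < offset) {
                skipped++;
                continue;
            }
            if (afterOffset % step == 0) {
                out.add(record);
            }
            afterOffset++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            input.add(i);
        }
        // Mirrors "--offset 50 --limit 10" with no sampling.
        System.out.println(select(input, 50, 10, 1.0));
        // prints [51, 52, 53, 54, 55, 56, 57, 58, 59, 60]
    }
}
```

Using an integer step instead of accumulating the fractional rate avoids floating-point drift, at the cost of only approximating rates whose reciprocal is not a whole number.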

