avro-dev mailing list archives

From "Janosch Woschitz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1720) Add an avro-tool to count records in an avro file
Date Fri, 21 Jul 2017 10:51:00 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096128#comment-16096128 ]

Janosch Woschitz commented on AVRO-1720:

Just as a follow-up, since there does not seem to be much progress on this ticket: for the
moment I have made a separate binary available which allows efficient and convenient counting
of the records contained in a single Avro file or in a folder containing several Avro files.

The binaries, documentation and source are available here: https://github.com/jwoschitz/avrocount

This project tries to fill this gap, at least until similar functionality is provided by
avro-tools. Over time, several improvements have also been made to this project compared to
the original patch.

It would be great if these improvements also found their way back into the Apache Avro project
in the long term. Until then this project can be used in addition to the currently existing

> Add an avro-tool to count records in an avro file
> -------------------------------------------------
>                 Key: AVRO-1720
>                 URL: https://issues.apache.org/jira/browse/AVRO-1720
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Janosch Woschitz
>            Priority: Minor
>              Labels: starter
>         Attachments: AVRO-1720.patch, AVRO-1720-with-extended-unittests.patch
> If you're dealing with larger Avro files (>100MB) it would be nice to have a way to
quickly count the number of records contained in the file.
> With the current state of avro-tools, the only way to achieve this (to my current knowledge)
is to dump the data to JSON and count the records. For larger files this might take
a while because of the serialization overhead and because every record has to be read.
> I added a new tool which is optimized for counting records: it does not serialize the
records and reads only the block count for each block.
> {panel:title=Naive benchmark}
> {noformat}
> # the input file had a size of ~300MB
> $ du -sh sample.avro 
> 323M    sample.avro
> # using the new count tool
> $ time java -jar avro-tools.jar count sample.avro
> 331439
> real    0m4.670s
> user    0m6.167s
> sys     0m0.513s
> # the current way of counting records
> $ time java -jar avro-tools.jar tojson sample.avro | wc
> 331439 54904484 1838231743
> real    0m52.760s
> user    1m42.317s
> sys     0m3.209s
> # the overhead of wc is rather minor
> $ time java -jar avro-tools.jar tojson sample.avro > /dev/null
> real    0m47.834s
> user    0m53.317s
> sys     0m1.194s
> {noformat}
> {panel}
> This tool uses the HDFS API to handle files from any supported filesystem. I added the
unit tests to the already existing TestDataFileTools since it provided convenient utility
functions which I could reuse for my test scenarios.
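
For readers wondering what "reads only the block count for each block" can look like, here is a
minimal sketch. It is not the code from the attached patch or from avrocount. It assumes the
object container file layout described in the Avro specification (magic bytes, metadata map,
16-byte sync marker, then blocks that each start with a record count and a byte size). The class
name AvroBlockCounter and its helper methods are hypothetical. It uses only the Java standard
library, so it reads a local file rather than going through the HDFS API, and it does not verify
sync markers.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical illustration: sums per-block record counts without decoding any record.
final class AvroBlockCounter {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};
    private static final int SYNC_SIZE = 16;

    // Reads one Avro long (zig-zag varint); returns null on a clean EOF before the first byte.
    static Long readLong(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) return null;
        long raw = 0;
        int shift = 0;
        while (true) {
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
            b = in.read();
            if (b < 0) throw new EOFException("truncated varint");
        }
        return (raw >>> 1) ^ -(raw & 1); // undo zig-zag encoding
    }

    static long requireLong(InputStream in) throws IOException {
        Long v = readLong(in);
        if (v == null) throw new EOFException("truncated file");
        return v;
    }

    // Skips exactly n bytes, or throws if the stream ends first.
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) throw new EOFException("truncated file");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    static long countRecords(InputStream in) throws IOException {
        byte[] magic = new byte[MAGIC.length];
        new DataInputStream(in).readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) throw new IOException("not an Avro container file");

        // Header metadata is an Avro map<string, bytes>: chunks of entries, ended by a 0 count.
        for (long n = requireLong(in); n != 0; n = requireLong(in)) {
            if (n < 0) {          // a negative count is followed by the chunk's byte size
                n = -n;
                requireLong(in);
            }
            for (long i = 0; i < n; i++) {
                skipFully(in, requireLong(in)); // key
                skipFully(in, requireLong(in)); // value
            }
        }
        skipFully(in, SYNC_SIZE); // the file's sync marker

        long total = 0;
        Long blockCount;
        while ((blockCount = readLong(in)) != null) {
            long blockSize = requireLong(in);
            total += blockCount;
            skipFully(in, blockSize + SYNC_SIZE); // skip serialized records and sync marker
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            System.out.println(countRecords(in));
        }
    }
}
```

With Java 11 or later this sketch can be tried directly with
"java AvroBlockCounter.java sample.avro". Because the record bytes are skipped rather than
decoded, the cost depends mostly on the number of blocks and on I/O, not on the number of records.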

