hadoop-hdfs-issues mailing list archives

From "Sean Busbey (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11588) Output Avro format in the offline editlog viewer
Date Wed, 29 Mar 2017 14:11:41 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated HDFS-11588:
    Issue Type: New Feature  (was: Bug)

> Output Avro format in the offline editlog viewer
> ------------------------------------------------
>                 Key: HDFS-11588
>                 URL: https://issues.apache.org/jira/browse/HDFS-11588
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: tools
>            Reporter: Haohui Mai
>            Assignee: Haohui Mai
> We found it handy to import the edit logs into query engines (e.g., Hive or Presto)
to understand how the cluster is used. Example questions include:
> * The size of the data and the number of files that are written into a directory
> * The distribution of operations across different directories.
> * The number of files that are created by a user.
> The answers to these questions give insight into how the clusters are used and are of
significant value for capacity planning.
> Importing the edit log into a query engine makes these questions simpler to answer
and lets them be answered efficiently.
> While the Offline Editlog Viewer (OEV) can output editlogs as XML, we found that
converting the XML into a format that query engines recognize is time-consuming:
both generating the XML and transforming it take a significant amount of time.
In our environment it takes minutes to turn a 100MB editlog file into the
corresponding Parquet file.
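
To make the XML route concrete, here is a minimal sketch of the flattening step that sits between the OEV XML output and a query engine. It assumes the XML nests each operation in a RECORD element with an OPCODE child and a DATA child holding TXID, PATH and PERMISSION_STATUS/USERNAME; the sample document and all helper names are hypothetical and only illustrate the kind of per-record work involved. Treating every OP_ADD as a file creation is also a simplification.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import PurePosixPath

# Hand-written sample shaped like OEV XML output (layout is an assumption).
SAMPLE_XML = """<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA>
      <TXID>1</TXID>
      <PATH>/data/logs/a.txt</PATH>
      <PERMISSION_STATUS><USERNAME>alice</USERNAME></PERMISSION_STATUS>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA>
      <TXID>2</TXID>
      <PATH>/data/logs/b.txt</PATH>
      <PERMISSION_STATUS><USERNAME>bob</USERNAME></PERMISSION_STATUS>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA>
      <TXID>3</TXID>
      <PATH>/data/logs/a.txt</PATH>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA>
      <TXID>4</TXID>
      <PATH>/data/tmp/c.txt</PATH>
      <PERMISSION_STATUS><USERNAME>alice</USERNAME></PERMISSION_STATUS>
    </DATA>
  </RECORD>
</EDITS>
"""


def flatten_records(xml_text):
    """Yield one flat dict per RECORD, keeping only a few fields."""
    root = ET.fromstring(xml_text)
    for record in root.iter("RECORD"):
        data = record.find("DATA")
        if data is None:
            continue
        txid = data.findtext("TXID")
        yield {
            "opcode": record.findtext("OPCODE"),
            "txid": int(txid) if txid else None,
            "path": data.findtext("PATH"),  # None when the op has no path
            "user": data.findtext("PERMISSION_STATUS/USERNAME"),
        }


def ops_per_directory(rows):
    """Count operations keyed by (parent directory, opcode)."""
    counts = Counter()
    for row in rows:
        if row["path"]:
            parent = str(PurePosixPath(row["path"]).parent)
            counts[(parent, row["opcode"])] += 1
    return counts


def files_created_per_user(rows):
    """Count OP_ADD records per user (a rough proxy for files created)."""
    return Counter(
        row["user"] for row in rows if row["opcode"] == "OP_ADD" and row["user"]
    )


if __name__ == "__main__":
    rows = list(flatten_records(SAMPLE_XML))
    print(ops_per_directory(rows))
    print(files_created_per_user(rows))
```

Every record is parsed from text and rebuilt as a row before any query can run, which is the overhead that a directly generated binary format would avoid.
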
> This jira proposes extending the OEV to output Avro files to make this process efficient.
Since this is an internal tool, the Avro output format has certain pre-defined schemas but
is not constrained to maintain backward compatibility of its output, much like
the XML output format.

