hadoop-mapreduce-issues mailing list archives

From "Guanying Wang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-778) Need a standalone JobHistory log anonymizer
Date Thu, 01 Apr 2010 03:22:27 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guanying Wang updated MAPREDUCE-778:

    Attachment: same.py

An anonymizer implemented in Python is attached. It works with v20, v22, or rumen
log files. During anonymization, a private file of mapping tables is created, which can later
be used to de-anonymize the anonymized trace. When working with multiple traces, the tables
file can be used in two ways: grown incrementally or used standalone.
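
To illustrate the idea of a private tables file, here is a minimal sketch. It is not the code in
the attached anonymizer; the class name AnonymizationTable, the token format, and the JSON
on-disk layout are all assumptions made for this example. Passing the path of an existing
tables file gives the "grown incrementally" mode, and passing nothing gives a standalone table.

```python
import json
import os


class AnonymizationTable:
    """Hypothetical two-way mapping between real values and opaque tokens."""

    def __init__(self, path=None):
        self.forward = {}
        # Incremental mode: reuse the tables written while anonymizing an earlier trace.
        if path is not None and os.path.exists(path):
            with open(path) as f:
                self.forward = json.load(f)
        self.reverse = {token: key for key, token in self.forward.items()}

    def anonymize(self, kind, value):
        """Return a stable token for (kind, value), creating one if needed."""
        key = kind + ":" + value  # assumes kind itself contains no ":"
        if key not in self.forward:
            token = "%s_%d" % (kind, len(self.forward))
            self.forward[key] = token
            self.reverse[token] = key
        return self.forward[key]

    def deanonymize(self, token):
        """Map a token back to the original value using the private tables."""
        _kind, _sep, value = self.reverse[token].partition(":")
        return value

    def save(self, path):
        # This file must stay private: it is what makes de-anonymization possible.
        with open(path, "w") as f:
            json.dump(self.forward, f, indent=2, sort_keys=True)
```

Reusing one tables file across traces keeps the same user or host mapped to the same token in
every trace, while a standalone table makes tokens from different traces unrelated.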

Also attached is same.py, a simple Python script that compares two JSON-based trace files.
It works similarly to diff. Because JSON objects can be semantically equivalent even when the
keys in their dictionaries appear in different orders, running diff directly on two files may
not work as desired. The script outputs nothing if the two files represent the same trace;
otherwise it prints the objects (which can be large) that differ between the two files. v22
and rumen log files can be compared using this script. Keys in v20 log files have a fixed
order, so v20 log files can be compared using diff directly.
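
The kind of comparison described above can be sketched as follows. This is not the attached
same.py; the function name differing_objects is hypothetical, and it assumes each trace has
already been split into one JSON text per object. It relies on the fact that Python dict
equality ignores key order, so parsed objects can be compared with != directly.

```python
import json
from itertools import zip_longest

_MISSING = object()  # marks an object present in only one of the two traces


def differing_objects(texts_a, texts_b):
    """Return (index, a, b) for every position where the parsed objects differ."""
    objs_a = [json.loads(t) for t in texts_a]
    objs_b = [json.loads(t) for t in texts_b]
    pairs = zip_longest(objs_a, objs_b, fillvalue=_MISSING)
    # dict comparison is order-insensitive; list comparison is still order-sensitive.
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]


# Same content, different key order: nothing is reported.
print(differing_objects(['{"a": 1, "b": [1, 2]}'], ['{"b": [1, 2], "a": 1}']))
```

Note that arrays are still compared element by element in order, which is usually what is
wanted for a trace.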

Known issues:

1. In v22 and rumen-trace log files, multiple JSON objects are stored in one file, separated by
whitespace. Unlike the Java Jackson package, the Python json module can only load a single
JSON object from a string or a file. Currently, the scripts rely on detecting "}\n" as
a whole line to determine the end of a JSON object. That may fail if that particular pattern
occurs inside a string value. A better implementation would be similar to what Java Jackson does: an
object should be found from a file, leaving the rest of the file still operational for further

2. The sample rumen-trace and rumen-topology files were taken from hadoop-mapreduce/src/test/tools/data/rumen/.
These sample files seem to have been generated from v20 log files, since "." is escaped as "\."
in many fields. I'm not sure whether rumen works with v22 log files, or whether there are differences
between rumen files generated from v22 log files and those generated from v20 log files.
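
For known issue 1, one possible alternative to matching "}\n" is the standard library's
json.JSONDecoder.raw_decode, which parses one value from a given offset and reports where it
stopped. The sketch below is an assumption about how the scripts could be changed, not what they
currently do; the helper name iter_json_objects is hypothetical, and it expects the whole file
contents as a single string.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of whitespace-separated JSON values."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # Returns the parsed value and the index just past it in the text.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


sample = '{"a": {"b": 1}}\n\n{"c": [1, 2]}  \n{"d": "}"}\n'
for obj in iter_json_objects(sample):
    print(obj)
```

Because the decoder tracks string and nesting state itself, a "}" inside a string value does
not end the object early. The trade-off in this sketch is that the file is read into memory first.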

> Need a standalone JobHistory log anonymizer
> -------------------------------------------
>                 Key: MAPREDUCE-778
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-778
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Hong Tang
>         Attachments: anonymizer.py, same.py
> Job history logs contain a rich set of information that can help understand and characterize
cluster workload and individual job execution. Examples of work that parses or utilizes job
history include HADOOP-3585, MAPREDUCE-534, HDFS-459, MAPREDUCE-728, and MAPREDUCE-776. Some
of the parsing tools developed in previous work already contain a component to anonymize
the logs. It would be nice to combine these efforts and have a common standalone tool that
can anonymize job history logs and preserve much of the structure of the files so that existing
tools on top of job history logs continue to work with no modification.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
