hadoop-mapreduce-issues mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-778) Need a standalone JobHistory log anonymizer
Date Fri, 02 Apr 2010 09:06:27 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852754#action_12852754 ]

Hong Tang commented on MAPREDUCE-778:
-------------------------------------

Guanying, thanks for taking the effort.

Although it seems versatile to have the tool parse all types of formats, I am concerned
that the effort of maintaining such versatility may outweigh its potential usefulness. I think
it is preferable to implement the tool on top of the Rumen API (and probably as part of Rumen).
There are a number of reasons why this makes sense:

- As we discovered in Rumen development, parsing job history is not trivial, and the format
could continue evolving in the near future (the data model is not cleanly defined as-is IMO,
see MAPREDUCE-1175). So it is advantageous to let Rumen be the only module that interfaces with
the different variations of the job history format and presents a common abstraction of job history.
- Job history contains more than the basic information about job execution; it also contains
things like status strings and counters, and we have less control over what fields may
be added to job history logs over time. So it would be challenging to keep the anonymizer
up to date with high confidence that it would not leak any private info. On the other hand,
since Rumen only extracts a subset of info from the job history logs, we can easily enumerate
every field of the Rumen JSON objects to be sure any sensitive fields are anonymized.
- Currently some of the info we want about job execution is only available in the jobconf XML
file (such as the queue name). Rumen does the job of combining them, so building the anonymizer
on top of Rumen saves the effort of writing another configuration parser.
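
To illustrate the second point, here is a minimal sketch of enumerating every field of a
job record and refusing to emit anything that has not been classified. The field names
and the anonymize_job helper are hypothetical and only stand in for the Rumen JSON
schema, which may differ.

```python
# Hypothetical field names; the real Rumen JSON schema may differ.
SENSITIVE_FIELDS = {"user", "jobName", "queue"}
SAFE_FIELDS = {"jobID", "submitTime", "finishTime", "totalMaps", "totalReduces"}


def anonymize_job(job, pseudonym):
    """Return a copy of `job` with every sensitive field replaced.

    `pseudonym` is a caller-supplied function (kind, value) -> replacement.
    A field that is neither sensitive nor known-safe raises ValueError,
    so a newly added field cannot leak through unnoticed.
    """
    result = {}
    for key, value in job.items():
        if key in SENSITIVE_FIELDS:
            result[key] = pseudonym(key, value)
        elif key in SAFE_FIELDS:
            result[key] = value
        else:
            raise ValueError("unclassified field: %s" % key)
    return result
```

Failing on unclassified fields is a deliberate choice in this sketch: when the schema
grows, the tool stops instead of silently passing the new field through.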

Other comments:
- I like the idea of storing a private "lookup table" and keeping the ability to "de-anonymize"
the trace if we choose to.
- The coverage of anonymization fields and the way they are anonymized look good to me.
(We need to add the "queue" entity, and I do not think we need the "/path" type.)
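
A rough sketch of such a private lookup table, kept per entity type, might look like the
following. The LookupTable class and its method names are hypothetical and are not taken
from the attached scripts; the pseudonym format (entity type plus a counter) is likewise
an assumption.

```python
import json


class LookupTable:
    """Maps original values to stable pseudonyms, per entity type."""

    def __init__(self):
        # entity type (e.g. "user", "queue") -> {original value: pseudonym}
        self.forward = {}

    def anonymize(self, kind, value):
        """Return the pseudonym for `value`, allocating one on first use."""
        table = self.forward.setdefault(kind, {})
        if value not in table:
            table[value] = "%s%d" % (kind, len(table))
        return table[value]

    def deanonymize(self, kind, pseudonym):
        """Recover the original value; raises KeyError if it is unknown."""
        for original, alias in self.forward.get(kind, {}).items():
            if alias == pseudonym:
                return original
        raise KeyError(pseudonym)

    def dumps(self):
        """Serialize the table; the result must be stored privately."""
        return json.dumps(self.forward, sort_keys=True)

    @classmethod
    def loads(cls, text):
        """Rebuild a table from the output of dumps()."""
        table = cls()
        table.forward = json.loads(text)
        return table
```

Because the same value always maps to the same pseudonym, relationships in the trace
(for example, which jobs share a user or a queue) are preserved, and whoever holds the
serialized table can reverse the mapping later.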


> Need a standalone JobHistory log anonymizer
> -------------------------------------------
>
>                 Key: MAPREDUCE-778
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-778
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Hong Tang
>         Attachments: anonymizer.py, same.py
>
>
> Job history logs contain a rich set of information that can help understand and characterize
cluster workload and individual job execution. Examples of work that parses or utilizes job
history include HADOOP-3585, MAPREDUCE-534, HDFS-459, MAPREDUCE-728, and MAPREDUCE-776. Some
of the parsing tools developed in previous work already contain a component to anonymize
the logs. It would be nice to combine these efforts and have a common standalone tool that
can anonymize job history logs and preserve much of the structure of the files, so that existing
tools built on top of job history logs continue to work with no modification.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

