hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-342) Design/Implement a tool to support archival and analysis of logfiles.
Date Tue, 04 Jul 2006 08:32:30 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-342?page=comments#action_12419061 ] 

Doug Cutting commented on HADOOP-342:

I will look at your patch more closely soon.

Rather than copying the logs into DFS, I think it would be better to retrieve the map input
over HTTP.  Ideally, map tasks would be assigned to nodes where the log data is local.

This could be implemented as an InputFormat that is parameterized by date.  For example, one
might specify something like:

job.set("log.input.start", "2006-07-13 12:00:00");
job.set("log.input.end", "2006-07-13 15:00:00");

The set of hosts can be determined automatically to be all hosts in the cluster.  One could
also specify a job id, in which case the job's start and end time would be used, or a start
job id and end job id.
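
As a rough illustration of the date-range parameterization above, here is a minimal plain-Java
sketch. It is not Hadoop code and not an InputFormat: the class name LogRangeFilter is
hypothetical, a Map stands in for the job configuration, and it assumes that every log line
begins with a "yyyy-MM-dd HH:mm:ss" timestamp and that the end of the range is exclusive.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): keeps log lines whose leading
// timestamp falls inside the configured range.
final class LogRangeFilter {
    // Same layout as the "log.input.start" / "log.input.end" values.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    private static final int TIMESTAMP_LENGTH = 19;

    private final LocalDateTime start;
    private final LocalDateTime end;

    // The Map is a stand-in for the job configuration.
    LogRangeFilter(Map<String, String> conf) {
        this.start = LocalDateTime.parse(conf.get("log.input.start"), FORMAT);
        this.end = LocalDateTime.parse(conf.get("log.input.end"), FORMAT);
    }

    // Assumption: a line starts with a 19-character timestamp.
    // Lines that are too short or do not parse are rejected.
    boolean accepts(String line) {
        if (line.length() < TIMESTAMP_LENGTH) {
            return false;
        }
        try {
            LocalDateTime t =
                LocalDateTime.parse(line.substring(0, TIMESTAMP_LENGTH), FORMAT);
            // Half-open range: start inclusive, end exclusive (an assumption).
            return !t.isBefore(start) && t.isBefore(end);
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (accepts(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("log.input.start", "2006-07-13 12:00:00");
        conf.put("log.input.end", "2006-07-13 15:00:00");
        List<String> lines = Arrays.asList(
            "2006-07-13 11:59:59 INFO before the range",
            "2006-07-13 12:30:00 FATAL inside the range",
            "2006-07-13 15:00:00 INFO at the exclusive end");
        for (String line : new LogRangeFilter(conf).filter(lines)) {
            System.out.println(line);
        }
    }
}
```

A real InputFormat would additionally have to decide how to split the range across hosts;
this sketch only covers the per-line selection step.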

We might implement parts of this by enhancing the web server run on each tasktracker, e.g.,
to directly support access to logs by date range.

Does this make sense?

> Design/Implement a tool to support archival and analysis of logfiles.
> ---------------------------------------------------------------------
>          Key: HADOOP-342
>          URL: http://issues.apache.org/jira/browse/HADOOP-342
>      Project: Hadoop
>         Type: New Feature

>     Reporter: Arun C Murthy
>  Attachments: logalyzer.patch
> Requirements:
>   a) Create a tool to support archival of logfiles (from diverse sources) in hadoop's dfs.
>   b) The tool should also support analysis of the logfiles via grep/sort primitives.
>      The tool should allow for fairly generic pattern 'grep's and let users 'sort' the
>      matching lines (from grep) on 'columns' of their choice.
>   E.g. from hadoop logs: Look for all log-lines with 'FATAL' and sort them based on timestamps
>   (column x) and then on column y (column x, followed by column y).
> Design/Implementation:
>   a) Log Archival
>     Archival of logs from diverse sources can be accomplished using the *distcp* tool.
>   b) Log analysis
>     The idea is to enable users of the tool to perform analysis of logs via grep/sort.
>     This can be accomplished via a relatively simple Map-Reduce task where the map does
>     the *grep* for the given pattern via RegexMapper and then the implicit *sort* (reducer) is
>     used with a custom Comparator which performs the user-specified comparison (columns).
>     The sort/grep specs can be fairly powerful by letting the user of the tool use java's
>     built-in regex patterns (java.util.regex).
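
The grep-then-sort analysis described in the issue can be sketched in plain Java as follows.
This is an in-memory stand-in under stated assumptions, not the attached patch and not a
Map-Reduce job: the class GrepSort and its methods are hypothetical, columns are taken to be
whitespace-separated and 0-based, and values are compared as strings. The filter loop plays
the role of the map-side grep, and the Comparator plays the role of the custom sort comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration: grep lines with java.util.regex, then sort the
// matches on user-chosen columns.
final class GrepSort {
    // Orders lines by the given whitespace-separated columns (0-based), in
    // the order listed. A missing column compares as the empty string.
    static Comparator<String> byColumns(int... columns) {
        return (a, b) -> {
            String[] fieldsA = a.split("\\s+");
            String[] fieldsB = b.split("\\s+");
            for (int c : columns) {
                String valueA = c < fieldsA.length ? fieldsA[c] : "";
                String valueB = c < fieldsB.length ? fieldsB[c] : "";
                int cmp = valueA.compareTo(valueB);
                if (cmp != 0) {
                    return cmp;
                }
            }
            return 0;
        };
    }

    // Keeps the lines in which the regex matches anywhere, sorted by columns.
    static List<String> grepSort(List<String> lines, String regex, int... columns) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                matches.add(line);
            }
        }
        matches.sort(byColumns(columns));
        return matches;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2006-07-04 08:32:30 FATAL disk",
            "2006-07-04 08:30:00 INFO start",
            "2006-07-04 08:30:00 FATAL net");
        // Sort FATAL lines on column 1 (time), then column 3 (last field).
        for (String line : grepSort(lines, "FATAL", 1, 3)) {
            System.out.println(line);
        }
    }
}
```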

