hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-6107) Have some log messages designed for machine parsing, either real-time or post-mortem
Date Thu, 25 Jun 2009 13:01:08 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724076#action_12724076 ]

Steve Loughran edited comment on HADOOP-6107 at 6/25/09 5:59 AM:
-----------------------------------------------------------------

As examples of the problem, here are some client-side logs:

{code}
     [java] 09/06/25 13:41:07 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:41:07 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:41:10 INFO mapred.JobClient: Task Id : attempt_200906251314_0002_r_000001_0, Status : FAILED
     [java] Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
     [java] 09/06/25 13:41:10 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:41:10 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:44:07 INFO mapred.JobClient: Task Id : attempt_200906251314_0002_m_000004_0, Status : FAILED
     [java] Too many fetch-failures
     [java] 09/06/25 13:44:07 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:44:07 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:44:11 INFO mapred.JobClient:  map 83% reduce 0%
     [java] 09/06/25 13:44:14 INFO mapred.JobClient:  map 100% reduce 0%
     [java] 09/06/25 13:49:23 INFO mapred.JobClient: Task Id : attempt_200906251314_0002_m_000005_0, Status : FAILED
     [java] Too many fetch-failures
     [java] 09/06/25 13:49:23 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:49:23 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:49:27 INFO mapred.JobClient:  map 83% reduce 0%
{code}

# Bad spacing in the "Error reading task outputConnection refused" message: nothing separates "output" from "Connection refused".
# Not enough context as to why the connection was being refused: the message needs to include the (hostname, port) details, which would change the message and break Chukwa.
# No stack trace in the connection refused message.
# Not enough context in the JobClient messages; if more than one job is running simultaneously, you can't determine which job the map and reduce percentages refer to.
# The shuffle error doesn't actually say what the MAX_FAILED_UNIQUE_FETCHES value is.
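
The points above could be addressed roughly as follows. This is only a sketch under stated assumptions: the class and method names (ContextualMessages, fetchFailure, shuffleLimitExceeded, progress, stackTrace) are hypothetical and are not the existing JobClient code, and the host, port and limit values in main() are made up for illustration.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.net.ConnectException;

/** Hypothetical helpers; none of these names exist in Hadoop. */
final class ContextualMessages {

    /** Adds a separator after the prefix and names the (hostname, port) that was contacted. */
    static String fetchFailure(String host, int port, Throwable cause) {
        return "Error reading task output from " + host + ":" + port + ": " + cause.getMessage();
    }

    /** States the limit that was exceeded instead of only naming the constant. */
    static String shuffleLimitExceeded(int failedFetches, int limit) {
        return "Shuffle Error: " + failedFetches
            + " failed unique fetches exceeds MAX_FAILED_UNIQUE_FETCHES=" + limit
            + "; bailing out.";
    }

    /** Prefixes progress with the job ID so that concurrent jobs can be told apart. */
    static String progress(String jobId, int mapPercent, int reducePercent) {
        return jobId + ": map " + mapPercent + "% reduce " + reducePercent + "%";
    }

    /** Renders the full stack trace, including any "Caused by:" chain. */
    static String stackTrace(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    public static void main(String[] args) {
        // Example values only: the host, port and limit are assumptions.
        IOException e = new ConnectException("Connection refused");
        System.out.println(fetchFailure("tasktracker1.example.org", 50060, e));
        System.out.println(shuffleLimitExceeded(5, 4));
        System.out.println(progress("job_200906251314_0002", 83, 0));
        System.out.print(stackTrace(e));
    }
}
```

Note that every one of these changes alters the message text, which is exactly the kind of edit that breaks a tool matching on the old strings.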

> Have some log messages designed for machine parsing, either real-time or post-mortem
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6107
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6107
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 0.21.0
>            Reporter: Steve Loughran
>
> Many programs take the log output of parts of Hadoop and try to parse it. Some may also put their own back end behind commons-logging, to capture the input without going via Log4J, so as to keep the output more machine-readable.
> These programs need log messages that
> # are easy to parse with a regexp or other simple string parsing (consider quoting values, etc.)
> # push out the full exception chain rather than stringify() bits of it
> # stay stable across versions
> # log the things the tools need to analyse: events, data volumes, errors
> For these logging tools, ease of parsing, retention of data and stability over time take priority over readability. In HADOOP-5073, Jiaqi Tan proposed marking some of the existing log events as evolving towards stability. As someone who regularly patches log messages to improve diagnostics, I find this creates a conflict of interest. For me, good logs are ones that help people debug their problems without anyone else helping, and if that means improving the text, so be it. Tools like Chukwa have a different need.
> What to do? Some options:
>  # Have some messages that are designed purely for other programs to handle
>  # Have some logs specifically for machines, to which we log alongside the human-centric messages
>  # Fix many of the common messages, then leave them alone.
>  # Mark log messages to be left alone (somehow)
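
To make the first two options concrete, here is a minimal sketch of what a message designed purely for programs might look like: every value is quoted and escaped, so a single regexp can recover the fields. The key="value" format, the MachineLogLine class and the TASK_FAILED event name are all assumptions made for illustration; nothing here is an existing Hadoop or Chukwa format.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical format: space-separated key="value" pairs with backslash escaping. */
final class MachineLogLine {

    // A key, then a quoted value in which backslashes and double quotes are escaped.
    private static final Pattern PAIR =
        Pattern.compile("(\\w+)=\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Formats fields in iteration order, quoting and escaping every value. */
    static String format(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> f : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String escaped = f.getValue().replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append(f.getKey()).append("=\"").append(escaped).append('"');
        }
        return sb.toString();
    }

    /** Parses a line produced by format() back into its fields. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            // Undo the escaping by dropping the backslash before each escaped character.
            fields.put(m.group(1), m.group(2).replaceAll("\\\\(.)", "$1"));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> event = new LinkedHashMap<>();
        event.put("event", "TASK_FAILED"); // hypothetical event name
        event.put("task", "attempt_200906251314_0002_r_000001_0");
        event.put("reason", "Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.");
        String line = format(event);
        System.out.println(line);
        // Round trip: the parsed fields should equal the original map.
        System.out.println(parse(line).equals(event));
    }
}
```

With a format like this, the human-centric text can keep changing inside a quoted value while the keys stay stable across versions. Keys are assumed to be plain word characters, and values are assumed not to contain line breaks.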

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

