hadoop-mapreduce-issues mailing list archives

From "Koji Noguchi (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-1238) mapred metrics shows negative count of waiting maps and reduces
Date Thu, 15 Mar 2012 22:52:37 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi updated MAPREDUCE-1238:
------------------------------------

    Attachment: MAPREDUCE-1238-v0.20-1.patch

This is not my patch. A dev pointed out the fix internally, but nobody followed up, so I am
uploading it here to see if it makes sense.

Copying and pasting his comment:
{noformat} 
I tried to reproduce this on a small cluster (60 nodes) with hadoop 0.20.202.

Steps to reproduce the issue:
===============================
1. Set up a file sink for the jobtracker, so that the waiting_maps counter is
written to a separate file.
2. Use queue configs similar to those in production (I took the nitroblue config).
3. Submit a simple sort job with -Dmapred.job.queue.name="search_general". This
queue must not exist in the cluster. The waiting_maps counter then goes
negative. An example is given below.

1298346748441 mapred.jobtracker: context=mapred, sessionId=,
hostName=gsta90014.tan.ygrid.yahoo.com, waiting_maps=-120, waiting_reduces=-16,
jobs_failed=1, jobs_preparing=0

Problem:
========
1. waiting_maps is incremented in JobInProgress.initTasks(). If job submission
fails with an exception before the tasks are initialized, JobInProgress still
decrements waiting_maps in garbageCollect(), even though nothing was added.
This causes negative values in waiting_maps and waiting_reduces.

Tried the following code change in JobInProgress as a fix:
==========================================================
// Check whether tasks were initialized, and decrement the waiting counters only if so.
if (tasksInited) {
  // Let the JobTracker know that a job is complete
  jobtracker.getInstrumentation().decWaitingMaps(getJobID(), pendingMaps());
  jobtracker.getInstrumentation().decWaitingReduces(getJobID(), pendingReduces());
}

The above logic still needs to be reviewed by a dev.

JobTracker logs when the problem was observed:
==============================================
11/02/22 03:52:22 INFO ipc.Server: SASL server context established. Negotiated
QoP is auth
11/02/22 03:52:22 INFO ipc.Server: SASL server successfully authenticated
client: gridperf@DEV.YGRID.YAHOO.COM
11/02/22 03:52:22 INFO ipc.Server: Auth successfull for
gridperf@DEV.YGRID.YAHOO.COM
11/02/22 03:52:22 INFO authorize.ServiceAuthorizationManager: Authorization
successfull for gridperf@DEV.YGRID.YAHOO.COM for protocol=interface
org.apache.hadoop.mapred.JobSubmissionProtocol
11/02/22 03:52:23 INFO token.DelegationTokenRenewal: registering token for
renewal for service =98.138.162.177:8020 and jobID = job_201102220351_0001
11/02/22 03:52:23 INFO mapred.JobInProgress: job_201102220351_0001: nMaps=120
nReduces=16 max=200000
11/02/22 03:52:23 INFO hdfs.DFSClient: Renewing HDFS_DELEGATION_TOKEN token
1940 for gridperf on 98.138.162.177:8020
11/02/22 03:52:23 INFO mapred.JobInProgress$JobSummary:
jobId=job_201102220351_0001,submitTime=1298346743640,launchTime=0,,finishTime=1298346743724,numMaps=0,numSlotsPerMap=1,numReduces=0,numSlotsPerReduce=1,user=gridperf,queue=search_general,status=FAILED,mapSlotSeconds=0,reduceSlotsSeconds=0,clusterMapCapacity=0,clusterReduceCapacity=0
11/02/22 03:52:23 INFO mapred.JobHistory: No file for job-history with
job_201102220351_0001 found in cache!
11/02/22 03:52:23 INFO mapred.JobHistory: No file for jobconf with
job_201102220351_0001 found in cache!
11/02/22 03:52:23 INFO hdfs.DFSClient: Cancelling HDFS_DELEGATION_TOKEN token
1940 for gridperf on 98.138.162.177:8020
11/02/22 03:52:23 INFO ipc.Server: IPC Server handler 1 on 8021, call
submitJob(job_201102220351_0001,
hdfs://gsta90013.tan.ygrid.yahoo.com/grid/0/daytona/hadoop/tmp/mapred/staging/gridperf/.staging/job_201102220351_0001,
org.apache.hadoop.security.Credentials@7051630a) from 98.138.162.177:45951:
error: java.io.IOException: Queue "search_general" does not exist
java.io.IOException: Queue "search_general" does not exist
    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3930)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1380)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1378)
{noformat}
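
To make the failure mode in the comment concrete, here is a minimal, self-contained sketch of the accounting. It is not Hadoop code: WaitingGauge, SketchJob and WaitingMapsSketch are hypothetical stand-ins for the JobTracker instrumentation and JobInProgress, and the 120 maps are taken from the nMaps=120 log line. It assumes, as the comment describes, that the gauge is only incremented in initTasks() and that garbageCollect() decrements it by pendingMaps().

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the JobTracker instrumentation (not the real Hadoop class).
class WaitingGauge {
    private final AtomicInteger waitingMaps = new AtomicInteger();

    void addWaitingMaps(int n) { waitingMaps.addAndGet(n); }
    void decWaitingMaps(int n) { waitingMaps.addAndGet(-n); }
    int waitingMaps() { return waitingMaps.get(); }
}

// Hypothetical, heavily simplified model of JobInProgress.
class SketchJob {
    private final WaitingGauge gauge;
    private final int numMaps;
    private final boolean guardOnTasksInited; // true = apply the proposed check
    private boolean tasksInited = false;

    SketchJob(WaitingGauge gauge, int numMaps, boolean guardOnTasksInited) {
        this.gauge = gauge;
        this.numMaps = numMaps;
        this.guardOnTasksInited = guardOnTasksInited;
    }

    // The only place where the gauge is incremented.
    void initTasks() {
        gauge.addWaitingMaps(numMaps);
        tasksInited = true;
    }

    // No task ever runs in this sketch, so every map is still pending.
    int pendingMaps() { return numMaps; }

    // Called on job cleanup, including when submission fails early.
    void garbageCollect() {
        if (!guardOnTasksInited || tasksInited) {
            gauge.decWaitingMaps(pendingMaps());
        }
    }
}

class WaitingMapsSketch {
    // Submission is rejected (e.g. unknown queue) before initTasks() runs.
    static int failBeforeInit(boolean guard) {
        WaitingGauge gauge = new WaitingGauge();
        SketchJob job = new SketchJob(gauge, 120, guard);
        job.garbageCollect();
        return gauge.waitingMaps();
    }

    // Tasks are initialized first, then the job is cleaned up.
    static int initThenCollect(boolean guard) {
        WaitingGauge gauge = new WaitingGauge();
        SketchJob job = new SketchJob(gauge, 120, guard);
        job.initTasks();
        job.garbageCollect();
        return gauge.waitingMaps();
    }

    public static void main(String[] args) {
        System.out.println("unguarded, fail before init: " + failBeforeInit(false));
        System.out.println("guarded, fail before init:   " + failBeforeInit(true));
        System.out.println("guarded, init then collect:  " + initThenCollect(true));
    }
}
```

In this sketch the unguarded early-failure path leaves the gauge at -120, matching the waiting_maps=-120 line in the metrics file, while the guarded path leaves it at 0. A job that does reach initTasks() nets out to 0 with or without the guard.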
                
> mapred metrics shows negative count of waiting maps and reduces 
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1238
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1238
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>            Reporter: Ramya Sunil
>         Attachments: MAPREDUCE-1238-v0.20-1.patch
>
>
> Negative waiting_maps and waiting_reduces counts are observed in the mapred metrics.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
