Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Fri, 13 Jul 2012 20:36:33 +0000 (UTC)
From: "Rahul Jain (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <2067639778.50192.1342211794683.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <948862482.50075.1342211074866.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Updated] (MAPREDUCE-4442) Accessing hadoop counters from a
 job is unreliable in yarn during in AM process cleanup  window
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/MAPREDUCE-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Jain updated MAPREDUCE-4442:
----------------------------------

    Attachment: am_logs_counter_failure.html

Attached AM logs for the full job; the timestamps should correlate to the application timestamps.
                
> Accessing hadoop counters from a job is unreliable in yarn during in AM process cleanup  window
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4442
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4442
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>         Attachments: am_logs_counter_failure.html
>
>
> We found this issue during our tests moving from MapReduceV1 to MapReduceV2. A few of our applications access job counters multiple times:
> a) After submission of job, while job is execution (works fine)
> b) Right after job complete notification is received (works fine)
> c) Few seconds after job complete notification (fails most of the times).
> The error snippet is as follows:
> {code}
> 2012-07-12 19:12:29,039 WARN  [Client] Unexpected error reading responses on connection Thread[IPC Client (1252749669) connection to sjc1-ciq-ibm-grid07.carrieriq.com/10.202.50.187:47944 from hadoop,5,main]
> java.lang.NullPointerException
> 	at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852)
> 	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781)
> 2012-07-12 19:12:29,044 INFO  [ClientServiceDelegate] Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> 2012-07-12 19:12:29,132 INFO  [ClientServiceDelegate] Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> 2012-07-12 19:12:29,216 ERROR [UserGroupInformation] PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException
> 2012-07-12 19:12:29,216 WARN  [BaseOutputStageJob] getJobCounters: Unable to retrieve counters. null
> java.io.IOException
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:335)
> 	at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:470)
> 	at org.apache.hadoop.mapreduce.Job$8.run(Job.java:719)
> 	at org.apache.hadoop.mapreduce.Job$8.run(Job.java:716)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> 	at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:716)
> 	at org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:396)
> {code}
> The connection to 10.202.50.187:47944 is actually the connection to AM; appears that we are connecting to AM to get the counters for the successful job and not the history server.
>  
> I'll attach the logs for AM and resource mgr separately, however no unusual activity is seen in those.
> This makes me suspect that we have a race condition in the code trying to access job counters when AM is finishing up and the job hasn't moved to history server yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira