From: Arun C Murthy
To: common-dev@hadoop.apache.org
Subject: Re: Checksum Error during Reduce Phase hadoop-1.0.2
Date: Thu, 16 Aug 2012 11:34:06 -0700

Primarily, it could be caused by a corrupt disk - which is why checking if it's happening on specific node(s) can help.

Arun

On Aug 16, 2012, at 10:04 AM, Pavan Kulkarni wrote:

> Harsh,
>
> I see this on a couple of nodes. But what may be the cause of this error? Any
> idea about it? Thanks
>
> On Sun, Aug 12, 2012 at 9:06 AM, Harsh J wrote:
>
>> Hi Pavan,
>>
>> Do you see this happen on a specific node every time (i.e. when the
>> reducer runs there)?
>>
>> On Fri, Aug 10, 2012 at 11:43 PM, Pavan Kulkarni wrote:
>>> Hi,
>>>
>>> I am running a Terasort on a cluster of 8 nodes. The map phase completes,
>>> but when the reduce phase is around 68-70% I get the following error.
>>>
>>> 12/08/10 11:02:36 INFO mapred.JobClient: Task Id : attempt_201208101018_0001_r_000027_0, Status : FAILED
>>> java.lang.RuntimeException: problem advancing post rec#38320220
>>>     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1214)
>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:249)
>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:245)
>>>     at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:40)
>>>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:416)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>> Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
>>>     at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
>>>     at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
>>>     at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
>>>     at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
>>>     at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
>>>     at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:374)
>>>     at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$RawKVIteratorReader.next(ReduceTask.java:2531)
>>>     at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
>>>     at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:1253)
>>>     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1212)
>>>     ... 10 more
>>>
>>> I came across someone facing the same issue
>>> <http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201001.mbox/%3C1c802db51001280427j5b8e57dai4a8d0fdd038f41@mail.gmail.com%3E>
>>> in the mail archives, and he seemed to resolve it by listing hostnames in
>>> the /etc/hosts file. All my nodes have correct hostname information in
>>> /etc/hosts, but I still have these reducers throwing the error.
>>> Any help regarding this issue is appreciated. Thanks
>>>
>>> --
>>>
>>> --With Regards
>>> Pavan Kulkarni
>>
>> --
>> Harsh J
>
> --
>
> --With Regards
> Pavan Kulkarni

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
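To illustrate why a bad disk can surface as the ChecksumException in the trace above, here is a minimal sketch of checksum verification in general. It is not Hadoop's IFile code: the class name ChecksumSketch, the sample record, and the use of java.util.zip.CRC32 are assumptions chosen only for illustration. The idea is that a checksum is recorded when data is written, recomputed when the data is read back, and a single flipped bit in between makes the two disagree.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; this is not the Hadoop implementation.
public class ChecksumSketch {

    /** Computes a CRC32 checksum over the given bytes. */
    public static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Returns true when the data still matches the checksum recorded at write time. */
    public static boolean verify(byte[] data, long expected) {
        return checksum(data) == expected;
    }

    public static void main(String[] args) {
        // A made-up record standing in for intermediate map output.
        byte[] record = "key-0001\tvalue-0001".getBytes(StandardCharsets.UTF_8);

        // "Write" side: remember the checksum alongside the data.
        long stored = checksum(record);

        // Simulate a disk flipping one bit somewhere in the stored bytes.
        byte[] corrupted = record.clone();
        corrupted[3] ^= 0x01;

        // "Read" side: recompute and compare.
        System.out.println("intact record verifies:    " + verify(record, stored));
        System.out.println("corrupted record verifies: " + verify(corrupted, stored));
    }
}
```

Because the mismatch is detected where the data is read, not where it was damaged, the error message alone does not say which machine is at fault. That is why it helps to check whether the failing attempts keep involving the same node(s).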