From: "Riccardo (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Wed, 14 Feb 2007 11:51:06 -0800 (PST)
Subject: [jira] Commented: (HADOOP-1014) map/reduce is corrupting data between map and reduce

    [ https://issues.apache.org/jira/browse/HADOOP-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473170 ]

Riccardo commented on HADOOP-1014:
----------------------------------

No exceptions besides the assertion failures. I am uploading the patch to
TestMapRed. As I mentioned before, it doesn't fail consistently, but when it
does fail it is simply because it loses (key, value) pairs; no other exception
is reported. At least on our cluster, the failure rate is 0% for fewer than
10M keys and close to 40-50% for 100M+ keys.

Hope this helps, but if I can do anything else to help out, please let me know.
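For illustration, a minimal sketch of the kind of post-job check such a test
relies on: count the (key, value) pairs that come out of the reduce and compare
against the number the maps emitted. The class name CountOutputPairs, the
Text/IntWritable types, and the command-line arguments are assumptions made up
for this example, not code from the actual patch; it is written against the
long-standing SequenceFile.Reader API rather than the exact 0.11 sources.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Counts the (key, value) pairs in one reduce output file and compares the
 *  total against the number of pairs the maps are known to have emitted. */
public class CountOutputPairs {
  public static void main(String[] args) throws IOException {
    Path part = new Path(args[0]);            // e.g. output/part-00000
    long expected = Long.parseLong(args[1]);  // pairs emitted for this partition
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long seen = 0;
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    try {
      Text key = new Text();
      IntWritable value = new IntWritable();
      while (reader.next(key, value)) {
        seen++;                               // one pair made it through
      }
    } finally {
      reader.close();
    }

    if (seen != expected) {
      // This is the failure mode described above: pairs silently vanish
      // between map and reduce, and nothing else in the logs complains.
      throw new AssertionError("lost pairs: expected " + expected
          + " but found only " + seen);
    }
  }
}

On a healthy cluster the two counts match every run; under the behavior
reported here, seen intermittently comes up short once the key count is large.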
> map/reduce is corrupting data between map and reduce
> ----------------------------------------------------
>
>                 Key: HADOOP-1014
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1014
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.11.1
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.11.2
>
>
> It appears that random data corruption is happening between the map and the
> reduce. This looks to be a blocker until it is resolved. There were two
> relevant messages on hadoop-dev:
>
> from Mike Smith:
>
> Map/reduce jobs are not consistent in both the Hadoop 0.11 release and trunk
> when you rerun the same job. I have observed this inconsistency in the map
> output across different jobs. A simple test to double-check is to use Hadoop
> 0.11 with Nutch trunk.
>
> from Albert Chern:
>
> I am having the same problem with my own map/reduce jobs. I have a job which
> requires two pieces of data per key, and just as a sanity check I make sure
> the reducer gets both, but sometimes it doesn't (see the sketch after this
> message). What's even stranger is that the same tasks that complain about
> missing key/value pairs may fail two or three times, but then succeed on a
> subsequent try, which leads me to believe the bug has to do with
> randomization (I'm not sure, but I think the map outputs are shuffled?).
>
> All of my code works perfectly with 0.9, so I went back and just compared the
> sizes of the outputs. For some jobs, the outputs from 0.11 were consistently
> 4 bytes larger, probably due to changes in SequenceFile. But for others, the
> output sizes were all over the place. Some partitions were empty, some were
> correct, and some were missing data. There seems to be something seriously
> wrong with 0.11, so I suggest you use 0.9. I've been trying to pinpoint the
> bug, but its random nature is really annoying.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
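The per-key sanity check Albert Chern describes might look roughly like the
sketch below. It is written against the generified org.apache.hadoop.mapred
interfaces of later releases (the 0.11-era signatures took raw Writable types,
but the logic is the same); the class name PairedValueReducer and the Text
value type are placeholders, not his actual code.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** Expects exactly two values per key and fails loudly when the shuffle
 *  delivers fewer (or more), instead of silently producing bad output. */
public class PairedValueReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    int count = 0;
    StringBuilder joined = new StringBuilder();
    while (values.hasNext()) {
      // toString() copies the bytes, so the framework's reuse of the
      // value object between iterations is safe here.
      Text v = values.next();
      if (count > 0) {
        joined.append('\t');
      }
      joined.append(v.toString());
      count++;
    }
    if (count != 2) {
      // On a healthy run this never fires; under the bug reported here it
      // fires intermittently because pairs go missing before the reduce.
      throw new IOException("key " + key + " arrived with " + count
          + " value(s) instead of 2");
    }
    output.collect(key, new Text(joined.toString()));
  }
}

Throwing from the reducer makes the task attempt fail and get rescheduled,
which is also how the fail-two-or-three-times-then-succeed pattern he
mentions would show up in practice.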