Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Reply-To: mapreduce-issues@hadoop.apache.org
Delivered-To: mailing list mapreduce-issues@hadoop.apache.org
Message-ID: <2491475.132421288317963931.JavaMail.jira@thor>
Date: Thu, 28 Oct 2010 22:06:03 -0400 (EDT)
From: "Hudson (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-577) Duplicate Mapper input when using StreamXmlRecordReader

    [ https://issues.apache.org/jira/browse/MAPREDUCE-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926075#action_12926075 ]

Hudson commented on MAPREDUCE-577:
----------------------------------

Integrated in
Hadoop-Mapreduce-trunk-Commit #523 (See [https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/523/])

> Duplicate Mapper input when using StreamXmlRecordReader
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-577
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-577
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/streaming
>         Environment: HADOOP 0.17.0, Java 6.0
>            Reporter: David Campbell
>            Assignee: Ravi Gummadi
>             Fix For: 0.22.0
>
>         Attachments: 0001-test-to-demonstrate-HADOOP-3484.patch, 0002-patch-for-HADOOP-3484.patch, 577.20S.patch, 577.patch, 577.v1.patch, 577.v2.patch, 577.v3.patch, 577.v4.patch, HADOOP-3484.combined.patch, HADOOP-3484.try3.patch
>
>
> I have an XML file with 93626 rows. A row is marked by ....
> I've confirmed this with grep and the Grep example program included with HADOOP.
> Here is the grep example output. 93626
> I've set up my job configuration as follows:
>
> conf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader");
> conf.set("stream.recordreader.begin", "");
> conf.set("stream.recordreader.end", "");
> conf.setInputFormat(StreamInputFormat.class);
>
> I have a fairly simple test Mapper. Here's the map method:
>
> public void map(Text key, Text value, OutputCollector output, Reporter reporter) throws IOException {
>     try {
>         output.collect(totalWord, one);
>         if (key != null && key.toString().indexOf("01852") != -1) {
>             output.collect(new Text("01852"), one);
>         }
>     } catch (Exception ex) {
>         Logger.getLogger(TestMapper.class.getName()).log(Level.SEVERE, null, ex);
>         System.out.println(value);
>     }
> }
>
> For totalWord ("TOTAL"), I get:
> TOTAL 140850
> and for 01852 I get:
> 01852 86
> There are 43 instances of 01852 in the file.
> I have the following setting in my config:
> conf.setNumMapTasks(1);
> I have a total of six machines in my cluster.
> If I run without this, the result is 12x the actual value, not 2x.
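For context on the description above: StreamXmlRecordReader emits each span of input between the configured stream.recordreader.begin and stream.recordreader.end patterns as one record, and the bug is that records near split boundaries get emitted more than once. The actual marker strings were lost from this archived message, so the standalone sketch below uses hypothetical <row>/</row> markers; it is not the Hadoop implementation, only a minimal illustration of the begin/end scanning that defines what "one record" means here.

```java
// Minimal standalone sketch of begin/end record scanning, assuming
// hypothetical <row>/</row> markers (the real values were elided above).
public class RecordScan {

    // Count non-overlapping begin...end spans, the way a record reader
    // conceptually carves records out of its portion of the input.
    static int countRecords(String input, String begin, String end) {
        int count = 0;
        int pos = 0;
        while (true) {
            int b = input.indexOf(begin, pos);
            if (b < 0) break;                      // no more records
            int e = input.indexOf(end, b + begin.length());
            if (e < 0) break;                      // unterminated record: stop
            count++;
            pos = e + end.length();                // resume after the end marker
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<row>a</row><row>b</row><row>c</row>";
        System.out.println(countRecords(xml, "<row>", "</row>")); // 3
    }
}
```

With a correct reader, each record is counted exactly once; the duplication the reporter sees corresponds to the same span being scanned from more than one split.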
> Here's some info from the cluster web page:
>
> Maps  Reduces  Total Submissions  Nodes  Map Task Capacity  Reduce Task Capacity  Avg. Tasks/Node
> 0     0        1                  6      12                 12                    4.00
>
> I've also noticed something really strange in the job's output. It looks like it's starting over or redoing things.
> This was run using all six nodes and no limitations on map or reduce tasks. I haven't seen this behavior in any other case.
>
> 08/06/03 10:50:35 INFO mapred.FileInputFormat: Total input paths to process : 1
> 08/06/03 10:50:36 INFO mapred.JobClient: Running job: job_200806030916_0018
> 08/06/03 10:50:37 INFO mapred.JobClient: map 0% reduce 0%
> 08/06/03 10:50:42 INFO mapred.JobClient: map 2% reduce 0%
> 08/06/03 10:50:45 INFO mapred.JobClient: map 12% reduce 0%
> 08/06/03 10:50:47 INFO mapred.JobClient: map 31% reduce 0%
> 08/06/03 10:50:48 INFO mapred.JobClient: map 49% reduce 0%
> 08/06/03 10:50:49 INFO mapred.JobClient: map 68% reduce 0%
> 08/06/03 10:50:50 INFO mapred.JobClient: map 100% reduce 0%
> 08/06/03 10:50:54 INFO mapred.JobClient: map 87% reduce 0%
> 08/06/03 10:50:55 INFO mapred.JobClient: map 100% reduce 0%
> 08/06/03 10:50:56 INFO mapred.JobClient: map 0% reduce 0%
> 08/06/03 10:51:00 INFO mapred.JobClient: map 0% reduce 1%
> 08/06/03 10:51:05 INFO mapred.JobClient: map 28% reduce 2%
> 08/06/03 10:51:07 INFO mapred.JobClient: map 80% reduce 4%
> 08/06/03 10:51:08 INFO mapred.JobClient: map 100% reduce 4%
> 08/06/03 10:51:09 INFO mapred.JobClient: map 100% reduce 7%
> 08/06/03 10:51:10 INFO mapred.JobClient: map 90% reduce 9%
> 08/06/03 10:51:11 INFO mapred.JobClient: map 100% reduce 9%
> 08/06/03 10:51:12 INFO mapred.JobClient: map 100% reduce 11%
> 08/06/03 10:51:13 INFO mapred.JobClient: map 90% reduce 11%
> 08/06/03 10:51:14 INFO mapred.JobClient: map 97% reduce 11%
> 08/06/03 10:51:15 INFO mapred.JobClient: map 63% reduce 11%
> 08/06/03 10:51:16 INFO mapred.JobClient: map 48% reduce 11%
> 08/06/03 10:51:17 INFO mapred.JobClient: map 21% reduce 11%
> 08/06/03 10:51:19 INFO mapred.JobClient: map 0% reduce 11%
> 08/06/03 10:51:20 INFO mapred.JobClient: map 15% reduce 12%
> 08/06/03 10:51:21 INFO mapred.JobClient: map 27% reduce 13%
> 08/06/03 10:51:22 INFO mapred.JobClient: map 67% reduce 13%
> 08/06/03 10:51:24 INFO mapred.JobClient: map 22% reduce 16%
> 08/06/03 10:51:25 INFO mapred.JobClient: map 46% reduce 16%
> 08/06/03 10:51:26 INFO mapred.JobClient: map 70% reduce 16%
> 08/06/03 10:51:27 INFO mapred.JobClient: map 73% reduce 18%
> 08/06/03 10:51:28 INFO mapred.JobClient: map 85% reduce 19%
> 08/06/03 10:51:29 INFO mapred.JobClient: map 7% reduce 19%
> 08/06/03 10:51:32 INFO mapred.JobClient: map 100% reduce 20%
> 08/06/03 10:51:35 INFO mapred.JobClient: map 100% reduce 22%
> 08/06/03 10:51:37 INFO mapred.JobClient: map 100% reduce 23%
> 08/06/03 10:51:38 INFO mapred.JobClient: map 100% reduce 46%
> 08/06/03 10:51:39 INFO mapred.JobClient: map 100% reduce 58%
> 08/06/03 10:51:40 INFO mapred.JobClient: map 100% reduce 80%
> 08/06/03 10:51:42 INFO mapred.JobClient: map 100% reduce 90%
> 08/06/03 10:51:43 INFO mapred.JobClient: map 100% reduce 100%
> 08/06/03 10:51:44 INFO mapred.JobClient: Job complete: job_200806030916_0018
> 08/06/03 10:51:44 INFO mapred.JobClient: Counters: 17
> 08/06/03 10:51:44 INFO mapred.JobClient:   File Systems
> 08/06/03 10:51:44 INFO mapred.JobClient:     Local bytes read=1705
> 08/06/03 10:51:44 INFO mapred.JobClient:     Local bytes written=29782
> 08/06/03 10:51:44 INFO mapred.JobClient:     HDFS bytes read=1366064660
> 08/06/03 10:51:44 INFO mapred.JobClient:     HDFS bytes written=23
> 08/06/03 10:51:44 INFO mapred.JobClient:   Job Counters
> 08/06/03 10:51:44 INFO mapred.JobClient:     Launched map tasks=37
> 08/06/03 10:51:44 INFO mapred.JobClient:     Launched reduce tasks=10
> 08/06/03 10:51:44 INFO mapred.JobClient:     Data-local map tasks=22
> 08/06/03 10:51:44 INFO mapred.JobClient:     Rack-local map tasks=15
> 08/06/03 10:51:44 INFO mapred.JobClient:   Map-Reduce Framework
> 08/06/03 10:51:44 INFO mapred.JobClient:     Map input records=942105
> 08/06/03 10:51:44 INFO mapred.JobClient:     Map output records=942621
> 08/06/03 10:51:44 INFO mapred.JobClient:     Map input bytes=1365761556
> 08/06/03 10:51:44 INFO mapred.JobClient:     Map output bytes=9426210
> 08/06/03 10:51:44 INFO mapred.JobClient:     Combine input records=942621
> 08/06/03 10:51:44 INFO mapred.JobClient:     Combine output records=49
> 08/06/03 10:51:44 INFO mapred.JobClient:     Reduce input groups=2
> 08/06/03 10:51:44 INFO mapred.JobClient:     Reduce input records=49
> 08/06/03 10:51:44 INFO mapred.JobClient:     Reduce output records=2

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
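The counters above are consistent with the duplication the reporter describes. Since the quoted map method emits one TOTAL record per input record plus one extra record per "01852" match, the difference between Map output records and Map input records should equal the number of "01852" matches seen; a quick arithmetic check (values copied from the log, interpretation as just stated):

```java
// Arithmetic check on the job counters quoted above.
public class CounterCheck {
    public static void main(String[] args) {
        long mapInput    = 942105L; // Map input records (from the log)
        long mapOutput   = 942621L; // Map output records (from the log)
        long actualRows  = 93626L;  // row count the reporter confirmed with grep
        long actual01852 = 43L;     // instances of "01852" in the file

        // The mapper emits one extra record per "01852" match.
        long hits01852 = mapOutput - mapInput;
        System.out.println(hits01852);                               // 516
        System.out.println(hits01852 / actual01852);                 // 12: matches the reported 12x
        System.out.printf("%.2f%n", (double) mapInput / actualRows); // 10.06
    }
}
```

So the "01852" matches were seen exactly 12 times each, matching the reported 12x, while total input records are only about 10x the true row count, suggesting the duplication is uneven across splits.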