Message-ID: <4CD35684.4020504@opendns.com>
Date: Thu, 04 Nov 2010 17:57:40 -0700
From: Adam Phelps
To: mapreduce-user@hadoop.apache.org, user@hbase.apache.org
Subject: Duplicated entries with map job reading from HBase

I've noticed an odd behavior with a
map-reduce job I've written which reads data out of an HBase table. After a couple of days of poking at this I haven't been able to figure out the cause of the problem, so I figured I'd ask here. (For reference, I'm running the cdh3b2 release.)

The problem is that every row from the HBase table seems to be passed to the mappers twice, so the counts end up exactly double what they should be.

I set up the job like this:

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes(scanFamily));
    TableMapReduceUtil.initTableMapperJob(table, scan, mapper,
        Text.class, LongWritable.class, job);
    job.setCombinerClass(LongSumReducer.class);
    job.setReducerClass(reducer);

I've set up counters in the mapper to verify what is happening, so I know for certain that the mapper is being called twice with the same data. I've also confirmed (using the hbase shell) that each entry appears only once in the table.

Is there a known bug along these lines? If not, does anyone have any thoughts on what might be causing this, or where I should start looking to diagnose it?

Thanks
- Adam
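
For anyone trying to reproduce the check described above: here is a minimal, self-contained sketch of how duplicate deliveries could be tallied by row key, assuming the mapper logs each row key it receives. `DuplicateRowCheck` and its methods are hypothetical helpers written for illustration. They are not part of Hadoop or HBase, and this is not the code from the actual job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Hadoop/HBase class): tallies how often each row key is seen.
final class DuplicateRowCheck {

    // Count how many times each row key appears, preserving first-seen order.
    static Map<String, Integer> countDeliveries(List<String> rowKeys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : rowKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Return the row keys that were seen more than once.
    static List<String> duplicates(List<String> rowKeys) {
        List<String> dups = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countDeliveries(rowKeys).entrySet()) {
            if (e.getValue() > 1) {
                dups.add(e.getKey());
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // Made-up sample keys standing in for what a mapper might have logged.
        List<String> seen = Arrays.asList("row1", "row2", "row1", "row3", "row2");
        System.out.println(countDeliveries(seen));
        System.out.println(duplicates(seen));
    }
}
```

If every key comes back with a count of exactly two, that would point at the whole input being scanned twice rather than at individual rows being re-read.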