hadoop-mapreduce-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: Duplicated entries with map job reading from HBase
Date Sat, 06 Nov 2010 07:45:57 GMT
Hi,

The only thing I can think of right now is that you may be running into an
effect I encountered a while ago, which is described here:
https://issues.apache.org/jira/browse/MAPREDUCE-2094

This _should_ only occur when reading regular files. But perhaps you are
experiencing something similar.
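
If you want to rule that out, one quick check is to log the split each map
task is handed in setup(). This is only a rough sketch (the class name
SplitLoggingMapper and the "Debug"/"MAP_INPUT_ROWS" counter names are just
illustrative, not anything from your job), but if two task logs report the
same start/end row, the duplication is happening at the split level, much
like the linked issue:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    public class SplitLoggingMapper extends TableMapper<Text, LongWritable> {

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Assumes the job uses TableInputFormat, so the split is a
            // TableSplit. If two task attempts log the same start/end row,
            // the splits themselves are being handed out twice.
            TableSplit split = (TableSplit) context.getInputSplit();
            System.err.println("split start="
                    + Bytes.toStringBinary(split.getStartRow())
                    + " end=" + Bytes.toStringBinary(split.getEndRow()));
        }

        @Override
        protected void map(ImmutableBytesWritable row, Result value,
                           Context context)
                throws IOException, InterruptedException {
            // Count every row handed to the mapper so the total can be
            // compared with the number of rows in the table.
            context.getCounter("Debug", "MAP_INPUT_ROWS").increment(1);
            // ... normal per-row processing and context.write(...) here ...
        }
    }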

Niels


2010/11/6 Adam Phelps <amp@opendns.com>

> Yeah, it wasn't the combiner.  The repeated entries are actually seen by
> the mapper, so they appear before the combiner ever comes into play.  Is
> there some other info that would be useful in getting clues as to what is
> causing this?
>
> - Adam
>
>
> On 11/5/10 11:35 AM, Adam Phelps wrote:
>
>> No, the system is actually much larger than two nodes, but the number of
>> mappers used here tends to be fairly small (I suspect it is based on the
>> HBase regions being accessed, though it is usually more than two). I'll
>> try turning off the combiner to see if that changes anything.
>>
>> Thanks
>> - Adam
>>
>> On 11/5/10 9:23 AM, Niels Basjes wrote:
>>
>>> Hi,
>>>
>>> I don't know the answer (there simply isn't enough information in your
>>> email), but I'm willing to make a guess:
>>> Are you running on a system with two processing nodes?
>>> If so, try removing the Combiner. The combiner is a performance
>>> optimization, and the whole job should work without it.
>>> Sometimes there is a design flaw in the processing and the combiner
>>> disrupts it.
>>>
>>> HTH
>>>
>>> Niels Basjes
>>>
>>> 2010/11/5 Adam Phelps <amp@opendns.com <mailto:amp@opendns.com>>
>>>
>>> I've noticed an odd behavior with a map-reduce job I've written
>>> which reads data out of an HBase table. After a couple of days of
>>> poking at this I haven't been able to figure out the cause of the
>>> problem, so I figured I'd ask here.
>>>
>>> (For reference, I'm running the cdh3b2 release.)
>>>
>>> The problem is that every row from the HBase table seems to be
>>> passed to the mappers twice, resulting in counts that are exactly
>>> double what they should be.
>>>
>>> I set up the job like this:
>>>
>>> Scan scan = new Scan();
>>> scan.addFamily(Bytes.toBytes(scanFamily));
>>>
>>> TableMapReduceUtil.initTableMapperJob(table,
>>>                                       scan,
>>>                                       mapper,
>>>                                       Text.class,
>>>                                       LongWritable.class,
>>>                                       job);
>>> job.setCombinerClass(LongSumReducer.class);
>>>
>>> job.setReducerClass(reducer);
>>> I've set up counters in the mapper to verify what is happening, so
>>> that I know for certain that the mapper is being called twice with
>>> the same bit of data. I've also confirmed (using the hbase shell)
>>> that each entry appears only once in the table.
>>>
>>> Is there a known bug along these lines? If not, does anyone have
>>> any thoughts on what might be causing this, or where I should start
>>> looking to diagnose it?
>>>
>>> Thanks
>>> - Adam
>>>
>>>
>>>
>>>
>>> --
>>> Kind regards,
>>>
>>> Niels Basjes
>>>
>>
>>
>
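
P.S. Regarding the counter check you describe above: one way to compare
numbers directly would be to drop a diagnostic mapper like my sketch into
your existing driver and run it map-only. This reuses the table, scanFamily
and job variables from your snippet and the illustrative SplitLoggingMapper
class from above, so treat it as a sketch rather than a drop-in fix:

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes(scanFamily));
    TableMapReduceUtil.initTableMapperJob(table,
                                          scan,
                                          SplitLoggingMapper.class,
                                          Text.class,
                                          LongWritable.class,
                                          job);
    // Map-only diagnostic run: no combiner or reducer, discard the output.
    // (NullOutputFormat is org.apache.hadoop.mapreduce.lib.output.NullOutputFormat)
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

Afterwards you can compare the "Debug:MAP_INPUT_ROWS" counter, the
framework's "Map input records" counter and the row count from the HBase
shell; if the first two are exactly double the third, the rows really are
delivered to map() twice and the problem is upstream of your code.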


-- 
Kind regards,

Niels Basjes
