hadoop-mapreduce-user mailing list archives

From Adam Phelps <...@opendns.com>
Subject Re: Duplicated entries with map job reading from HBase
Date Tue, 09 Nov 2010 18:29:38 GMT
That had been my initial thought; however, dumping the data from the hbase
shell found only single entries.  Furthermore, in the experiment I ran
yesterday (generating the data to a new table as well as the old one),
the entries being created should have been identical for each table.

The only thing I can think of here is that some metadata for the
original table has been messed up in such a way that it's being double
processed as input for the mapper, but I have no idea where to
look for that.

- Adam
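
One way to test the "double processed" idea is to look at the row-key ranges of the input splits (or of the table's regions) and check whether any two of them cover the same keys. TableInputFormat normally creates one split per region, so two mappers over data that lives in a single region would be consistent with two overlapping ranges. The sketch below is only an illustration under assumptions: the class name SplitOverlapCheck, the KeyRange type, and the sample ranges in main are hypothetical, and it does not call any HBase API. The start and end keys would have to be collected separately, for example from the job's split listing. It follows the HBase convention that an empty key means "unbounded".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitOverlapCheck {

    /** Half-open row-key range [start, end); an empty array means "unbounded". */
    public static final class KeyRange {
        public final String name;
        public final byte[] start;
        public final byte[] end;

        public KeyRange(String name, byte[] start, byte[] end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    /** Unsigned lexicographic comparison, the ordering HBase uses for row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    /** Two ranges overlap when each one starts before the other one ends. */
    public static boolean overlaps(KeyRange a, KeyRange b) {
        boolean aStartsBeforeBEnds = b.end.length == 0 || compare(a.start, b.end) < 0;
        boolean bStartsBeforeAEnds = a.end.length == 0 || compare(b.start, a.end) < 0;
        return aStartsBeforeBEnds && bStartsBeforeAEnds;
    }

    /** Returns one "x <-> y" entry for every pair of overlapping ranges. */
    public static List<String> findOverlaps(List<KeyRange> ranges) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < ranges.size(); i++) {
            for (int j = i + 1; j < ranges.size(); j++) {
                if (overlaps(ranges.get(i), ranges.get(j))) {
                    hits.add(ranges.get(i).name + " <-> " + ranges.get(j).name);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical data: two splits that both claim the whole table.
        byte[] unbounded = new byte[0];
        List<KeyRange> ranges = Arrays.asList(
                new KeyRange("split-0", unbounded, unbounded),
                new KeyRange("split-1", unbounded, unbounded));
        for (String hit : findOverlaps(ranges)) {
            System.out.println("overlap: " + hit);
        }
    }
}
```

If a check like this reports an overlap for the old table but not for the new one, that would point at the table's region metadata rather than at the job setup.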

On 11/9/10 4:20 AM, Biedermann,S.,Fa. Post Direkt wrote:
> Hi Adam,
> Is it possible that you have double entries in your old table (two entries for the same
> (column family, column, timestamp) tuple)?
> Sven
> -----Original Message-----
> From: Adam Phelps [mailto:amp@opendns.com]
> Sent: Tuesday, 9 November 2010 01:30
> To: mapreduce-user@hadoop.apache.org; user@hbase.apache.org
> Subject: Re: Duplicated entries with map job reading from HBase
> Ok, poked around at this a little more with a few experiments.
> The most interesting one is that I ran a couple of the jobs that generate this data
> in HBase, one for the existing table I had seen the problem on and one for a new table with
> the same configuration as the old one.
> When the analysis job is run reading from HBase the counts are only doubled against the
> older table, using the new table as input produces the correct results.
> When doing this I also noticed that when using the new table only a single mapper is
> created, however for the old table two mappers are created (I checked and the data comes from
> only a single region in either case).
> So something is causing each hbase entry to be passed to a mapper twice on the older
> table, but only once on the newer table.
> Anyone have further thoughts on this?  I'm basically at the end of my ideas on figuring
> this out.
> - Adam
> On 11/5/10 4:01 PM, Adam Phelps wrote:
>> Yeah, it wasn't the combiner. The repeated entries are actually seen
>> by the mapper, so before the combiner comes into play. Is there some
>> other info that would be useful in getting clues as to what is causing this?
>> - Adam
>> On 11/5/10 11:35 AM, Adam Phelps wrote:
>>> No, the system actually is much larger than two nodes. But the number
>>> of mappers used here tends to be fairly small (I suspect based on the
>>> HBase regions being accessed but usually more than two), I'll try
>>> turning off the combiner to see if that changes anything.
>>> Thanks
>>> - Adam
>>> On 11/5/10 9:23 AM, Niels Basjes wrote:
>>>> Hi,
>>>> I don't know the answer (simply not enough information in your
>>>> email) but I'm willing to make a guess:
>>>> You are running on a system with two processing nodes?
>>>> If so then try removing the Combiner. The combiner is a performance
>>>> optimization and the whole processing should work without it.
>>>> Sometimes there is a design fault in the processing and the
>>>> combiner disrupts the processing.
>>>> HTH
>>>> Niels Basjes
>>>> 2010/11/5 Adam Phelps <amp@opendns.com>
>>>> I've noticed an odd behavior with a map-reduce job I've written
>>>> which is reading data out of an HBase table. After a couple days of
>>>> poking at this I haven't been able to figure out the cause of the
>>>> problem, so I figured I'd ask on here.
>>>> (For reference I'm running with the cdh3b2 release)
>>>> The problem is that it seems that every line from the HBase table is
>>>> passed to the mappers twice, thus resulting in counts ending up as
>>>> exactly double what they should be.
>>>> I set up the job like this:
>>>> Scan scan = new Scan();
>>>> scan.addFamily(Bytes.toBytes(scanFamily));
>>>> TableMapReduceUtil.initTableMapperJob(table,
>>>>     scan,
>>>>     mapper,
>>>>     Text.class,
>>>>     LongWritable.class,
>>>>     job);
>>>> job.setCombinerClass(LongSumReducer.class);
>>>> job.setReducerClass(reducer);
>>>> I've set up counters in the mapper to verify what is happening, so
>>>> that I know for certain that the mapper is being called twice with
>>>> the same bit of data. I've also confirmed (using the hbase shell)
>>>> that each entry appears only once in the table.
>>>> Is there a known bug along these lines? If not, does anyone have any
>>>> thoughts on what might be causing this or where I'd start looking to
>>>> diagnose?
>>>> Thanks
>>>> - Adam
>>>> --
>>>> With kind regards,
>>>> Niels Basjes
