hbase-user mailing list archives

From yuzhih...@gmail.com
Subject Re: strange PerformanceEvaluation behaviour
Date Wed, 15 Feb 2012 10:50:07 GMT
Oliver:
Thanks for digging. 

Please file JIRAs for these issues.



On Feb 15, 2012, at 1:53 AM, "Oliver Meyn (GBIF)" <omeyn@gbif.org> wrote:

> On 2012-02-15, at 9:09 AM, Oliver Meyn (GBIF) wrote:
> 
>> On 2012-02-15, at 7:32 AM, Stack wrote:
>> 
>>> On Tue, Feb 14, 2012 at 8:14 AM, Stack <stack@duboce.net> wrote:
>>>>> 2) With that same randomWrite command line above, I would expect a resulting
>>>>> table with 10 * (1024 * 1024) rows (so 10485760, roughly 10M rows).  Instead what
>>>>> I'm seeing is that the randomWrite job reports writing that many rows (exactly) but
>>>>> running rowcounter against the table reveals only 6549899 rows.  A second attempt to
>>>>> build the table produces slightly different results (e.g. 6627689).  I see a similar
>>>>> discrepancy when using 50 instead of 10 clients (~35% smaller than expected).  Key
>>>>> collision could explain it, but it seems pretty unlikely (given I only need e.g. 10M
>>>>> keys from a potential 2B).
>>>>> 
>>>> 
>>> 
>>> I just tried it here and got a similar result.  I wonder if it's the
>>> randomWrite?  What if you do sequentialWrite, do you get our 10M?
>> 
>> Thanks for checking into this, Stack - when using sequentialWrite I get the expected
>> 10485760 rows.  I'll hack around a bit on the PE to count the number of collisions, and
>> try to think of a reasonable solution.
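
A collision count of this kind can be sketched as below. The key generator is a stand-in: it assumes each row key is a random non-negative int reduced modulo the total row count, which is an assumption made for illustration and not a copy of the PerformanceEvaluation source. The class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.Random;

public class CollisionCount {

    /**
     * Draws totalRows keys and returns how many distinct keys were produced.
     * Assumed key scheme (not taken from the PE source): a random
     * non-negative int reduced modulo totalRows.
     */
    static int countDistinct(int totalRows, long seed) {
        Random rand = new Random(seed);
        BitSet seen = new BitSet(totalRows);
        for (int i = 0; i < totalRows; i++) {
            int key = rand.nextInt(Integer.MAX_VALUE) % totalRows;
            seen.set(key); // setting an already-set bit is a collision
        }
        return seen.cardinality();
    }

    public static void main(String[] args) {
        int totalRows = 10 * 1024 * 1024; // 10 clients, 1024 * 1024 rows each
        int distinct = countDistinct(totalRows, 42L);
        int collisions = totalRows - distinct;
        System.out.printf("written=%d distinct=%d collisions=%d%n",
                totalRows, distinct, collisions);
    }
}
```

When as many keys are drawn as there are possible keys, about 1 - 1/e (roughly 63%) of them are expected to be distinct, which is in the same range as the ~6.5M to ~6.6M rows that rowcounter reports in this thread.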
> 
> So hacking around reveals that key collision is indeed the problem.  I thought the
> modulo part of the getRandomRow method was suspect, but while removing it improved the
> behaviour (I got ~8M rows instead of ~6.6M) it didn't fix it completely.  Since that's
> really what UUIDs are for, I gave that a shot (i.e. UUID.randomUUID()) and sure enough
> now I get the full 10M rows.  Those are 16-byte keys now though, instead of the 10-byte
> keys that the integers produced.  But because we're testing scan performance I think
> using a sequentially written table would probably be cheating, and so I will stick with
> randomWrite with slightly bigger keys.  That means it's a little harder to compare to
> the results that other people get, but at least I know my internal tests are apples to
> apples.
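
For reference, one way to turn UUID.randomUUID() into a 16-byte row key is sketched below. The message does not say how the UUID was encoded, so the big-endian packing of the two 64-bit halves and the class and method names here are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidRowKey {

    /** Packs the two 64-bit halves of a UUID into 16 bytes, most significant first. */
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    /** Returns a fresh random 16-byte row key. */
    static byte[] randomRowKey() {
        return toBytes(UUID.randomUUID());
    }

    public static void main(String[] args) {
        byte[] key = randomRowKey();
        System.out.println("row key length in bytes: " + key.length);
    }
}
```

The resulting byte array could then be used as the row key in place of the 10-byte integer-based key, at the cost of 6 extra bytes per row.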
> 
> Oh, and I removed the outer 10x loop and that produced the desired number of mappers
> (i.e. what I passed in on the command line) but made no difference in the key
> generation/collision story.
> 
> Should I file bugs for these 2 issues?
> 
> Thanks,
> Oliver
> 
