hbase-user mailing list archives

From "Oliver Meyn (GBIF)" <om...@gbif.org>
Subject Re: strange PerformanceEvaluation behaviour
Date Thu, 16 Feb 2012 09:37:54 GMT
On 2012-02-15, at 5:39 PM, Stack wrote:

> On Wed, Feb 15, 2012 at 1:53 AM, Oliver Meyn (GBIF) <omeyn@gbif.org> wrote:
>> So hacking around reveals that key collision is indeed the problem.  I
>> thought the modulo part of the getRandomRow method was suspect, but while
>> removing it improved the behaviour (I got ~8M rows instead of ~6.6M) it
>> didn't fix it completely.  Since that's really what UUIDs are for I gave
>> that a shot (i.e. UUID.randomUUID()) and sure enough now I get the full
>> 10M rows.  Those are 16-byte keys now though, instead of the 10-byte keys
>> that the integers produced.  But because we're testing scan performance I
>> think using a sequentially written table would probably be cheating, and
>> so I will stick with randomWrite with slightly bigger keys.  That means
>> it's a little harder to compare to the results that other people get, but
>> at least I know my internal tests are apples to apples.
>> Oh and I removed the outer 10x loop and that produced the desired number
>> of mappers (i.e. what I passed in on the command line) but made no
>> difference in the key generation/collision.
>> Should I file bugs for these 2 issues?
> Thanks Oliver for digging.
> Using UUIDs will make it tougher on the other end when reading?  How
> do you divide up the UUID space?  UUIDs are not well distributed
> across the possible key space IIUC.
> Should writing UUIDs be an option on PE?
> Thanks again for figuring it.
> St.Ack

Honestly I don't know very much about UUIDs, so I didn't consider their distribution over the
keyspace; I just used UUID.randomUUID() and more or less crossed my fingers :)  I agree that
they're a bit of a PITA when it comes to reading them, but I think having exactly the expected
number of rows written and read in the test makes the PE results easier to interpret and
therefore more useful.  So yes, I think UUIDs as a key option in PE would be good.  I left
some code in the JIRA as a starting point for a patch.
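
For anyone following along, here is a minimal standalone sketch of the collision effect discussed above. It is not the PE code and not the patch in the JIRA; the class and method names (KeyCollisionSketch, countDistinctIntKeys, countDistinctUuidKeys, uuidToKey) are made up for illustration. The assumption is that N keys are drawn uniformly at random from a keyspace of size N, in which case the expected number of distinct keys is about N * (1 - 1/e), i.e. roughly 63% of the writes. The UUID half only shows that random UUIDs packed into 16 bytes should not collide in practice; it says nothing about how PE would split the keyspace for reads.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.UUID;

// Hypothetical helper class, not part of HBase or PerformanceEvaluation.
class KeyCollisionSketch {

    // Draw `writes` random ints in [0, keySpace) and count the distinct keys.
    // Duplicate keys overwrite the same row, so this approximates the row count.
    static int countDistinctIntKeys(int writes, int keySpace, long seed) {
        Random rand = new Random(seed);
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < writes; i++) {
            seen.add(rand.nextInt(keySpace));
        }
        return seen.size();
    }

    // Generate `writes` random (version 4) UUIDs and count the distinct ones.
    static int countDistinctUuidKeys(int writes) {
        Set<UUID> seen = new HashSet<>();
        for (int i = 0; i < writes; i++) {
            seen.add(UUID.randomUUID());
        }
        return seen.size();
    }

    // Pack a UUID into a 16-byte row key (two big-endian longs).
    static byte[] uuidToKey(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    public static void main(String[] args) {
        int n = 1_000_000; // smaller than the 10M in the thread, to keep memory modest
        System.out.println("writes:             " + n);
        System.out.println("distinct int keys:  " + countDistinctIntKeys(n, n, 42L));
        System.out.println("distinct UUID keys: " + countDistinctUuidKeys(n));
        System.out.println("UUID key length:    " + uuidToKey(UUID.randomUUID()).length);
    }
}
```

On the distribution question: as far as I understand it, the leading bytes of a version 4 UUID are random, so keys built this way should spread fairly evenly over a byte-ordered keyspace, but I have not verified that against PE's region splits.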
