Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Apple Message framework v1084)
Subject: Re: strange PerformanceEvaluation behaviour
From: "Oliver Meyn (GBIF)" <omeyn@gbif.org>
In-Reply-To: <9301704E-0FA0-4D04-8267-6AC4C25BEF0F@gbif.org>
Date: Wed, 15 Feb 2012 10:53:40 +0100
Content-Transfer-Encoding: quoted-printable
Message-Id: <A0945339-9121-45D5-A123-38ED7CAA71D7@gbif.org>
References: <DC472608-AB20-4909-B40D-24B790B16B9E@gbif.org>
 <CADcMMgEdQmsoFPsdpwcO41cAua77EFO8Yfc85jvt_wg-+303ww@mail.gmail.com>
 <CADcMMgHWeRjOKW7kJDtinsP5PLQL4_V3rRfKAW8avx5XXPtp=w@mail.gmail.com>
 <9301704E-0FA0-4D04-8267-6AC4C25BEF0F@gbif.org>
To: user@hbase.apache.org

On 2012-02-15, at 9:09 AM, Oliver Meyn (GBIF) wrote:

> On 2012-02-15, at 7:32 AM, Stack wrote:
>
>> On Tue, Feb 14, 2012 at 8:14 AM, Stack <stack@duboce.net> wrote:
>>>> 2) With that same randomWrite command line above, I would expect a
>>>> resulting table with 10 * (1024 * 1024) rows (so 10485700 = roughly
>>>> 10M rows).  Instead what I'm seeing is that the randomWrite job reports
>>>> writing that many rows (exactly) but running rowcounter against the
>>>> table reveals only 6549899 rows.  A second attempt to build the table
>>>> produces slightly different results (e.g. 6627689).  I see a similar
>>>> discrepancy when using 50 instead of 10 clients (~35% smaller than
>>>> expected).  Key collision could explain it, but it seems pretty unlikely
>>>> (given I only need e.g. 10M keys from a potential 2B).
>>>>
>>>
>>
>> I just tried it here and got a similar result.  I wonder if it's the
>> randomWrite?  What if you do sequentialWrite, do you get our 10M?
>
> Thanks for checking into this stack - when using sequentialWrite I get
> the expected 10485700 rows.  I'll hack around a bit on the PE to count
> the number of collisions, and try to think of a reasonable solution.

So hacking around reveals that key collision is indeed the problem.  I
thought the modulo part of the getRandomRow method was suspect, but while
removing it improved the behaviour (I got ~8M rows instead of ~6.6M), it
didn't fix it completely.  Since unique keys are really what UUIDs are
for, I gave that a shot (i.e. UUID.randomUUID()) and sure enough I now get
the full 10M rows.  Those are 16-byte keys now though, instead of the
10-byte keys that the integers produced.  But because we're testing scan
performance, I think using a sequentially written table would probably be
cheating, so I will stick with randomWrite with slightly bigger keys.  That
means it's a little harder to compare to the results that other people
get, but at least I know my internal tests are apples to apples.
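
For illustration only, a minimal sketch of a UUID-based row key.  The
class and method names (UuidRowKey, randomRowKey) are hypothetical and
this is not the actual change made to PE; it just packs the two 64-bit
halves of a random UUID into a byte array, which is where the 16-byte
key size comes from.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper, not part of HBase or PerformanceEvaluation.
public class UuidRowKey {

    // Returns the 128 bits of a random (version 4) UUID as a 16-byte array.
    static byte[] randomRowKey() {
        UUID uuid = UUID.randomUUID();
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    public static void main(String[] args) {
        byte[] key = randomRowKey();
        System.out.println("row key length: " + key.length);
    }
}
```

Keys built this way are random rather than ordered, so writes still land
all over the key range, which is the point of sticking with randomWrite.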

Oh, and I removed the outer 10x loop and that produced the desired number
of mappers (i.e. what I passed in on the command line) but made no
difference in the key generation/collision story.
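
To put a rough number on the collision story: if n keys are drawn
uniformly at random from a keyspace of only n values (which is what a
modulo by the total row count would amount to; that reading of
getRandomRow is an assumption here), about 1 - 1/e, or roughly 63%, of
them come out distinct.  That is in the same range as the ~6.5M-6.6M rows
that rowcounter reported.  The sketch below is a standalone simulation,
not PE code: CollisionSketch and its methods are made-up names, and it
uses Random.nextInt(bound) rather than whatever getRandomRow really does.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical standalone simulation, not part of PerformanceEvaluation.
public class CollisionSketch {

    // Draws `draws` keys uniformly from [0, keySpace) and counts the distinct ones.
    static int countDistinct(int draws, int keySpace, long seed) {
        Random random = new Random(seed);
        BitSet seen = new BitSet(keySpace);
        for (int i = 0; i < draws; i++) {
            seen.set(random.nextInt(keySpace));
        }
        return seen.cardinality();
    }

    // Expected distinct keys: keySpace * (1 - (1 - 1/keySpace)^draws),
    // computed with log1p/expm1 to stay accurate for large keySpace.
    static double expectedDistinct(long draws, long keySpace) {
        return keySpace * -Math.expm1(draws * Math.log1p(-1.0 / keySpace));
    }

    public static void main(String[] args) {
        int rows = 10485700; // row count reported by the randomWrite job
        System.out.println("simulated distinct: " + countDistinct(rows, rows, 1L));
        System.out.println("expected distinct:  " + Math.round(expectedDistinct(rows, rows)));
    }
}
```

This only speaks to the modulo part; it does not explain why removing the
modulo still left collisions.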

Should I file bugs for these 2 issues?

Thanks,
Oliver