Subject: Re: GZ better than LZO?
From: Chris Tarnas
Date: Fri, 29 Jul 2011 09:48:45 -0700
To: user@hbase.apache.org

Your region distribution across the nodes is not great: in both cases most of your data is going to one server. Spreading the regions out across multiple servers would be best.

How many different vehicle_ids are being used, and are they all sequential integers in your tests? HBase performs better when not doing sequential inserts. You could try reversing the vehicle ids to get around that (see the many discussions on the list about using reversed timestamps as a rowkey).

Looking at your key construction, I would suggest, unless your app requires it, not left-padding your ids with zeros and instead using a delimiter between the key components. That will lead to smaller keys. If you use a tab as your delimiter, that character sorts before all other alphanumeric and punctuation characters (other than LF, CR, etc. - characters that should not be in your IDs anyway), so the keys will sort the same as left-padded ones.

I've had good luck with converting sequential numeric IDs to base 64 and then reversing them - that leads to very good key distribution across regions and shorter keys for any given number.
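Something like this is what I mean - just an untested sketch, and the alphabet here is arbitrary (any fixed 64-character set works). Emitting the low 6 bits first gives you the base-64 conversion and the reversal in one pass:

import org.apache.hadoop.hbase.util.Bytes;

public class ReversedBase64 {
    // Any fixed 64-character alphabet will do; this one is just an example.
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_";

    // Convert id to base 64, least-significant digit first. Emitting the
    // low digits first *is* the reversal, so sequential ids differ in
    // their leading character and spread across regions.
    public static byte[] reversedBase64Key(long id) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt((int) (id & 0x3F)));
            id >>>= 6;
        } while (id != 0);
        return Bytes.toBytes(sb.toString());
    }
}

For example, ids 1000 and 1001 come out as "eF" and "fF" - different leading characters, so consecutive inserts no longer hammer a single region.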
Another option - if you don't care whether your rowkeys are plain text - is to convert the IDs to binary numbers and then reverse the bytes; that would be the most compact. If you do that you would go back to not using delimiters and just have fixed offsets for each component.

Once you have a rowkey design you can then go ahead and create your tables pre-split with multiple empty regions. That should perform much better overall for inserts, especially when the DB is new and empty to start. (There is a quick sketch of both at the bottom of this mail.)

How did the load with 4 million records perform?

-chris

On Jul 29, 2011, at 12:36 AM, Steinmaurer Thomas wrote:

> Hi Chris!
>
> Your questions are somewhat hard to answer for me, because I'm not really
> in charge of the test cluster from an administration/setup POV.
>
> Basically, when running:
> http://xxx:60010/master.jsp
>
> I see 7 region servers, each with a "maxHeap" value of 995.
>
> When clicking on the different tables depending on the compression type,
> I get the following information:
>
> GZ compressed table: 3 regions hosted by one region server
> LZO compressed table: 8 regions hosted by two region servers, where the
> start region is hosted by one region server and all other 7 regions are
> hosted on the second region server
>
> Regarding the insert pattern etc., please have a look at my reply to
> Chiku, where I describe the test data generator, the table layout, etc.
> a bit.
>
> Thanks,
> Thomas
>
> -----Original Message-----
> From: Christopher Tarnas [mailto:cft@tarnas.org] On Behalf Of Chris
> Tarnas
> Sent: Thursday, July 28, 2011 19:43
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
>
> During the load did you add enough data to trigger a flush or compaction?
> In our cluster that amount of data inserted would not necessarily be
> enough to actually flush store files. Performance really depends on how
> the table's regions are laid out, the insert pattern, the number of
> regionservers and the amount of RAM allocated to each regionserver. If
> you don't see any flushes or compactions in the log, try repeating the
> test, then flushing the table and doing a compaction (or adding more data
> so it happens automatically), and timing everything. It would be
> interesting to see if the GZ benefit holds up.
>
> -chris
>
> On Jul 28, 2011, at 6:31 AM, Steinmaurer Thomas wrote:
>
>> Hello,
>>
>> we ran a test client generating data into a GZ and an LZO compressed
>> table, with equal data sets (number of rows: 1008000 and the same table
>> schema), ~7.78 GB of disk space uncompressed in HDFS. LZO is ~887 MB
>> whereas GZ is ~444 MB, so basically half of LZO.
>>
>> Execution time of the data-generating client was 1373 seconds into the
>> uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
>> generation client is based on HTablePool and uses batch operations.
>>
>> So in our (simple) test, GZ beats LZO in both disk usage and
>> execution time of the client. We haven't tried reads yet.
>>
>> Is this an expected result? I thought LZO is the recommended
>> compression algorithm? Or does LZO outperform GZ with a growing
>> amount of data or in read scenarios?
>>
>> Regards,
>>
>> Thomas
>
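PS: here is the sketch I promised above. It is untested, written against the 0.90-style client API, and the table name, column family and region count are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitDemo {

    // Fixed-width binary key component: an 8-byte id with the bytes
    // reversed, so the fast-moving low byte of a sequential id comes first.
    static byte[] reversedIdBytes(long id) {
        byte[] b = Bytes.toBytes(id);      // big-endian, 8 bytes
        for (int i = 0; i < 4; i++) {      // reverse in place
            byte t = b[i];
            b[i] = b[7 - i];
            b[7 - i] = t;
        }
        return b;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("vehicle_data");
        desc.addFamily(new HColumnDescriptor("d"));

        // Reversed keys make the first byte roughly uniform, so splitting
        // the first byte's range evenly gives evenly loaded regions.
        int numRegions = 16;
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        admin.createTable(desc, splits);
    }
}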