hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Date Mon, 01 Mar 2010 00:24:04 GMT
Oh, sorry, I was looking at the trunk code (as usual), which has the
compression option and many other features. It's not in the 0.20 branch.

J-D

On Sun, Feb 28, 2010 at 4:20 PM, Dan Washusen <dan@reactive.org> wrote:
> My (very rough) calculation of the data size came up with around 50MB.  That
> was assuming 400 bytes * 100,000 for the values, 32 + 8 * 13 * 100,000 for
> the keys and an extra meg or two for extra key stuff.  I didn't understand
> how that resulted in a region split, so I assume we are still missing
> some information (or I made a mistake).  As you mention, that should mean
> that everything is in the MemStore and compression has not come into play
> yet.  Puzzling...
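>
> Spelled out, that estimate is roughly (a back-of-envelope sketch; the exact
> split of the key overhead is an assumption):
>
> long values = 400L * 100000;            // ~40 MB of cell values
> long keys   = (32L + 8L * 13) * 100000; // ~13.6 MB: a 32-byte row key plus 8 bytes x 13 cells per row
> long total  = values + keys;            // ~53.6 MB, i.e. "around 50MB"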
>
> On PE: there isn't currently a way to specify compression options on the
> test table without extending PE and overriding the
> org.apache.hadoop.hbase.PerformanceEvaluation#getTableDescriptor method.
> Maybe it could be added as an option?
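>
> Something along these lines might work as a starting point (untested, and
> assuming the 0.20-era PE internals; the subclass name is made up):
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.HColumnDescriptor;
> import org.apache.hadoop.hbase.HTableDescriptor;
> import org.apache.hadoop.hbase.PerformanceEvaluation;
> import org.apache.hadoop.hbase.io.hfile.Compression;
>
> // Hypothetical sketch: force a compression type on every family of the
> // descriptor PE uses for its test table.
> public class CompressedPE extends PerformanceEvaluation {
>   public CompressedPE(HBaseConfiguration conf) {
>     super(conf);
>   }
>
>   @Override
>   protected HTableDescriptor getTableDescriptor() {
>     HTableDescriptor desc = super.getTableDescriptor();
>     for (HColumnDescriptor family : desc.getFamilies()) {
>       family.setCompressionType(Compression.Algorithm.LZO); // or GZ / NONE
>     }
>     return desc;
>   }
> }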
>
> Cheers,
> Dan
>
> On 1 March 2010 10:56, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
>> As Dan said, your data is so small that you don't really trigger many
>> different behaviors in HBase; it could very well be kept mostly in the
>> memstores, where compression has no impact at all.
>>
>> WRT a benchmark, there's the PerformanceEvaluation (we call it PE for
>> short) which is well maintained and lets you set a compression level.
>> This page's help text is outdated, but it shows you how to run it:
>> http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
>>
>> Another option is importing the wikipedia dump, which is highly
>> compressible and not manufactured like the PE. Last summer I wrote a
>> small MR job to do the import easily, and although the code is based on
>> a dev version of 0.20.0, it should be fairly easy to make it work on
>> 0.20.3 (probably just a matter of replacing the libs). See
>> http://code.google.com/p/hbase-wikipedia-loader/
>>
>> See the last paragraph of the Getting Started section in the wiki, where I
>> show some import numbers:
>>
>> "For example, it took 29 min on a 6 nodes cluster (1 master and 5
>> region servers) with the same hardware (AMD Phenom(tm) 9550 Quad, 8GB,
>> 2x1TB disks), 2 map slot per task tracker (that's 10 parallel maps),
>> and GZ compression. With LZO and a new table it took 23 min 20 ses.
>> Compressed the table is 32 regions big, uncompressed it's 93 and took
>> 30 min 10 sec to import."
>>
>> You can see that the import was a lot faster with LZO. I didn't do any
>> read tests, though...
>>
>> Good luck!
>>
>> J-D
>>
>> On Sun, Feb 28, 2010 at 9:30 AM, Vincent Barat <vincent.barat@ubikod.com>
>> wrote:
>> > The impact of my cluster architecture on performance is obviously the
>> > same in my 3 test cases. Given that I only change the compression type
>> > between tests, I don't understand why changing the number of regions or
>> > anything else would change the speed ratio between my tests, especially
>> > between the GZIP & LZO tests.
>> >
>> > Are there any ready-to-use, easy-to-set-up benchmarks I could use to try
>> > to reproduce the issue in a well-known environment?
>> >
>> > Le 25/02/10 19:29, Jean-Daniel Cryans a écrit :
>> >>
>> >> With only 1 region, providing more than one node will probably just
>> >> slow down the test, since the load is handled by one machine which has
>> >> to replicate blocks 2 times. I think your test would have much more
>> >> value if you grew it to at least 10 regions. Also make sure to run
>> >> the tests more than once on completely new HBase setups (drop table +
>> >> restart should be enough).
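>> >>
>> >> A minimal sketch of that reset step, assuming the 0.20-era client API
>> >> and a hypothetical table name:
>> >>
>> >> HBaseAdmin admin = new HBaseAdmin(config); // config is an HBaseConfiguration
>> >> admin.disableTable("testtable");           // a table must be disabled before deletion
>> >> admin.deleteTable("testtable");
>> >> // ...then recreate the table (and restart HBase) before the next run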
>> >>
>> >> May I also recommend upgrading to HBase 0.20.3? It will provide a
>> >> better experience in general.
>> >>
>> >> J-D
>> >>
>> >> On Thu, Feb 25, 2010 at 2:49 AM, Vincent Barat <vincent.barat@ubikod.com>
>> >> wrote:
>> >>>
>> >>> Unfortunately I can post only some snapshots.
>> >>>
>> >>> I have no region splits (I insert just 100,000 rows, so there is no split,
>> >>> except when I don't use compression).
>> >>>
>> >>> I use HBase 0.20.2, and to insert I use HTable.put(List<Put>).
>> >>>
>> >>> The only difference between my 3 tests is the way I create the test
>> >>> table:
>> >>>
>> >>> HBaseAdmin admin = new HBaseAdmin(config);
>> >>>
>> >>> HTableDescriptor desc = new HTableDescriptor(name);
>> >>>
>> >>> HColumnDescriptor colDesc;
>> >>>
>> >>> colDesc = new HColumnDescriptor(Bytes.toBytes("meta:"));
>> >>> colDesc.setMaxVersions(1);
>> >>> colDesc.setCompressionType(Algorithm.GZ); // <- Algorithm.LZO or Algorithm.NONE
>> >>> desc.addFamily(colDesc);
>> >>>
>> >>> colDesc = new HColumnDescriptor(Bytes.toBytes("data:"));
>> >>> colDesc.setMaxVersions(1);
>> >>> colDesc.setCompressionType(Algorithm.GZ); // <- Algorithm.LZO or Algorithm.NONE
>> >>> desc.addFamily(colDesc);
>> >>>
>> >>> admin.createTable(desc);
>> >>>
>> >>> A typical inserted row is made of 13 columns, each with short content, as
>> >>> shown here:
>> >>>
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=data:accuracy, timestamp=1267006115356, value=1317
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=data:alt, timestamp=1267006115356, value=0
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=data:country, timestamp=1267006115356, value=France
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=data:countrycode, timestamp=1267006115356, value=FR
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=data:lat, timestamp=1267006115356, value=48.65869706
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=data:locality, timestamp=1267006115356, value=Morsang-sur-Orge
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=data:lon, timestamp=1267006115356, value=2.36138182
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=data:postalcode, timestamp=1267006115356, value=91390
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=data:region, timestamp=1267006115356, value=Ile-de-France
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=meta:imei, timestamp=1267006115356, value=6ffc3fe659023a3c9cfed0a50a9f199ed42f2730
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=meta:infoid, timestamp=1267006115356, value=ca30781e0c375a1236afbf323cbfa40dc2c7c7af
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=meta:locid, timestamp=1267006115356, value=5e15a0281e83cfe55ec1c362f84a39f006f18128
>> >>> 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730 column=meta:timestamp, timestamp=1267006115356, value=1264761195240
>> >>>
>> >>> Maybe LZO works much better with fewer rows and bigger values?
>> >>>
>> >>> Le 24/02/10 19:10, Jean-Daniel Cryans a écrit :
>> >>>>
>> >>>> Are you able to post the code used for the insertion? It could be
>> >>>> something with your usage pattern or something wrong with the code
>> >>>> itself.
>> >>>>
>> >>>> How many rows are you inserting? Do you even have some region splits?
>> >>>>
>> >>>> J-D
>> >>>>
>> >>>> On Wed, Feb 24, 2010 at 1:52 AM, Vincent Barat <vincent.barat@ubikod.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> Yes of course.
>> >>>>>
>> >>>>> We use a 4-machine cluster (4 large instances on AWS): 8 GB RAM each,
>> >>>>> dual core CPU. 1 is for the Hadoop and HBase namenode / masters, and 3
>> >>>>> are hosting the datanodes / regionservers.
>> >>>>>
>> >>>>> The table used for testing is first created, then I sequentially insert
>> >>>>> a set of rows and count the number of rows inserted per second.
>> >>>>>
>> >>>>> I insert rows in batches of 1000 (using HTable.put(List<Put>)).
>> >>>>>
>> >>>>> When reading, I also read sequentially, using a scanner (scanner
>> >>>>> caching is set to 1024 rows).
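>> >>>>>
>> >>>>> A minimal sketch of both loops, assuming the 0.20-era client API (the
>> >>>>> table name, row keys and values here are placeholders):
>> >>>>>
>> >>>>> HTable table = new HTable(config, "testtable");
>> >>>>> List<Put> batch = new ArrayList<Put>(1000);
>> >>>>> for (int i = 0; i < 100000; i++) {
>> >>>>>   Put put = new Put(Bytes.toBytes("row-" + i));
>> >>>>>   put.add(Bytes.toBytes("data"), Bytes.toBytes("lat"), Bytes.toBytes("48.65869706"));
>> >>>>>   batch.add(put);
>> >>>>>   if (batch.size() == 1000) { table.put(batch); batch.clear(); } // one batched RPC
>> >>>>> }
>> >>>>> table.flushCommits(); // push out any buffered writes
>> >>>>>
>> >>>>> Scan scan = new Scan();
>> >>>>> scan.setCaching(1024); // fetch 1024 rows per scanner round-trip
>> >>>>> ResultScanner scanner = table.getScanner(scan);
>> >>>>> for (Result result : scanner) { /* count rows read per second */ }
>> >>>>> scanner.close();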
>> >>>>>
>> >>>>> Maybe our installation of LZO is not good?
>> >>>>>
>> >>>>>
>> >>>>> Le 23/02/10 22:15, Jean-Daniel Cryans a écrit :
>> >>>>>>
>> >>>>>> Vincent,
>> >>>>>>
>> >>>>>> I don't expect that either. Can you give us more info about your
>> >>>>>> test environment?
>> >>>>>>
>> >>>>>> Thx,
>> >>>>>>
>> >>>>>> J-D
>> >>>>>>
>> >>>>>> On Tue, Feb 23, 2010 at 10:39 AM, Vincent Barat
>> >>>>>> <vincent.barat@ubikod.com>      wrote:
>> >>>>>>>
>> >>>>>>> Hello,
>> >>>>>>>
>> >>>>>>> I did some testing to figure out which compression algo I should use
>> >>>>>>> for my HBase tables. I thought that LZO was the best candidate, but it
>> >>>>>>> appears that it is the worst one.
>> >>>>>>>
>> >>>>>>> I use one table with 2 families and 10 columns. Each row has a total
>> >>>>>>> of 200 to 400 bytes.
>> >>>>>>>
>> >>>>>>> Here are my results:
>> >>>>>>>
>> >>>>>>> GZIP:           2600 to 3200 inserts/s   12000 to 15000 reads/s
>> >>>>>>> NO COMPRESSION: 2000 to 2600 inserts/s   4900 to 5020 reads/s
>> >>>>>>> LZO:            1600 to 2100 inserts/s   4020 to 4600 reads/s
>> >>>>>>>
>> >>>>>>> Do you have an explanation for this? I thought that LZO compression
>> >>>>>>> was always faster at compression and decompression than GZIP?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
