hbase-user mailing list archives

From Dan Washusen <...@reactive.org>
Subject Re: LZO vs GZIP vs NO COMPRESSION: why is GZIP the winner ???
Date Sun, 28 Feb 2010 21:46:31 GMT
A couple of questions:

   - What's your block cache hit ratio when running each of those tests?
   - How large are the store files in each of the tests?
      - What's the compression ratio between None, LZO and GZIP? (one way to
        check is to compare on-disk table sizes; see the sketch after this list)
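
One rough way to answer the store-file-size and compression-ratio questions is to
compare the on-disk footprint of the three test tables directly in HDFS. The sketch
below is not from the thread; it assumes the 0.20-era layout where each table's data
lives under /hbase/<tableName>, and the table names are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TableSizeCheck {
  public static void main(String[] args) throws Exception {
    // Expects fs.default.name to point at the cluster's namenode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Placeholder table names for the three compression settings.
    for (String table : new String[] { "test_none", "test_lzo", "test_gz" }) {
      long bytes = fs.getContentSummary(new Path("/hbase/" + table)).getLength();
      System.out.println(table + ": " + bytes + " bytes on disk");
    }
  }
}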

Does the test match your expected usage scenario? Do you intend to serve
all data from the block cache, or will there be a lot more data in real life?
With such a small dataset you are probably not seeing the full benefits of
the bullet points mentioned on the HBase + LZO page, because all the data
resides in memory on the region server...

Here are the points from http://wiki.apache.org/hadoop/UsingLzoCompression:

>    - Compression reduces the number of bytes written to/read from HDFS
>    - Compression effectively improves the efficiency of network bandwidth
>      and disk space
>    - Compression reduces the size of data needed to be read when issuing a
>      read

It's puzzling that GZIP is faster than no compression in your tests...


On 1 March 2010 04:30, Vincent Barat <vincent.barat@ubikod.com> wrote:

> The impact of my cluster architecture on the performance is obviously the
> same in my 3 test cases. Provided that I only change the compression type
> between tests, I don't understand why changing the number of regions or
> anything else would change the speed ratio between my tests, especially
> between the GZIP & LZO tests.
>
> Are there any ready-to-use and easy-to-set-up benchmarks I could use to try
> to reproduce the issue in a well-known environment?
>
> On 25/02/10 19:29, Jean-Daniel Cryans wrote:
>
>  If there is only 1 region, providing more than one node will probably just
>> slow down the test, since the load is handled by one machine which has
>> to replicate blocks 2 times. I think your test would have much more
>> value if you really grew to at least 10 regions. Also make sure to run
>> the tests more than once on completely new hbase setups (drop table +
>> restart should be enough; see the sketch after this message).
>>
>> May I also recommend upgrading to hbase 0.20.3? It will provide a
>> better experience in general.
>>
>> J-D
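
A minimal sketch of the "drop table + restart" step between runs, assuming the
0.20-era HBaseAdmin API; the table name is a placeholder, not from the thread:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DropTestTable {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    // A table must be disabled before it can be deleted.
    admin.disableTable("compression_test");  // placeholder table name
    admin.deleteTable("compression_test");
    // ...then restart HBase before recreating the table for the next run.
  }
}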
>>
>> On Thu, Feb 25, 2010 at 2:49 AM, Vincent Barat <vincent.barat@ubikod.com> wrote:
>>
>>> Unfortunately I can post only some snapshots.
>>>
>>> I have no region splits (I insert just 100000 rows, so there is no split,
>>> except when I don't use compression).
>>>
>>> I use HBase 0.20.2, and to insert I use HTable.put(List<Put>).
>>>
>>> The only difference between my 3 tests is the way I create the test
>>> table:
>>>
>>> HBaseAdmin admin = new HBaseAdmin(config);
>>>
>>> HTableDescriptor desc = new HTableDescriptor(name);
>>>
>>> HColumnDescriptor colDesc;
>>>
>>> colDesc = new HColumnDescriptor(Bytes.toBytes("meta:"));
>>> colDesc.setMaxVersions(1);
>>> colDesc.setCompressionType(Algorithm.GZ); // or Algorithm.LZO / Algorithm.NONE
>>> desc.addFamily(colDesc);
>>>
>>> colDesc = new HColumnDescriptor(Bytes.toBytes("data:"));
>>> colDesc.setMaxVersions(1);
>>> colDesc.setCompressionType(Algorithm.GZ); // or Algorithm.LZO / Algorithm.NONE
>>> desc.addFamily(colDesc);
>>>
>>> admin.createTable(desc);
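
For reference, a self-contained version of the snippet above might look like the
sketch below. It is a reconstruction under assumptions (0.20-era API, with
Compression.Algorithm imported from org.apache.hadoop.hbase.io.hfile, and a
placeholder table name), not a verbatim copy of the poster's code:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression.Algorithm;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTestTable {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

    HTableDescriptor desc = new HTableDescriptor("compression_test"); // placeholder name

    // "meta" family: single version, compression set per test run.
    HColumnDescriptor metaDesc = new HColumnDescriptor(Bytes.toBytes("meta:"));
    metaDesc.setMaxVersions(1);
    metaDesc.setCompressionType(Algorithm.GZ); // or Algorithm.LZO / Algorithm.NONE
    desc.addFamily(metaDesc);

    // "data" family: same settings.
    HColumnDescriptor dataDesc = new HColumnDescriptor(Bytes.toBytes("data:"));
    dataDesc.setMaxVersions(1);
    dataDesc.setCompressionType(Algorithm.GZ); // or Algorithm.LZO / Algorithm.NONE
    desc.addFamily(dataDesc);

    admin.createTable(desc);
  }
}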
>>>
>>> A typical row inserted is made of 13 columns with short content, as shown
>>> here:
>>>
>>> Row key: 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730
>>> (all cells have timestamp=1267006115356)
>>>
>>>  data:accuracy    = 1317
>>>  data:alt         = 0
>>>  data:country     = France
>>>  data:countrycode = FR
>>>  data:lat         = 48.65869706
>>>  data:locality    = Morsang-sur-Orge
>>>  data:lon         = 2.36138182
>>>  data:postalcode  = 91390
>>>  data:region      = Ile-de-France
>>>  meta:imei        = 6ffc3fe659023a3c9cfed0a50a9f199ed42f2730
>>>  meta:infoid      = ca30781e0c375a1236afbf323cbfa40dc2c7c7af
>>>  meta:locid       = 5e15a0281e83cfe55ec1c362f84a39f006f18128
>>>  meta:timestamp   = 1264761195240
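
For illustration only, here is a sketch of how one such row could be written with
the 0.20-era client API; the table name is a placeholder, only a few of the 13
columns are shown, and this is not the poster's actual insertion code:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class InsertSampleRow {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "compression_test"); // placeholder name

    // Row key and values taken from the sample row above.
    Put put = new Put(Bytes.toBytes("1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730"));
    put.add(Bytes.toBytes("data"), Bytes.toBytes("country"), Bytes.toBytes("France"));
    put.add(Bytes.toBytes("data"), Bytes.toBytes("lat"), Bytes.toBytes("48.65869706"));
    put.add(Bytes.toBytes("data"), Bytes.toBytes("lon"), Bytes.toBytes("2.36138182"));
    // ...the remaining data:* and meta:* columns are added the same way...
    put.add(Bytes.toBytes("meta"), Bytes.toBytes("timestamp"), Bytes.toBytes("1264761195240"));

    List<Put> batch = new ArrayList<Put>();
    batch.add(put);
    table.put(batch); // HTable.put(List<Put>), as used in the tests
    table.flushCommits();
  }
}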
>>>
>>> Maybe LZO works much better with fewer rows that have bigger content?
>>>
>>> On 24/02/10 19:10, Jean-Daniel Cryans wrote:
>>>
>>>>
>>>> Are you able to post the code used for the insertion? It could be
>>>> something with your usage pattern or something wrong with the code
>>>> itself.
>>>>
>>>> How many rows are you inserting? Do you even have some region splits?
>>>>
>>>> J-D
>>>>
>>>> On Wed, Feb 24, 2010 at 1:52 AM, Vincent Barat <vincent.barat@ubikod.com> wrote:
>>>>
>>>>>
>>>>> Yes of course.
>>>>>
>>>>> We use a 4-machine cluster (4 large instances on AWS): 8 GB RAM each,
>>>>> dual core CPU. One hosts the Hadoop and HBase namenode / masters, and 3
>>>>> host the datanodes / regionservers.
>>>>>
>>>>> The table used for testing is first created, then I insert a set of rows
>>>>> sequentially and count the number of rows inserted per second.
>>>>>
>>>>> I insert rows in batches of 1000 (using HTable.put(List<Put>)).
>>>>>
>>>>> When reading, I also read sequentially, using a scanner (scanner caching
>>>>> is set to 1024 rows).
>>>>>
>>>>> Maybe our installation of LZO is not good?
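
To make the read side of such a benchmark concrete, here is a minimal sketch of a
sequential scan with scanner caching set to 1024, assuming the 0.20-era client API;
the table name is a placeholder and this is not the poster's actual code:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanBenchmark {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "compression_test"); // placeholder name

    Scan scan = new Scan();
    scan.setCaching(1024); // scanner caching, as described above

    long start = System.currentTimeMillis();
    long rows = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        rows++;
      }
    } finally {
      scanner.close();
    }
    long elapsedMs = Math.max(1, System.currentTimeMillis() - start);
    System.out.println(rows + " rows read, " + (1000.0 * rows / elapsedMs) + " rows/s");
  }
}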
>>>>>
>>>>>
>>>>> On 23/02/10 22:15, Jean-Daniel Cryans wrote:
>>>>>
>>>>>>
>>>>>> Vincent,
>>>>>>
>>>>>> I don't expect that either, can you give us more info about your test
>>>>>> environment?
>>>>>>
>>>>>> Thx,
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Tue, Feb 23, 2010 at 10:39 AM, Vincent Barat <vincent.barat@ubikod.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I did some testing to figure out which compression algo I should use
>>>>>>> for my HBase tables. I thought that LZO was the good candidate, but it
>>>>>>> appears that it is the worst one.
>>>>>>>
>>>>>>> I use one table with 2 families and 10 columns. Each row has a total of
>>>>>>> 200 to 400 bytes.
>>>>>>>
>>>>>>> Here are my results:
>>>>>>>
>>>>>>> GZIP:           2600 to 3200 inserts/s  12000 to 15000 reads/s
>>>>>>> NO COMPRESSION: 2000 to 2600 inserts/s   4900 to  5020 reads/s
>>>>>>> LZO:            1600 to 2100 inserts/s   4020 to  4600 reads/s
>>>>>>>
>>>>>>> Do you have an explanation for this? I thought that LZO compression
>>>>>>> was always faster at compression and decompression than GZIP?
