hbase-user mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject RE: HBase performance tuning
Date Fri, 28 Mar 2008 12:09:04 GMT
Ok, so I picked up and modified the code for my use and tried it with
different configurations, varying the number of reducers in each run
(10, 20, 40, 80, 200). The best throughput I could get (with 200
reducers) was 4306 inserts/sec, with a total runtime of 17 min. for
4.38 million seeds.
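As a quick back-of-the-envelope check of the throughput figures above (plain Java, purely illustrative):

```java
// Sanity-check the quoted figures: 4.38 million seeds inserted over the
// stated wall-clock times (17 min map-reduce vs. 12 min threaded client).
public class Throughput {
    public static long rowsPerSec(long rows, long minutes) {
        return rows / (minutes * 60);
    }

    public static void main(String[] args) {
        System.out.println(rowsPerSec(4_380_000, 17)); // ~4294/sec, close to the reported 4306/sec
        System.out.println(rowsPerSec(4_380_000, 12)); // ~6083/sec for the 12-min threaded run
    }
}
```

So the 12-min threaded run works out to roughly 6000 inserts/sec, noticeably ahead of the map-reduce run.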

Using my threaded client running 200 threads, I managed the same number
of inserts in 12 min.
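The shape of such a threaded client can be sketched as below. This is only an illustrative skeleton in modern Java: the HBase-specific work (building the BatchUpdate and calling seedlist.commit) is hidden behind a Consumer so the threading part stands alone, and all names here are made up.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Illustrative skeleton of a multithreaded insert client. The commit
// callback stands in for the HBase call (BatchUpdate + HTable.commit).
public class ThreadedInserter {
    public static long insertAll(List<String> seeds, int nThreads,
                                 Consumer<String> commit) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        AtomicLong inserted = new AtomicLong();
        for (String url : seeds) {
            pool.submit(() -> {
                commit.accept(url); // real client: build BatchUpdate, seedlist.commit(update)
                inserted.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return inserted.get();
    }
}
```

With 200 threads the pool keeps many commits in flight at once, which is what hides the per-row round-trip latency to the regionservers.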

Looks like the Map-Reduce insert is slower than our regular threaded
insert. Can we gain performance via any other tweak?
If not, then is there any reasonable scope for performance improvement
of HBase via code optimization?

(I wouldn't mind taking a deep dive into the code to optimize core
HBase memory structures and contribute to HBase.)
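On the key-distribution point from the earlier reply (randomizing keys so inserts spread across the regionservers), the md5(...) step in the quoted insert loop below might look something like this in plain Java; the helper name is just illustrative:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative row-key hashing: MD5 turns URL keys (which cluster by
// hostname) into uniformly distributed hex strings, so inserts spread
// across all regionservers instead of hammering one region.
public class Md5Key {
    public static String md5(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            // Zero-pad to 32 hex chars so every row key has the same width.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```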

Thanks
-Ankur

    

-----Original Message-----
From: stack [mailto:stack@duboce.net] 
Sent: Thursday, March 27, 2008 12:05 AM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase performance tuning

I just posted EXAMPLE code to the hbase MR wiki page: 
http://wiki.apache.org/hadoop/Hbase/MapReduce
St.Ack




Naama Kraus wrote:
> Hi,
>
> A sample MapReduce for an insert would be interesting to me also !
>
> Naama
>
> On Tue, Mar 25, 2008 at 3:54 PM, stack <stack@duboce.net> wrote:
>
>   
>> Your insert is single-threaded?  At a minimum your program should be 
>> multithreaded.  Randomize the keys on your data so that the inserts 
>> are spread across your 9 regionservers.  Better if you spend a bit of
>> time and write a mapreduce job to do the insert (If you want a 
>> sample, write the list again and I'll put something together).
>> St.Ack
>>
>> ANKUR GOEL wrote:
>>     
>>> Hi Folks,
>>>             I have a table with the following column families in the
>>> schema (the integer in each entry is the max length):
>>>        {"referer_id:", "100"},
>>>        {"url:","1500"},
>>>        {"site:","500"},
>>>        {"status:","100"}
>>>
>>> The common attributes for all the above column families are [max 
>>> versions: 1,  compression: NONE, in memory: false, block cache 
>>> enabled: true, max length: 100, bloom filter: none]
>>>
>>> [HBase Configuration]:
>>>   - HDFS runs on 10 machine nodes with 8 GB RAM each and 4 CPU cores.
>>>   - HMaster runs on a different machine than NameNode.
>>>   - There are 9 regionservers configured.
>>>   - Total DFS available = 150 GB.
>>>   - LAN speed is 100 Mbps.
>>>
>>> I am trying to insert approx 4.8 million rows and the speed that I
>>> get is around 1500 row inserts per sec (100,000 row inserts per min.).
>>>
>>> It takes around 50 min to insert all the seeds. The Java program
>>> that does the inserts uses buffered I/O to read the data from a
>>> local file and runs on the same machine as the HMaster. To give you
>>> an idea of the Java code that does the insert, here is a snapshot of
>>> the loop:
>>>
>>> while ((url = seedReader.readLine()) != null) {
>>>     try {
>>>         BatchUpdate update = new BatchUpdate(new Text(md5(normalizedUrl)));
>>>         update.put(new Text("url:"), getBytes(url));
>>>         update.put(new Text("site:"), getBytes(new URL(url).getHost()));
>>>         update.put(new Text("status:"), getBytes(status));
>>>         seedlist.commit(update); // seedlist is the HTable
>>>     }
>>> ....
>>> ....
>>>
>>> Is there a way to tune HBase to achieve better I/O speeds?
>>> Ideally I would like to reduce the total insert time to less than
>>> 15 min, i.e. achieve an insert speed of around 4500 rows/sec or more.
>>>
>>> Thanks
>>> -Ankur
>>>
>>>
>>>       
>>     
>
>
>   

