hbase-user mailing list archives

From Hao Ren <h....@claravista.fr>
Subject HBase-Hive integration performance issues
Date Tue, 27 Aug 2013 13:51:03 GMT
Hi,

I am running Hive and HBase on Amazon EC2. Following the tutorial at
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration , I
managed to create an HBase table from Hive and insert data into it.

It works, but performance is poor. Specifically, inserting 1.3 GB
(50 M rows, 3 columns) takes 30 minutes. That is far from what I
expected, which was around 100 s.
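
To put those figures in perspective, here is a quick back-of-the-envelope
calculation. It is only a sketch: it assumes "1.3 Gb" means 1.3 gigabytes in
decimal units, and the variable and function names are made up for
illustration.

```python
# Figures taken from the message above.
# Assumption: "1.3 Gb" means 1.3 gigabytes, with 1 GB = 1000 MB.
ROWS = 50_000_000
SIZE_MB = 1.3 * 1000
OBSERVED_SECONDS = 30 * 60   # the 30-minute insert actually measured
EXPECTED_SECONDS = 100       # the duration that was hoped for


def throughput(rows, size_mb, seconds):
    """Return (rows per second, MB per second) for a load of this duration."""
    return rows / seconds, size_mb / seconds


observed_rows_s, observed_mb_s = throughput(ROWS, SIZE_MB, OBSERVED_SECONDS)
expected_rows_s, expected_mb_s = throughput(ROWS, SIZE_MB, EXPECTED_SECONDS)
slowdown = OBSERVED_SECONDS / EXPECTED_SECONDS

print(f"observed: {observed_rows_s:,.0f} rows/s, {observed_mb_s:.2f} MB/s")
print(f"expected: {expected_rows_s:,.0f} rows/s, {expected_mb_s:.2f} MB/s")
print(f"gap: {slowdown:.0f}x")
```

In other words, the insert currently runs at roughly 27,800 rows/s (about
0.72 MB/s), while the 100 s target would need 500,000 rows/s (13 MB/s), an
18x gap.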

My EC2 cluster contains 3 slaves and 1 master, with instance type
medium (http://aws.amazon.com/ec2/instance-types/#instance-type).

Hadoop 1.0.4 is installed on my cluster. HBase is in pseudo-distributed 
mode. A region server is running on the master. HDFS is used as storage.

Here are some configuration files:

*// hive-site.xml*

<configuration>

     <property>
         <name>hbase.zookeeper.quorum</name>
         <value>ip-10-178-13-39.ec2.internal</value>
     </property>

     <property>
         <name>hive.aux.jars.path</name>
         <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
     </property>

     <property>
         <name>hbase.client.scanner.caching</name>
         <value>10000</value>
     </property>

</configuration>

*// hbase-site.xml*

<configuration>

     <property>
         <name>hbase.rootdir</name>
         <value>hdfs://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase</value>
     </property>

     <property>
         <name>hbase.cluster.distributed</name>
         <value>true</value>
     </property>

     <property>
         <name>hbase.zookeeper.quorum</name>
         <value>ip-10-178-13-39.ec2.internal</value>
     </property>

     <property>
         <name>hbase.client.scanner.caching</name>
         <value>10000</value>
     </property>

</configuration>

*To help me understand, I have some questions:*
1) To improve read performance, I have set
hbase.client.scanner.caching to 10000, but I don't know how to improve
write performance. Is there some basic configuration I should apply?
2) Does the distributed mode matter? Does fully-distributed mode give
better write performance than pseudo-distributed mode?
3) If the number of region servers is increased, will write
performance improve?
4) In pseudo-distributed mode (one HBase daemon on the master), when
writing data from Hive to an HBase table, is the master the only entry
point to HBase? I don't think passing all data through the master is
efficient. Is it possible to write data from Hive to HBase directly and
in parallel using MapReduce?
5) Will HBase bulk loading help a lot?

I am new to HBase, but I really want to integrate HBase in production.

Any help is highly appreciated ! =)

Hao

-- 
Hao Ren
ClaraVista
www.claravista.fr

