hbase-user mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: Loading data from Hive to HBase takes too long
Date Mon, 19 Aug 2013 23:51:41 GMT
Hi Hao,

How are you running HBase in pseudo-distributed mode, yet with 3 slaves?
Where is the data written in EC2? EBS or local storage?
Did you do any other tuning at the HBase or HDFS level (server side)?

If your HDFS replication factor is still set to 3 (the default), then with 3 slaves you're
seeing something of a worst-case scenario: each node receives 100% of all writes, and the
speed is always dominated by your slowest machine.
How does Hive perform here when you write to HDFS directly?
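
As a rough sketch of the numbers in this thread (back-of-envelope arithmetic, not a measurement): the function names below are illustrative only, the byte count is the `-dus` figure reported for the Hive table, and "8 hours" is the approximate duration quoted. It assumes replicas are spread evenly across the DataNodes.

```python
# Back-of-envelope arithmetic using figures quoted in this thread.
# Function names are hypothetical helpers, not part of any Hadoop/HBase API.

def write_share_per_node(replication: int, datanodes: int) -> float:
    """Fraction of all written data that each DataNode ends up storing,
    assuming replicas are spread evenly across the DataNodes."""
    return min(replication, datanodes) / datanodes


def throughput_bytes_per_sec(total_bytes: int, hours: float) -> float:
    """Average load rate for total_bytes ingested over the given hours."""
    return total_bytes / (hours * 3600)


if __name__ == "__main__":
    share = write_share_per_node(replication=3, datanodes=3)
    rate = throughput_bytes_per_sec(1318012108, 8)  # size from `hadoop dfs -dus`
    print(f"each node stores {share:.0%} of the written data")
    print(f"average load rate: {rate / 1024:.1f} KiB/s")
```

With replication 3 on 3 DataNodes the share is 1.0, i.e. every node takes every write, and the average rate works out to only a few tens of KiB per second, which is why the questions above matter.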

Sorry, many questions :)

-- Lars

________________________________
From: Hao Ren <h.ren@claravista.fr>
To: user@hbase.apache.org 
Sent: Monday, August 19, 2013 1:50 AM
Subject: Re: Loading data from Hive to HBase takes too long


Update:

There are 1 master and 3 slaves in my cluster.
They are all m1.medium instances.

Instance Family:          General purpose
Instance Type:            m1.medium
Processor Arch:           32-bit or 64-bit
vCPU:                     1
ECU:                      2
Memory (GiB):             3.75
Instance Storage (GB):    1 x 410
EBS-optimized Available:  -
Network Performance:      Moderate


On 19/08/2013 10:44, Hao Ren wrote:
> Update:
>
> I messed up some queries; here are the right ones:
>
> CREATE TABLE hbase_table (
> material_id int,
> new_id_client int,
> last_purchase_date int)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = 
> ":key,cf1:idclt,cf1:dt_last_purchase")
> TBLPROPERTIES("hbase.table.name" = "test");
>
> insert OVERWRITE TABLE hbase_table
> select * from test;  -- takes a long time (about 8 hours)
>
> # bin/hadoop dfs -dus /user/hive/warehouse/test
> hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/user/hive/warehouse/test 
> 1318012108
>
> the table 'test' is just about 1.3 GB.
>
>
>
> On 19/08/2013 10:40, Hao Ren wrote:
>> Hi,
>>
>> I am running Hive and HBase on the same Amazon EC2 cluster, where 
>> HBase is in pseudo-distributed mode.
>>
>> After integrating HBase with Hive, I find that it takes a long time 
>> to run an "insert overwrite" query from Hive in order to load 
>> data into the related HBase table.
>>
>> The data is only about 1.3 GB, so I don't think this is normal.
>>
>> Maybe there is something wrong with my configuration.
>>
>> Here are some queries:
>>
>> CREATE TABLE hbase_table (
>> material_id int,
>> new_id_client int,
>> last_purchase_date int)
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES ("hbase.columns.mapping" = 
>> ":key,cf1:idclt,cf1:dt_last_purchase")
>> TBLPROPERTIES("hbase.table.name" = "test");
>>
>> insert OVERWRITE TABLE t_LIGNES_DERN_VENTES
>> select * from test;  -- takes a long time (about 8 hours)
>>
>>
>> Here are some configuration files for my cluster:
>>
>> # cat hive/conf/hive-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <configuration>
>>
>>     <property>
>>         <name>hbase.zookeeper.quorum</name>
>>         <value>ip-10-159-41-177.ec2.internal</value>
>>     </property>
>>
>>     <property>
>>         <name>hive.aux.jars.path</name>
>>         <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
>>     </property>
>>
>>     <property>
>>         <name>hbase.client.scanner.caching</name>
>>         <value>10000</value>
>>     </property>
>>
>> </configuration>
>>
>> # cat hbase-0.92.0/conf/hbase-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <configuration>
>>
>>     <property>
>>         <name>hbase.rootdir</name>
>>         <value>hdfs://ec2-54-234-17-36.compute-1.amazonaws.com:9010/hbase</value>
>>     </property>
>>
>>     <property>
>>         <name>hbase.cluster.distributed</name>
>>         <value>true</value>
>>     </property>
>>
>>     <property>
>>         <name>hbase.zookeeper.quorum</name>
>>         <value>ip-10-159-41-177.ec2.internal</value>
>>     </property>
>>
>>     <property>
>>         <name>hbase.client.scanner.caching</name>
>>         <value>10000</value>
>>     </property>
>>
>> </configuration>
>>
>> Any help is highly appreciated!
>>
>> Thank you.
>>
>> Hao
>>
>
>


-- 
Hao Ren
ClaraVista
www.claravista.fr                                
