hive-user mailing list archives

From Bejoy Ks <>
Subject Re: Hive-645 is slow to insert query results to mysql
Date Fri, 09 Mar 2012 18:11:10 GMT
Hi Wei,

+1 to Viral's comments.

We have implemented quite a few production systems that use Sqoop to transfer data between HDFS and JDBC-compliant databases. Go for it; it is an awesome tool. You can find the usage instructions in the Sqoop developer guide at
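For reference, a typical Sqoop export invocation looks roughly like the following. The connection string, credentials, table name, and HDFS path are placeholders taken loosely from the query discussed in this thread, not from an actual working setup:

```shell
# Sketch of a Sqoop export from an HDFS directory into a MySQL table.
# All names, paths, and credentials below are hypothetical.
sqoop export \
  --connect jdbc:mysql://localhost/hdfs_db \
  --username uid --password pwd \
  --table dc \
  --export-dir /user/hive/warehouse/results \
  --input-fields-terminated-by '\001'
```

The `\001` field terminator matches Hive's default column delimiter; Sqoop runs the export as a parallel map-only job, which is where the bulk-load speedup comes from.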


 From: Viral Bajaria <>
To: "Lu, Wei" <>; "" <>

Sent: Friday, March 9, 2012 11:19 PM
Subject: RE: Hive-645 is slow to insert query results to mysql

Hey Wei,

I have used the UDF before and found it is only useful for summary results, not for big
datasets, because of the fault-tolerant nature of map/reduce: if a failed task is retried, its
inserts run again, so unless you have a well-defined primary key you will end up with more
rows than your query results. And you are correct in saying that this is not a bulk insert,
since the UDF executes in the select statement and processes each returned row individually.

You can try your solution using Sqoop; it seems to be the most common way of getting data
out into DBs, though I have not used it personally.


From: Lu, Wei
Sent: 3/8/2012 6:58 PM
Subject: Hive-645 is slow to insert query results to mysql

I recently tried the HIVE-645 feature to save query results directly to a MySQL table. The feature
can be found here:
The query I tried looks like this:
hive>CREATE TEMPORARY FUNCTION dboutput AS 'org.apache.hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput';
hive>SELECT dboutput('jdbc:mysql://localhost/hdfs_db','uid','pwd','INSERT INTO dc(t,c)
VALUES (?,?)',requestbegintime,count(1)) FROM impressions2 GROUP BY requestbegintime;
It works, but the reduce tasks are very slow:

Task:        task_201203081246_0001_r_000000
Complete:    100.00%
Status:      reduce > reduce
Start Time:  8-Mar-2012 12:47:06
Finish Time: 8-Mar-2012 13:33:54 (46mins, 47sec)
Counters:    11
I set the number of reducers to 4, but it is still very slow (in the end, 171,667 rows were
inserted into MySQL). I guess the reduce process didn't insert data into MySQL in batch mode;
can anyone give me some suggestions to improve the performance?
PS: I think it might be better to first save the results to HDFS and then use Sqoop to load
the data into MySQL, right?
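That two-step approach could be sketched as follows; the output directory is a placeholder, and the table and column names are taken from the query above:

```shell
# Step 1: write the aggregate to an HDFS directory instead of calling dboutput
# row-by-row from the reducers. The directory path is hypothetical.
hive -e "INSERT OVERWRITE DIRECTORY '/tmp/dc_export'
         SELECT requestbegintime, count(1)
         FROM impressions2
         GROUP BY requestbegintime;"

# Step 2: bulk-load the resulting files into the MySQL table with Sqoop,
# which uses parallel mappers rather than one insert per row.
sqoop export \
  --connect jdbc:mysql://localhost/hdfs_db \
  --username uid --password pwd \
  --table dc \
  --export-dir /tmp/dc_export \
  --input-fields-terminated-by '\001'
```

This avoids the per-row insert pattern of the dboutput UDF entirely, so the reduce phase only writes to HDFS and the database load happens in a separate, parallel step.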