hive-user mailing list archives

From Mapred Learn <mapred.le...@gmail.com>
Subject Re: Want to improve the performance for execution of Hive Jobs.
Date Tue, 08 May 2012 05:49:31 GMT
Try setting this value to your block size. The property is given in bytes, so for a 128 MB block size:

> set mapred.min.split.size=134217728
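
Since the split-size properties are expressed in bytes, the conversion from a block size in megabytes can be sketched as below. This is only an illustration: the helper names `mb_to_bytes` and `hive_set_statement` are hypothetical and not part of Hive or Hadoop, and the sketch assumes 1 MB means 1024 * 1024 bytes.

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert a size in MB to bytes, assuming 1 MB = 1024 * 1024 bytes."""
    return megabytes * 1024 * 1024


def hive_set_statement(prop: str, value: int) -> str:
    """Format a Hive CLI 'set' statement for a property (hypothetical helper)."""
    return f"set {prop}={value};"


# A 128 MB block size expressed in bytes.
block_size_bytes = mb_to_bytes(128)

# Prints the statement that could be pasted into the Hive CLI.
print(hive_set_statement("mapred.min.split.size", block_size_bytes))
```

For a 128 MB block this yields 134217728 bytes.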

Sent from my iPhone

On May 7, 2012, at 10:11 PM, Bhavesh Shah <bhavesh25shah@gmail.com> wrote:

> Thanks Nitin for your reply.
> 
> In short, my task is:
> 1) First, I import the data from MS SQL Server into HDFS using Sqoop.
> 2) In Hive, I process the data and generate the result in one table.
> 3) The table containing that result is then exported from Hive back to MS SQL Server.
> 
> The data I am importing from MS SQL Server is very large (about 500,000 entries in one table; likewise, I have 30 tables). To process it I have written a Hive task that contains only queries, and each query uses a lot of joins. Because of this, performance is very poor on my single local machine (it takes about 3 hours to execute completely). I have also observed that when I submit a single query to the Hive CLI, it takes 10-11 jobs to execute completely.
> 
> set mapred.min.split.size
> set mapred.max.split.size
> Should these values be set in a bootstrap action when submitting jobs to Amazon EMR? I don't know what values to set for them.
> 
> 
> -- 
> Regards,
> Bhavesh Shah
> 
> 
> On Tue, May 8, 2012 at 10:31 AM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
> 1) Check the JobTracker URL to see how many mappers and reducers have been launched.
> 2) If you have a large dataset and want it processed quickly, set mapred.min.split.size and mapred.max.split.size to suitable values so that more mappers are launched and finish faster.
> 3) If you are doing joins, there are different ways to go, depending on the data you have and its size.
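
How the two split-size settings influence the number of mappers can be sketched roughly as follows. This assumes the split-size rule used by Hadoop's `FileInputFormat` (split size = max(min size, min(max size, block size))) and a single splittable input file. It ignores details such as Hadoop's tolerance for a slightly oversized last split, and Hive may use a combining input format that behaves differently. The function names are hypothetical.

```python
import math

MB = 1024 * 1024  # bytes per megabyte (assumed 1 MB = 1024 * 1024 bytes)


def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Split size under the assumed rule: max(min, min(max, block))."""
    return max(min_size, min(max_size, block_size))


def estimate_map_tasks(file_size: int, block_size: int,
                       min_size: int, max_size: int) -> int:
    """Rough mapper count for one splittable file (an estimate only)."""
    split_size = compute_split_size(block_size, min_size, max_size)
    return math.ceil(file_size / split_size)


file_size = 1024 * MB   # hypothetical 1 GB input file
block_size = 128 * MB   # block size mentioned in the thread
no_limit = 2 ** 63 - 1  # stands in for "no maximum configured"

# No tuning: one split per block.
print(estimate_map_tasks(file_size, block_size, 1, no_limit))

# Lowering the maximum split size to 32 MB yields more, smaller splits.
print(estimate_map_tasks(file_size, block_size, 1, 32 * MB))

# Raising the minimum split size to 256 MB yields fewer, larger splits.
print(estimate_map_tasks(file_size, block_size, 256 * MB, no_limit))
```

Under this rule, a minimum split size equal to the block size leaves the mapper count unchanged. It is a smaller maximum split size that increases the number of mappers, so the JobTracker counts are worth checking after any change.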
> 
> It would be helpful if you could let us know your data sizes and query details.
> 
> 
> On Tue, May 8, 2012 at 10:07 AM, Bhavesh Shah <bhavesh25shah@gmail.com> wrote:
> Hello all,
> I have written Hive JDBC code and packaged it as a JAR. I am running that JAR on a 10-node cluster.
> The problem is that even on the 10-node cluster, the performance is the same as on a single node.
> 
> What can I do to improve the performance of Hive jobs? Are there any configuration settings to apply before submitting Hive jobs to the cluster?
> One more thing I want to know: how can we tell whether a job is running on all nodes of the cluster?
> 
> Please let me know if anyone knows about this.
> 
> -- 
> Regards,
> Bhavesh Shah
> 
> 
> 
> 
> -- 
> Nitin Pawar
> 
> 
> 
