hive-user mailing list archives

From Bhavesh Shah <bhavesh25s...@gmail.com>
Subject Re: Want to improve the performance for execution of Hive Jobs.
Date Tue, 08 May 2012 12:46:26 GMT
Thanks, Bejoy, for your reply.
Yes, I saw that a new XML file is created for every job. In it, the values
are different from whatever I set.
For example, I set mapred.map.tasks=10 and mapred.reduce.tasks=2,
but the XML for every job shows 1 for map and 0 for reduce.
The same is true of the other parameters too.
Why is that?
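
One way to check which values a job actually ran with is to read them out of its job XML and compare them with what was set. Below is a minimal Python sketch, assuming the usual Hadoop configuration layout (a `<configuration>` element holding `<property>` entries with `<name>` and `<value>`). The function names and the sample XML are made up for illustration; the sample only mirrors the values reported above.

```python
import xml.etree.ElementTree as ET


def read_job_conf(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config XML."""
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            conf[name] = prop.findtext("value")
    return conf


def diff_settings(expected, actual):
    """Return {name: (expected, actual)} for settings whose job value differs."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Hypothetical sample standing in for a real job XML (assumption, not real output).
SAMPLE_JOB_XML = """<?xml version="1.0"?>
<configuration>
  <property><name>mapred.map.tasks</name><value>1</value></property>
  <property><name>mapred.reduce.tasks</name><value>0</value></property>
</configuration>"""

# The values that were set in the Hive script.
expected = {"mapred.map.tasks": "10", "mapred.reduce.tasks": "2"}
mismatches = diff_settings(expected, read_job_conf(SAMPLE_JOB_XML))
for name, (want, got) in sorted(mismatches.items()):
    print(f"{name}: set {want}, job XML shows {got}")
```

For a real job, the XML text would be read from that job's own configuration file rather than from the sample string.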



On Tue, May 8, 2012 at 5:32 PM, Bejoy KS <bejoy_ks@yahoo.com> wrote:

> Hi Bhavesh
> On a job level, if you set/override some properties they won't go into
> mapred-site.xml. Check the corresponding job.xml to get the values. Also
> confirm from the task logs that there are no warnings about overriding
> those properties. If both checks are good, then you can confirm that the
> properties you supplied are actually used for the job.
>
> Disclaimer: I'm not an AWS guy, so I can't comment on specifics there. My
> responses relate to generic Hadoop behavior. :)
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Bhavesh Shah <bhavesh25shah@gmail.com>
> *Date: *Tue, 8 May 2012 17:15:44 +0530
> *To: *<user@hive.apache.org>; Bejoy Ks<bejoy_ks@yahoo.com>
> *ReplyTo: * user@hive.apache.org
> *Subject: *Re: Want to improve the performance for execution of Hive Jobs.
>
> Hello Bejoy KS,
> I did it the same way, by executing "hive -f <filename>" on Amazon EMR.
> When I looked at mapred-site.xml, all the variables that I set in that
> file still had their default values. I didn't see the values I set.
>
> The performance is slow too.
> I tried this on my local cluster by setting these values and I saw
> some boost in performance.
>
>
> On Tue, May 8, 2012 at 4:23 PM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:
>
>> Hi Bhavesh
>>
>>       I'm not sure about AWS, but from a quick reading, cluster-wide
>> settings like the HDFS block size can be set in hdfs-site.xml through
>> bootstrap actions. Since you are changing the HDFS block size, set the min
>> and max split sizes across the cluster using bootstrap actions as well.
>> The rest of the properties can be set at a per-job level.
>>
>> Doesn't AWS provide an option to use "hive -f"? If so, just put all
>> the properties required for tuning the query, followed by the queries (in
>> order), in a file and simply execute it using "hive -f <file name>".
>>
>> Regards
>> Bejoy KS
>>   ------------------------------
>> *From:* Bhavesh Shah <bhavesh25shah@gmail.com>
>> *To:* user@hive.apache.org; Bejoy Ks <bejoy_ks@yahoo.com>
>> *Sent:* Tuesday, May 8, 2012 3:33 PM
>>
>> *Subject:* Re: Want to improve the performance for execution of Hive
>> Jobs.
>>
>> Thanks, Bejoy KS, for your reply.
>> I want to ask one thing: if I want to set these parameters on Amazon
>> Elastic MapReduce, how can I set variables like these?
>> e.g. SET mapred.min.split.size=m;
>>       SET mapred.max.split.size=m+n;
>>       set dfs.block.size=128
>>       set mapred.compress.map.output=true
>>       set io.sort.mb=400  etc....
>>
>> For all this, do I need to write a shell script that sets these variables
>> via the path /home/hadoop/hive/bin/hive -e 'set .....',
>> or should I pass all these steps in bootstrap actions?
>>
>> I found this link on passing bootstrap actions:
>>
>> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html#BootstrapPredefined
>>
>> What should I do in this case?
>>
>>
>>
>> On Tue, May 8, 2012 at 2:55 PM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:
>>
>> Hi Bhavesh
>>
>>      In Sqoop you can optimize performance by using --direct mode for
>> import and by increasing the number of mappers used for the import. When
>> you increase the number of mappers, you need to ensure that the RDBMS
>> connection pool can handle that many connections gracefully. Also use an
>> evenly distributed column as --split-by; that will ensure that all mappers
>> are roughly equally loaded.
>>    The min split size and max split size can be set at a job level. But
>> there is a chance of a slight loss in data locality if you increase these
>> values. By increasing them you increase the data volume processed per
>> mapper, so fewer mappers run; you need to see whether this gets you
>> substantial performance gains. I haven't seen much gain there when I tried
>> it on some of my workflows in the past. A better approach would be
>> increasing the HDFS block size itself, if your cluster deals with
>> relatively large files. If you change the HDFS block size, then adjust the
>> min split and max split values accordingly.
>>     You can set the min and max split sizes using the SET command in the
>> Hive CLI itself.
>> hive> SET mapred.min.split.size=m;
>> hive> SET mapred.max.split.size=m+n;
>>
>> Regards
>> Bejoy KS
>>
>>
>>   ------------------------------
>> *From:* Bhavesh Shah <bhavesh25shah@gmail.com>
>> *To:* user@hive.apache.org
>> *Sent:* Tuesday, May 8, 2012 11:35 AM
>> *Subject:* Re: Want to improve the performance for execution of Hive
>> Jobs.
>>
>> Thanks to both of you for your replies.
>> If I decide to deploy my JAR on Amazon Elastic MapReduce, then:
>>
>> 1) The default block size is 64 MB, so in that case I have to set it to
>> 128 MB. Is that right?
>> 2) Amazon EMR already has values for mapred.min.split.size
>> and mapred.max.split.size, and for mappers and reducers too. So is there
>> any need to set the values there? If yes, how do I set them across the
>> cluster? Is it possible to set all the above parameters in
>> --bootstrap-actions.... so that they apply to all nodes when submitting
>> jobs to Amazon EMR?
>>
>> Thanks very much to both of you.
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>>
>> On Tue, May 8, 2012 at 11:19 AM, Mapred Learn <mapred.learn@gmail.com> wrote:
>>
>> Try setting this value to your block size. For a 128 MB block size
>> (the value is given in bytes):
>>
>> *set mapred.min.split.size=134217728*
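
The split-size properties take a value in bytes, so a block size quoted in megabytes has to be converted first. A small Python sketch of that arithmetic follows; the helper name is hypothetical and it only builds the SET line, it does not talk to Hive.

```python
BYTES_PER_MB = 1024 * 1024


def split_size_statement(block_size_mb):
    """Build a Hive SET line whose min split size equals the block size in bytes."""
    size_bytes = block_size_mb * BYTES_PER_MB
    return f"SET mapred.min.split.size={size_bytes};"


# 128 MB = 128 * 1024 * 1024 bytes
print(split_size_statement(128))
```

The same conversion applies to any other property that is given in bytes, such as mapred.max.split.size.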
>>
>>
>> Sent from my iPhone
>>
>> On May 7, 2012, at 10:11 PM, Bhavesh Shah <bhavesh25shah@gmail.com>
>> wrote:
>>
>> Thanks Nitin for your reply.
>>
>> In short, my task is:
>> 1) First, I import the data from MS SQL Server into HDFS using
>> Sqoop.
>> 2) Through Hive, I process the data and generate the result in one
>> table.
>> 3) That result table is then exported from Hive back to MS SQL
>> Server.
>>
>> The data that I am importing from MS SQL Server is very large
>> (about 500,000 entries in one table, and I have 30 such tables). For
>> this I have written a task in Hive that contains only queries (and each
>> query uses a lot of joins). Because of this, the performance is
>> very poor on my single local machine (it takes about 3 hours to
>> execute completely). I have observed that when I submit a single
>> query to the Hive CLI, it takes 10-11 jobs to execute completely.
>>
>> * set mapred.min.split.size
>> set mapred.max.split.size*
>> Should these values be set in a bootstrap action when submitting jobs to
>> Amazon EMR? I don't know what values to set for them.
>>
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>>
>> On Tue, May 8, 2012 at 10:31 AM, Nitin Pawar <nitinpawar432@gmail.com>
>> wrote:
>>
>> 1) Check the JobTracker URL to see how many maps/reducers have been
>> launched.
>> 2) If you have a large dataset and want to process it fast, set
>> mapred.min.split.size and mapred.max.split.size to optimal values so
>> that more mappers are launched and finish quickly.
>> 3) If you are doing joins, there are different ways to go, depending on
>> the data you have and its size.
>>
>> It will be helpful if you can let us know your data sizes and query
>> details.
>>
>>
>> On Tue, May 8, 2012 at 10:07 AM, Bhavesh Shah <bhavesh25shah@gmail.com>
>> wrote:
>>
>> Hello all,
>> I have written Hive JDBC code and created a JAR from it. I am running
>> that JAR on a 10-node cluster.
>> The problem is that even though I am using the 10-node cluster, the
>> performance is the same as on a single node.
>>
>> What can I do to improve the performance of Hive jobs? Are there any
>> configuration settings to set before submitting Hive jobs to the cluster?
>> One more thing I want to know: how can we find out whether the job is
>> running on all nodes of the cluster?
>>
>> Please let me know if anyone knows about this.
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>>
>>
>>
>
>
> --
> Regards,
> Bhavesh Shah
>
>


-- 
Regards,
Bhavesh Shah
