hive-user mailing list archives

From Bejoy KS <bejoy...@yahoo.com>
Subject Re: How to run big queries in optimized way ?
Date Fri, 21 Sep 2012 06:06:33 GMT
Hi Mapred Learn


Please find my replies inline


> What are ways to reduce stress on our cluster for running many such big queries( include
joins too) in parallel ?
For some queries, the generated MapReduce jobs can run in parallel; to enable that, set
'hive.exec.parallel' to 'true'.
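
A minimal sketch of how this looks in a Hive session (the thread count shown is the default, included only for illustration):

```sql
-- Allow independent stages of a multi-stage query to run concurrently
set hive.exec.parallel=true;
-- Optional: cap how many stages run at once (8 is the default)
set hive.exec.parallel.thread.number=8;
```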


> How to enable compression etc for intermediate hive output ?

You can enable compression between the MapReduce jobs of a multi-stage query using
'hive.exec.compress.intermediate'.

Compression for the generated MapReduce jobs can be enabled with the following properties:

// final output compression
hive.exec.compress.output
mapred.output.compress
mapred.output.compression.type
mapred.output.compression.codec

// map output compression
mapred.compress.map.output
mapred.map.output.compression.codec
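
Putting the above together, a session might set them like this (a sketch; it assumes SnappyCodec is installed on the cluster, substitute GzipCodec or another available codec otherwise):

```sql
-- Compress data passed between the MR jobs of a multi-stage query
set hive.exec.compress.intermediate=true;

-- Compress the final job output
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress map output within each job
set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```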


> How to make job cache does not go to high etc ?
Hive determines the number of mappers intelligently, but in some cases you need to specify
a suitable number of reducers for your data set. If you have sufficient memory allocated
to your child JVMs and the slots are properly configured, there is little chance of
OOMs. Also, for processing large volumes of data you may need to increase the Hive server heap
size, since the number of splits can be immensely large, and to avoid a resource crunch
when executing multiple queries in parallel.
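
The reducer count can be tuned either way; a sketch (the byte and task values are illustrative, pick them for your own data volume):

```sql
-- Let Hive size the reducer count from input volume (bytes per reducer)
set hive.exec.reducers.bytes.per.reducer=1000000000;

-- Or fix the reducer count explicitly for a known data set
set mapred.reduce.tasks=32;
```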


> In short , best practices for huge queries on hive ?
You can use Hive's merge settings, if required, to avoid the small-files issue caused by
queries. Beyond that, optimization depends entirely on what your queries do; you can apply
join optimizations, group-by optimizations, etc. as appropriate.
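
A sketch of those knobs in one place (the merge size is an illustrative value; these properties are from the Hive configuration docs):

```sql
-- Merge small output files at the end of map-only and map-reduce jobs
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;

-- Convert joins against a small table into map-side joins
set hive.auto.convert.join=true;

-- Do partial aggregation on the map side for GROUP BY
set hive.map.aggr=true;
```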


 
Regards,
Bejoy KS


----- Original Message -----
From: MiaoMiao <liy099@gmail.com>
To: user@hive.apache.org
Cc: 
Sent: Friday, September 21, 2012 8:10 AM
Subject: Re: How to run big queries in optimized way ?

Hive implements a format named RCFILE, which can give better
performance, but in my project it merely ties with the plain-text
format.
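
For reference, using RCFILE is just a storage clause on the table; a sketch with hypothetical table and column names:

```sql
-- Hypothetical table stored in the columnar RCFILE format
CREATE TABLE events_rc (user_id STRING, ts BIGINT, url STRING)
STORED AS RCFILE;

-- Populate it from an existing plain-text table (name assumed)
INSERT OVERWRITE TABLE events_rc
SELECT user_id, ts, url FROM events_text;
```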

Hive also has an index feature, but it is not very convenient or practical.

I think the best way to optimize is still to reuse the same source
tables, avoid sub-queries, and merge HiveQL statements as much as possible.

On Fri, Sep 21, 2012 at 10:30 AM, Mapred Learn <mapred.learn@gmail.com> wrote:
> Hi,
> We have datasets which are about 10-15 TB in size.
>
> We want to run hive queries on top of this input data.
>
> What are ways to reduce stress on our cluster for running many such big queries( include
joins too) in parallel ?
> How to enable compression etc for intermediate hive output ?
> How to make job cache does not go to high etc ?
> In short , best practices for huge queries on hive ?
>
> Any inputs are really appreciated !
>
> Thanks,
> JJ
>
> Sent from my iPhone
