hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Disable Hive autogather optimization
Date Fri, 29 Apr 2016 23:35:56 GMT
Hi Unit,

*For new tables*

Disable stats autogathering in Hive when creating a new table  and
populating it

SET hive.stats.autogather=false;

*Already existing tables*

As a work-around you can try this on the already existing tables  by manually
alter the numRows to -1

ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES
('numRows'='-1');

Example

0: jdbc:hive2://rhes564:10010/default> create table testme as select * from
sales_staging limit 1000;

0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;

+-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
|           col_name            |
data_type                            |           comment           |
+-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
| # col_name                    |
data_type                                                       |
comment                     |
|                               |
NULL                                                            |
NULL                        |
| prod_id                       |
double
|                             |
| cust_id                       |
double
|                             |
| time_id                       |
string
|                             |
| channel_id                    |
double
|                             |
| promo_id                      |
double
|                             |
| quantity_sold                 |
double
|                             |
| amount_sold                   |
double
|                             |
|                               |
NULL                                                            |
NULL                        |
| # Detailed Table Information  |
NULL                                                            |
NULL                        |
| Database:                     |
oraclehadoop                                                    |
NULL                        |
| Owner:                        |
hduser                                                          |
NULL                        |
| CreateTime:                   | Sat Apr 30 00:31:17 BST
2016                                    | NULL                        |
| LastAccessTime:               |
UNKNOWN                                                         |
NULL                        |
| Retention:                    |
0                                                               |
NULL                        |
| Location:                     |
hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
NULL                        |
| Table Type:                   |
MANAGED_TABLE                                                   |
NULL                        |
| Table Parameters:             |
NULL                                                            |
NULL                        |
|                               |
COLUMN_STATS_ACCURATE                                           |
{\"BASIC_STATS\":\"true\"}  |
|                               |
numFiles                                                        |
1                           |
|                               |
numRows                                                         |
1000                        |
|                               |
rawDataSize                                                     |
54853                       |
|                               |
totalSize                                                       |
55853                       |
|                               |
transient_lastDdlTime                                           |
1461972677                  |
|                               |
NULL                                                            |
NULL                        |
| # Storage Information         |
NULL                                                            |
NULL                        |
| SerDe Library:                |
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
NULL                        |
| InputFormat:                  |
org.apache.hadoop.mapred.TextInputFormat                        |
NULL                        |
| OutputFormat:                 |
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
NULL                        |
| Compressed:                   |
No                                                              |
NULL                        |
| Num Buckets:                  |
-1                                                              |
NULL                        |
| Bucket Columns:               |
[]                                                              |
NULL                        |
| Sort Columns:                 |
[]                                                              |
NULL                        |
| Storage Desc Params:          |
NULL                                                            |
NULL                        |
|                               |
serialization.format                                            |
1                           |
+-------------------------------+-----------------------------------------------------------------+-----------------------------+--+

0: jdbc:hive2://rhes564:10010/default>* ALTER TABLE testme  SET
TBLPROPERTIES ('numRows'='-1');*

0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;

+-------------------------------+-----------------------------------------------------------------+-----------------------+--+
|           col_name            |
data_type                            |        comment        |
+-------------------------------+-----------------------------------------------------------------+-----------------------+--+
| # col_name                    |
data_type                                                       |
comment               |
|                               |
NULL                                                            |
NULL                  |
| prod_id                       |
double
|                       |
| cust_id                       |
double
|                       |
| time_id                       |
string
|                       |
| channel_id                    |
double
|                       |
| promo_id                      |
double
|                       |
| quantity_sold                 |
double
|                       |
| amount_sold                   |
double
|                       |
|                               |
NULL                                                            |
NULL                  |
| # Detailed Table Information  |
NULL                                                            |
NULL                  |
| Database:                     |
oraclehadoop                                                    |
NULL                  |
| Owner:                        |
hduser                                                          |
NULL                  |
| CreateTime:                   | Sat Apr 30 00:31:17 BST
2016                                    | NULL                  |
| LastAccessTime:               |
UNKNOWN                                                         |
NULL                  |
| Retention:                    |
0                                                               |
NULL                  |
| Location:                     |
hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
NULL                  |
| Table Type:                   |
MANAGED_TABLE                                                   |
NULL                  |
| Table Parameters:             |
NULL                                                            |
NULL                  |
|                               |
last_modified_by                                                |
hduser                |
|                               |
last_modified_time                                              |
1461973002            |
|                               |
numFiles                                                        |
1                     |
|                               |
numRows                                                         |
-1                    |
|                               |
rawDataSize                                                     |
54853                 |
|                               |
totalSize                                                       |
55853                 |
|                               |
transient_lastDdlTime                                           |
1461973002            |
|                               |
NULL                                                            |
NULL                  |
| # Storage Information         |
NULL                                                            |
NULL                  |
| SerDe Library:                |
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
NULL                  |
| InputFormat:                  |
org.apache.hadoop.mapred.TextInputFormat                        |
NULL                  |
| OutputFormat:                 |
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
NULL                  |
| Compressed:                   |
No                                                              |
NULL                  |
| Num Buckets:                  |
-1                                                              |
NULL                  |
| Bucket Columns:               |
[]                                                              |
NULL                  |
| Sort Columns:                 |
[]                                                              |
NULL                  |
| Storage Desc Params:          |
NULL                                                            |
NULL                  |
|                               |
serialization.format                                            |
1                     |
+-------------------------------+-----------------------------------------------------------------+-----------------------+--+

Hopefully that will turn off the autogather feature for existing tables.

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 29 April 2016 at 23:32, Udit Mehta <umehta@groupon.com> wrote:

> Hi,
>
> Thanks for the replies.
> We have a scenario where we have an ETL job inserting into a table with
> thousands of partitions using dynamic partitioning. We have certain SLA's
> within which we would like the job to finish and sometimes there are
> scenarios where they are missed (extra data or a busy cluster). I
> understand that stats are essential for Hive CBO but we are trying to
> explore how much overhead do these stats collection add to the job runtime.
> A lot of these tables are intermediary tables so having stats for them
> might not be entirely necessary.
>
> I just wanted to figure if there was a easy way to disable the stats and
> then compare the performance.
>
> Mich, can you give more information on how to disable it in the table
> struct as I cant find any documentation on it.
>
> Thanks again.
> Udit
>
> On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <pxiong@apache.org>
> wrote:
>
>> Hi Udit,
>>
>>     Could u be more specific about your problem? Like, what settings you
>> have, what query you run and what is the result and what result do you
>> expect?
>>
>>     From what you said, my understanding is that, you want to wipe out
>> the basic stats for existing tables? And, could u also let us know why you
>> would like to get rid of the stats? Stats is crucial for Hive CBO to work
>> and we are moving towards the direction to make table/column stats
>> collection automatically. It seems that you prefer an opposite direction.
>> There is nothing wrong here and we would like to listen to your idea and
>> motivation so that we can better design Hive stats collection. Thanks!
>>
>> Best
>> Pengcheng
>>
>>
>> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <umehta@groupon.com> wrote:
>>
>>> Any insights on this?
>>>
>>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <umehta@groupon.com> wrote:
>>>
>>>> Update: Realized this works if we create a fresh table with this config
>>>> already disabled but does not work if there is already a table created when
>>>> this config was enabled. We now need to figure out how to disable this
>>>> config for a table created when this config was true.
>>>>
>>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <umehta@groupon.com> wrote:
>>>>
>>>>> Hive version we are using is 1.2.1.
>>>>>
>>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <umehta@groupon.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We need to disable the Hive autogather stats optimization by
>>>>>> disabling "*hive.stats.autogather*" but for some reason, the config
>>>>>> change doesnt seem to go through. We modified this config in the
>>>>>> hive-site.xml and restarted the Hive metastore. We also made this
change
>>>>>> explicitly in the job but it doesnt seem to help.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *set hive.stats.autogather=false;*
>>>>>> Does anyone know the right way to disable this config since we dont
>>>>>> want to compute stats in out jobs.
>>>>>>
>>>>>> Thanks,
>>>>>> Udit
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message