hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Udit Mehta <ume...@groupon.com>
Subject Re: Disable Hive autogather optimization
Date Sat, 30 Apr 2016 00:39:51 GMT
thanks Mich. I will test this out and get back to you!

On Fri, Apr 29, 2016 at 4:42 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> apologies should read "Udit"
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 April 2016 at 00:35, Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
>> Hi Unit,
>>
>> *For new tables*
>>
>> Disable stats autogathering in Hive when creating a new table  and
>> populating it
>>
>> SET hive.stats.autogather=false;
>>
>> *Already existing tables*
>>
>> As a work-around you can try this on the already existing tables  by manually
>> alter the numRows to -1
>>
>> ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES
>> ('numRows'='-1');
>>
>> Example
>>
>> 0: jdbc:hive2://rhes564:10010/default> create table testme as select *
>> from sales_staging limit 1000;
>>
>> 0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;
>>
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
>> |           col_name            |
>> data_type                            |           comment           |
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
>> | # col_name                    |
>> data_type                                                       |
>> comment                     |
>> |                               |
>> NULL                                                            |
>> NULL                        |
>> | prod_id                       |
>> double
>> |                             |
>> | cust_id                       |
>> double
>> |                             |
>> | time_id                       |
>> string
>> |                             |
>> | channel_id                    |
>> double
>> |                             |
>> | promo_id                      |
>> double
>> |                             |
>> | quantity_sold                 |
>> double
>> |                             |
>> | amount_sold                   |
>> double
>> |                             |
>> |                               |
>> NULL                                                            |
>> NULL                        |
>> | # Detailed Table Information  |
>> NULL                                                            |
>> NULL                        |
>> | Database:                     |
>> oraclehadoop                                                    |
>> NULL                        |
>> | Owner:                        |
>> hduser                                                          |
>> NULL                        |
>> | CreateTime:                   | Sat Apr 30 00:31:17 BST
>> 2016                                    | NULL                        |
>> | LastAccessTime:               |
>> UNKNOWN                                                         |
>> NULL                        |
>> | Retention:                    |
>> 0                                                               |
>> NULL                        |
>> | Location:                     |
>> hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
>> NULL                        |
>> | Table Type:                   |
>> MANAGED_TABLE                                                   |
>> NULL                        |
>> | Table Parameters:             |
>> NULL                                                            |
>> NULL                        |
>> |                               |
>> COLUMN_STATS_ACCURATE                                           |
>> {\"BASIC_STATS\":\"true\"}  |
>> |                               |
>> numFiles                                                        |
>> 1                           |
>> |                               |
>> numRows                                                         |
>> 1000                        |
>> |                               |
>> rawDataSize                                                     |
>> 54853                       |
>> |                               |
>> totalSize                                                       |
>> 55853                       |
>> |                               |
>> transient_lastDdlTime                                           |
>> 1461972677                  |
>> |                               |
>> NULL                                                            |
>> NULL                        |
>> | # Storage Information         |
>> NULL                                                            |
>> NULL                        |
>> | SerDe Library:                |
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
>> NULL                        |
>> | InputFormat:                  |
>> org.apache.hadoop.mapred.TextInputFormat                        |
>> NULL                        |
>> | OutputFormat:                 |
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
>> NULL                        |
>> | Compressed:                   |
>> No                                                              |
>> NULL                        |
>> | Num Buckets:                  |
>> -1                                                              |
>> NULL                        |
>> | Bucket Columns:               |
>> []                                                              |
>> NULL                        |
>> | Sort Columns:                 |
>> []                                                              |
>> NULL                        |
>> | Storage Desc Params:          |
>> NULL                                                            |
>> NULL                        |
>> |                               |
>> serialization.format                                            |
>> 1                           |
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
>>
>> 0: jdbc:hive2://rhes564:10010/default>* ALTER TABLE testme  SET
>> TBLPROPERTIES ('numRows'='-1');*
>>
>> 0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;
>>
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
>> |           col_name            |
>> data_type                            |        comment        |
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
>> | # col_name                    |
>> data_type                                                       |
>> comment               |
>> |                               |
>> NULL                                                            |
>> NULL                  |
>> | prod_id                       |
>> double
>> |                       |
>> | cust_id                       |
>> double
>> |                       |
>> | time_id                       |
>> string
>> |                       |
>> | channel_id                    |
>> double
>> |                       |
>> | promo_id                      |
>> double
>> |                       |
>> | quantity_sold                 |
>> double
>> |                       |
>> | amount_sold                   |
>> double
>> |                       |
>> |                               |
>> NULL                                                            |
>> NULL                  |
>> | # Detailed Table Information  |
>> NULL                                                            |
>> NULL                  |
>> | Database:                     |
>> oraclehadoop                                                    |
>> NULL                  |
>> | Owner:                        |
>> hduser                                                          |
>> NULL                  |
>> | CreateTime:                   | Sat Apr 30 00:31:17 BST
>> 2016                                    | NULL                  |
>> | LastAccessTime:               |
>> UNKNOWN                                                         |
>> NULL                  |
>> | Retention:                    |
>> 0                                                               |
>> NULL                  |
>> | Location:                     |
>> hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
>> NULL                  |
>> | Table Type:                   |
>> MANAGED_TABLE                                                   |
>> NULL                  |
>> | Table Parameters:             |
>> NULL                                                            |
>> NULL                  |
>> |                               |
>> last_modified_by                                                |
>> hduser                |
>> |                               |
>> last_modified_time                                              |
>> 1461973002            |
>> |                               |
>> numFiles                                                        |
>> 1                     |
>> |                               |
>> numRows                                                         |
>> -1                    |
>> |                               |
>> rawDataSize                                                     |
>> 54853                 |
>> |                               |
>> totalSize                                                       |
>> 55853                 |
>> |                               |
>> transient_lastDdlTime                                           |
>> 1461973002            |
>> |                               |
>> NULL                                                            |
>> NULL                  |
>> | # Storage Information         |
>> NULL                                                            |
>> NULL                  |
>> | SerDe Library:                |
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
>> NULL                  |
>> | InputFormat:                  |
>> org.apache.hadoop.mapred.TextInputFormat                        |
>> NULL                  |
>> | OutputFormat:                 |
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
>> NULL                  |
>> | Compressed:                   |
>> No                                                              |
>> NULL                  |
>> | Num Buckets:                  |
>> -1                                                              |
>> NULL                  |
>> | Bucket Columns:               |
>> []                                                              |
>> NULL                  |
>> | Sort Columns:                 |
>> []                                                              |
>> NULL                  |
>> | Storage Desc Params:          |
>> NULL                                                            |
>> NULL                  |
>> |                               |
>> serialization.format                                            |
>> 1                     |
>>
>> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
>>
>> Hopefully that will turn off the autogather feature for existing tables.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 April 2016 at 23:32, Udit Mehta <umehta@groupon.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks for the replies.
>>> We have a scenario where we have an ETL job inserting into a table with
>>> thousands of partitions using dynamic partitioning. We have certain SLA's
>>> within which we would like the job to finish and sometimes there are
>>> scenarios where they are missed (extra data or a busy cluster). I
>>> understand that stats are essential for Hive CBO but we are trying to
>>> explore how much overhead do these stats collection add to the job runtime.
>>> A lot of these tables are intermediary tables so having stats for them
>>> might not be entirely necessary.
>>>
>>> I just wanted to figure if there was a easy way to disable the stats and
>>> then compare the performance.
>>>
>>> Mich, can you give more information on how to disable it in the table
>>> struct as I cant find any documentation on it.
>>>
>>> Thanks again.
>>> Udit
>>>
>>> On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <pxiong@apache.org>
>>> wrote:
>>>
>>>> Hi Udit,
>>>>
>>>>     Could u be more specific about your problem? Like, what settings
>>>> you have, what query you run and what is the result and what result do you
>>>> expect?
>>>>
>>>>     From what you said, my understanding is that, you want to wipe out
>>>> the basic stats for existing tables? And, could u also let us know why you
>>>> would like to get rid of the stats? Stats is crucial for Hive CBO to work
>>>> and we are moving towards the direction to make table/column stats
>>>> collection automatically. It seems that you prefer an opposite direction.
>>>> There is nothing wrong here and we would like to listen to your idea and
>>>> motivation so that we can better design Hive stats collection. Thanks!
>>>>
>>>> Best
>>>> Pengcheng
>>>>
>>>>
>>>> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <umehta@groupon.com> wrote:
>>>>
>>>>> Any insights on this?
>>>>>
>>>>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <umehta@groupon.com>
>>>>> wrote:
>>>>>
>>>>>> Update: Realized this works if we create a fresh table with this
>>>>>> config already disabled but does not work if there is already a table
>>>>>> created when this config was enabled. We now need to figure out how
to
>>>>>> disable this config for a table created when this config was true.
>>>>>>
>>>>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <umehta@groupon.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hive version we are using is 1.2.1.
>>>>>>>
>>>>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <umehta@groupon.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We need to disable the Hive autogather stats optimization
by
>>>>>>>> disabling "*hive.stats.autogather*" but for some reason,
the
>>>>>>>> config change doesnt seem to go through. We modified this
config in the
>>>>>>>> hive-site.xml and restarted the Hive metastore. We also made
this change
>>>>>>>> explicitly in the job but it doesnt seem to help.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *set hive.stats.autogather=false;*
>>>>>>>> Does anyone know the right way to disable this config since
we dont
>>>>>>>> want to compute stats in out jobs.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Udit
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message