hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Häusler <mich...@akatose.de>
Subject Re: column statistics for non-primitive types
Date Tue, 14 Jun 2016 21:51:01 GMT
Hi Mich,

as we are still on Hive 1.2.1, it is only working like this for basic stats.
I would welcome it though, if it would work for column statistics as well - and it seems this
feature is coming via HIVE-11160.

Best
Michael

> On 2016-06-14, at 23:42, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Hi,
> 
> Is this automatic stats update is basic statistics or for all columns?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 14 June 2016 at 22:10, Michael Häusler <michael@akatose.de <mailto:michael@akatose.de>>
wrote:
> Hi Mich,
> 
> I agree with Pengcheng here. Automatic stats gathering can be extremely useful - and
it is configurable.
> 
> E.g., initial import into Hive happens as CSV or AVRO. Then you might want to do a conversion
to ORC within Hive via create-table-as-select. At that point Hive is reading all the records
anyway and we might just as well get as much useful information as possible for stats.
> 
> Best
> Michael
> 
> 
>> On 2016-06-14, at 23:05, Pengcheng Xiong <pxiong@apache.org <mailto:pxiong@apache.org>>
wrote:
>> 
>> Hi Mich,
>> 
>>     I agree with you that column stats gathering in Hive is not cheap and comes with
overheads. This is due to the large volume of data that Hive has to process. However, this
is the price you have to pay anyway even with current "analyze table" solution.
>> 
>>    The new feature not only provides a way to make users have column stats automatically,
but also saves overhead for the "insert into" case. In this case, the new column stats are
generated incrementally, i.e., by merging with the existing stats. Without this feature, you
have to scan the whole table and compute stats. 
>> 
>>    In conclusion, this new feature should not have any more overhead than the current
solution.  
>> 
>> Best
>> Pengcheng
>> 
>> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <mailto:mich.talebzadeh@gmail.com>>
wrote:
>> hi,
>> 
>> (2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true,
it will automatically collect column stats for you when you insert into/overwrite a new table.
You can refer to HIVE-11160 for more details.
>> 
>> Not without its overheads.
>> 
>> Automatic gather stats is not new. Has been around for a good time in RDBMS and can
impact the performance of other queries running. So I am not sure it can be considered as
blessing.
>> 
>> HTH
>> 
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 14 June 2016 at 21:25, Pengcheng Xiong <pxiong@apache.org <mailto:pxiong@apache.org>>
wrote:
>> Exactly, "the useful or meaningful these statistics is going to be" (The motivation
behind).
>> 
>> Best
>> Pengcheng
>> 
>> 
>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <mailto:mich.talebzadeh@gmail.com>>
wrote:
>> 
>> Hi,
>> 
>> My point was we are where we are and in this juncture there is no collection of statistics
for complex columns. That may be a future enhancement.
>> 
>> But then the obvious question is how useful or meaningful these statistics is going
to be?
>> 
>> HTH 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 14 June 2016 at 21:03, Michael Häusler <michael@akatose.de <mailto:michael@akatose.de>>
wrote:
>> Hi there,
>> 
>> there might be two topics here:
>> 
>> 1) feasibility of stats for non-primitive columns
>> 2) ease of use
>> 
>> 
>> 1) feasibility of stats for non-primitive columns:
>> 
>> Hive currently collects different kind of statistics for different kind of types:
>> numeric values:	min, max, #nulls, #distincts
>> boolean values:	#nulls, #trues, #falses
>> string values:		#nulls, #distincts, avgLength, maxLength
>> 
>> So, it seems quite possible to also collect at least partial stats for top-level
non-primitive columns, e.g.:
>> array values:		#nulls, #distincts, avgLength, maxLength 
>> map values:		#nulls, #distincts, avgLength, maxLength
>> struct values:		#nulls, #distincts
>> union values:		#nulls, #distincts
>> 
>> 
>> 2) ease of use
>> 
>> The presence of a single non-primitive column currently breaks the use of the convenience
shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;).
Imho, this slows down adoption of column statistics for hive users.
>> 
>> Best regards
>> Michael
>> 
>> 
>> 
>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com <mailto:mich.talebzadeh@gmail.com>>
wrote:
>>> 
>>> Hi Michael,
>>> 
>>> Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.
>>> 
>>> When I am looking at this table in Oracle, I only see statistics for primitives
columns here. STRUCT columns do not have it as a STRUCT column will have to be broken into
its primitive columns.  I don't think Hive has the means to do that.
>>> 
>>> desc tab_col_stats;
>>>  Name                                                                     Null?
   Type
>>>  ------------------------------------------------------------------------ --------
-------------------------------------------------
>>>  CS_ID                                                                    NOT
NULL NUMBER
>>>  DB_NAME                                                                  NOT
NULL VARCHAR2(128)
>>>  TABLE_NAME                                                               NOT
NULL VARCHAR2(128)
>>>  COLUMN_NAME                                                              NOT
NULL VARCHAR2(1000)
>>>  COLUMN_TYPE                                                              NOT
NULL VARCHAR2(128)
>>>  TBL_ID                                                                   NOT
NULL NUMBER
>>>  LONG_LOW_VALUE                                                             
      NUMBER
>>>  LONG_HIGH_VALUE                                                            
      NUMBER
>>>  DOUBLE_LOW_VALUE                                                           
      NUMBER
>>>  DOUBLE_HIGH_VALUE                                                          
      NUMBER
>>>  BIG_DECIMAL_LOW_VALUE                                                      
      VARCHAR2(4000)
>>>  BIG_DECIMAL_HIGH_VALUE                                                     
      VARCHAR2(4000)
>>>  NUM_NULLS                                                                NOT
NULL NUMBER
>>>  NUM_DISTINCTS                                                              
      NUMBER
>>>  AVG_COL_LEN                                                                
      NUMBER
>>>  MAX_COL_LEN                                                                
      NUMBER
>>>  NUM_TRUES                                                                  
      NUMBER
>>>  NUM_FALSES                                                                 
      NUMBER
>>>  LAST_ANALYZED                                                            NOT
NULL NUMBER
>>> 
>>> 
>>> 
>>>  So in summary although column type STRUCT do exit, I don't think Hive can cater
for their statistics. Actually I don't think Oracle itself does it.
>>> 
>>> HTH
>>> 
>>> P.S. I am on Hive 2 and it does not.
>>> 
>>> hive> analyze table foo compute statistics for columns;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but
array<bigint> is passed.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de <mailto:michael@akatose.de>>
wrote:
>>> Hi there,
>>> 
>>> you can reproduce the messages below with Hive 1.2.1.
>>> 
>>> Best regards
>>> Michael
>>> 
>>> 
>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com <mailto:mich.talebzadeh@gmail.com>>
wrote:
>>>> 
>>>> which version of Hive are you using?
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>  
>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>>  
>>>> 
>>>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de <mailto:michael@akatose.de>>
wrote:
>>>> Hi there,
>>>> 
>>>> 
>>>> when testing column statistics I stumbled upon the following error message:
>>>> 
>>>> DROP TABLE IF EXISTS foo;
>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);
>>>> 
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted
but array<bigint> is passed.
>>>> 
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted
but struct<key:string,value:string> is passed.
>>>> 
>>>> 
>>>> 1) Basically, it seems that column statistics don't work for non-primitive
types. Are there any workarounds or any plans to change this?
>>>> 
>>>> 2) Furthermore, the convenience syntax to compute statistics for all columns
does not work as soon as there is a non-supported column. Are there any plans to change this,
so it is easier to compute statistics for all supported columns?
>>>> 
>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the error
message. Especially for wide tables it would be much easier if all non-supported column *names*
would be printed.
>>>> 
>>>> 
>>>> Best regards
>>>> Michael
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
>> 
> 
> 


Mime
View raw message