hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Häusler <mich...@akatose.de>
Subject Re: column statistics for non-primitive types
Date Tue, 14 Jun 2016 20:03:11 GMT
Hi there,

there might be two topics here:

1) feasibility of stats for non-primitive columns
2) ease of use


1) feasibility of stats for non-primitive columns:

Hive currently collects different kind of statistics for different kind of types:
numeric values:	min, max, #nulls, #distincts
boolean values:	#nulls, #trues, #falses
string values:		#nulls, #distincts, avgLength, maxLength

So, it seems quite possible to also collect at least partial stats for top-level non-primitive
columns, e.g.:
array values:		#nulls, #distincts, avgLength, maxLength 
map values:		#nulls, #distincts, avgLength, maxLength
struct values:		#nulls, #distincts
union values:		#nulls, #distincts


2) ease of use

The presence of a single non-primitive column currently breaks the use of the convenience
shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;).
Imho, this slows down adoption of column statistics for hive users.

Best regards
Michael



> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Hi Michael,
> 
> Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.
> 
> When I am looking at this table in Oracle, I only see statistics for primitives columns
here. STRUCT columns do not have it as a STRUCT column will have to be broken into its primitive
columns.  I don't think Hive has the means to do that.
> 
> desc tab_col_stats;
>  Name                                                                     Null?    Type
>  ------------------------------------------------------------------------ -------- -------------------------------------------------
>  CS_ID                                                                    NOT NULL NUMBER
>  DB_NAME                                                                  NOT NULL VARCHAR2(128)
>  TABLE_NAME                                                               NOT NULL VARCHAR2(128)
>  COLUMN_NAME                                                              NOT NULL VARCHAR2(1000)
>  COLUMN_TYPE                                                              NOT NULL VARCHAR2(128)
>  TBL_ID                                                                   NOT NULL NUMBER
>  LONG_LOW_VALUE                                                                    NUMBER
>  LONG_HIGH_VALUE                                                                   NUMBER
>  DOUBLE_LOW_VALUE                                                                  NUMBER
>  DOUBLE_HIGH_VALUE                                                                 NUMBER
>  BIG_DECIMAL_LOW_VALUE                                                             VARCHAR2(4000)
>  BIG_DECIMAL_HIGH_VALUE                                                            VARCHAR2(4000)
>  NUM_NULLS                                                                NOT NULL NUMBER
>  NUM_DISTINCTS                                                                     NUMBER
>  AVG_COL_LEN                                                                       NUMBER
>  MAX_COL_LEN                                                                       NUMBER
>  NUM_TRUES                                                                         NUMBER
>  NUM_FALSES                                                                        NUMBER
>  LAST_ANALYZED                                                            NOT NULL NUMBER
> 
> 
> 
>  So in summary although column type STRUCT do exit, I don't think Hive can cater for
their statistics. Actually I don't think Oracle itself does it.
> 
> HTH
> 
> P.S. I am on Hive 2 and it does not.
> 
> hive> analyze table foo compute statistics for columns;
> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint>
is passed.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de <mailto:michael@akatose.de>>
wrote:
> Hi there,
> 
> you can reproduce the messages below with Hive 1.2.1.
> 
> Best regards
> Michael
> 
> 
>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com <mailto:mich.talebzadeh@gmail.com>>
wrote:
>> 
>> which version of Hive are you using?
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de <mailto:michael@akatose.de>>
wrote:
>> Hi there,
>> 
>> 
>> when testing column statistics I stumbled upon the following error message:
>> 
>> DROP TABLE IF EXISTS foo;
>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>);
>> 
>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint>
is passed.
>> 
>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string>
is passed.
>> 
>> 
>> 1) Basically, it seems that column statistics don't work for non-primitive types.
Are there any workarounds or any plans to change this?
>> 
>> 2) Furthermore, the convenience syntax to compute statistics for all columns does
not work as soon as there is a non-supported column. Are there any plans to change this, so
it is easier to compute statistics for all supported columns?
>> 
>> 3) ANALYZE TABLE will only provide the first failing *type* in the error message.
Especially for wide tables it would be much easier if all non-supported column *names* would
be printed.
>> 
>> 
>> Best regards
>> Michael
>> 
>> 
> 
> 


Mime
View raw message