hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: column statistics for non-primitive types
Date Tue, 14 Jun 2016 20:21:38 GMT
Hi,

My point was we are where we are and in this juncture there is no
collection of statistics for complex columns. That may be a future
enhancement.

But then the obvious question is how useful or meaningful these statistics
is going to be?

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 21:03, Michael Häusler <michael@akatose.de> wrote:

> Hi there,
>
> there might be two topics here:
>
> 1) feasibility of stats for non-primitive columns
> 2) ease of use
>
>
> 1) feasibility of stats for non-primitive columns:
>
> Hive currently collects different kind of statistics for different kind of
> types:
> numeric values: min, max, #nulls, #distincts
> boolean values: #nulls, #trues, #falses
> string values: #nulls, #distincts, avgLength, maxLength
>
> So, it seems quite possible to also collect at least partial stats for
> top-level non-primitive columns, e.g.:
> array values: #nulls, #distincts, avgLength, maxLength
> map values: #nulls, #distincts, avgLength, maxLength
> struct values: #nulls, #distincts
> union values: #nulls, #distincts
>
>
> 2) ease of use
>
> The presence of a single non-primitive column currently breaks the use of
> the convenience shorthand to gather statistics for all columns (ANALYZE
> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
> of column statistics for hive users.
>
> Best regards
> Michael
>
>
>
> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
> Hi Michael,
>
> Statistics for columns in Hive are kept in Hive metadata table
> tab_col_stats.
>
> When I am looking at this table in Oracle, I only see statistics for
> primitives columns here. STRUCT columns do not have it as a STRUCT column
> will have to be broken into its primitive columns.  I don't think Hive has
> the means to do that.
>
> desc tab_col_stats;
>  Name
> Null?    Type
>  ------------------------------------------------------------------------
> -------- -------------------------------------------------
>  CS_ID
> NOT NULL NUMBER
>  DB_NAME
> NOT NULL VARCHAR2(128)
>  TABLE_NAME
> NOT NULL VARCHAR2(128)
>  COLUMN_NAME
> NOT NULL VARCHAR2(1000)
>  COLUMN_TYPE
> NOT NULL VARCHAR2(128)
>  TBL_ID
> NOT NULL NUMBER
>  LONG_LOW_VALUE
> NUMBER
>  LONG_HIGH_VALUE
> NUMBER
>  DOUBLE_LOW_VALUE
> NUMBER
>  DOUBLE_HIGH_VALUE
> NUMBER
>  BIG_DECIMAL_LOW_VALUE
> VARCHAR2(4000)
>  BIG_DECIMAL_HIGH_VALUE
> VARCHAR2(4000)
>  NUM_NULLS
> NOT NULL NUMBER
>  NUM_DISTINCTS
> NUMBER
>  AVG_COL_LEN
> NUMBER
>  MAX_COL_LEN
> NUMBER
>  NUM_TRUES
> NUMBER
>  NUM_FALSES
> NUMBER
>  LAST_ANALYZED
> NOT NULL NUMBER
>
>
>
>  So in summary although column type STRUCT do exit, I don't think Hive can
> cater for their statistics. Actually I don't think Oracle itself does it.
>
> HTH
>
> P.S. I am on Hive 2 and it does not.
>
> hive> analyze table foo compute statistics for columns;
> FAILED: UDFArgumentTypeException Only primitive type arguments are
> accepted but array<bigint> is passed.
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de> wrote:
>
>> Hi there,
>>
>> you can reproduce the messages below with Hive 1.2.1.
>>
>> Best regards
>> Michael
>>
>>
>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>> which version of Hive are you using?
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de> wrote:
>>
>>> Hi there,
>>>
>>>
>>> when testing column statistics I stumbled upon the following error
>>> message:
>>>
>>> DROP TABLE IF EXISTS foo;
>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>> STRUCT<key:STRING,value:STRING>);
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepted but array<bigint> is passed.
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepted but struct<key:string,value:string> is passed.
>>>
>>>
>>> 1) Basically, it seems that column statistics don't work for
>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>
>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>> columns does not work as soon as there is a non-supported column. Are there
>>> any plans to change this, so it is easier to compute statistics for all
>>> supported columns?
>>>
>>> 3) ANALYZE TABLE will only provide the first failing *type* in the error
>>> message. Especially for wide tables it would be much easier if all
>>> non-supported column *names* would be printed.
>>>
>>>
>>> Best regards
>>> Michael
>>>
>>>
>>
>>
>
>

Mime
View raw message