hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pengcheng Xiong <pxi...@apache.org>
Subject Re: column statistics for non-primitive types
Date Tue, 14 Jun 2016 20:23:17 GMT
Hi Michael,

    (1) We collect columns stats for the following purpose (a) Query
optimization, esp. join reordering and big/small table size estimation.
More recently, we also use it to remove filters. You can refer to Calcite
rules. (b) Answer query directly through metaStore. You can refer to the
configuration of HIVEOPTIMIZEMETADATAQUERIES(
"hive.compute.query.using.stats").

    We can do stats for non-primitive columns, but we need to know the
motivation to do so before we do it. If you can, could you please list some?

   (2) There is a configuration "hive.stats.fetch.column.stats". If you set
it to true, it will automatically collect column stats for you when you
insert into/overwrite a new table. You can refer to HIVE-11160 for more
details.

   Hope my answers help.

Thanks

Best.
Pengcheng


On Tue, Jun 14, 2016 at 1:03 PM, Michael Häusler <michael@akatose.de> wrote:

> Hi there,
>
> there might be two topics here:
>
> 1) feasibility of stats for non-primitive columns
> 2) ease of use
>
>
> 1) feasibility of stats for non-primitive columns:
>
> Hive currently collects different kind of statistics for different kind of
> types:
> numeric values: min, max, #nulls, #distincts
> boolean values: #nulls, #trues, #falses
> string values: #nulls, #distincts, avgLength, maxLength
>
> So, it seems quite possible to also collect at least partial stats for
> top-level non-primitive columns, e.g.:
> array values: #nulls, #distincts, avgLength, maxLength
> map values: #nulls, #distincts, avgLength, maxLength
> struct values: #nulls, #distincts
> union values: #nulls, #distincts
>
>
> 2) ease of use
>
> The presence of a single non-primitive column currently breaks the use of
> the convenience shorthand to gather statistics for all columns (ANALYZE
> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
> of column statistics for hive users.
>
> Best regards
> Michael
>
>
>
> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
> Hi Michael,
>
> Statistics for columns in Hive are kept in Hive metadata table
> tab_col_stats.
>
> When I am looking at this table in Oracle, I only see statistics for
> primitives columns here. STRUCT columns do not have it as a STRUCT column
> will have to be broken into its primitive columns.  I don't think Hive has
> the means to do that.
>
> desc tab_col_stats;
>  Name
> Null?    Type
>  ------------------------------------------------------------------------
> -------- -------------------------------------------------
>  CS_ID
> NOT NULL NUMBER
>  DB_NAME
> NOT NULL VARCHAR2(128)
>  TABLE_NAME
> NOT NULL VARCHAR2(128)
>  COLUMN_NAME
> NOT NULL VARCHAR2(1000)
>  COLUMN_TYPE
> NOT NULL VARCHAR2(128)
>  TBL_ID
> NOT NULL NUMBER
>  LONG_LOW_VALUE
> NUMBER
>  LONG_HIGH_VALUE
> NUMBER
>  DOUBLE_LOW_VALUE
> NUMBER
>  DOUBLE_HIGH_VALUE
> NUMBER
>  BIG_DECIMAL_LOW_VALUE
> VARCHAR2(4000)
>  BIG_DECIMAL_HIGH_VALUE
> VARCHAR2(4000)
>  NUM_NULLS
> NOT NULL NUMBER
>  NUM_DISTINCTS
> NUMBER
>  AVG_COL_LEN
> NUMBER
>  MAX_COL_LEN
> NUMBER
>  NUM_TRUES
> NUMBER
>  NUM_FALSES
> NUMBER
>  LAST_ANALYZED
> NOT NULL NUMBER
>
>
>
>  So in summary although column type STRUCT do exit, I don't think Hive can
> cater for their statistics. Actually I don't think Oracle itself does it.
>
> HTH
>
> P.S. I am on Hive 2 and it does not.
>
> hive> analyze table foo compute statistics for columns;
> FAILED: UDFArgumentTypeException Only primitive type arguments are
> accepted but array<bigint> is passed.
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de> wrote:
>
>> Hi there,
>>
>> you can reproduce the messages below with Hive 1.2.1.
>>
>> Best regards
>> Michael
>>
>>
>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>> which version of Hive are you using?
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de> wrote:
>>
>>> Hi there,
>>>
>>>
>>> when testing column statistics I stumbled upon the following error
>>> message:
>>>
>>> DROP TABLE IF EXISTS foo;
>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>> STRUCT<key:STRING,value:STRING>);
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepted but array<bigint> is passed.
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepted but struct<key:string,value:string> is passed.
>>>
>>>
>>> 1) Basically, it seems that column statistics don't work for
>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>
>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>> columns does not work as soon as there is a non-supported column. Are there
>>> any plans to change this, so it is easier to compute statistics for all
>>> supported columns?
>>>
>>> 3) ANALYZE TABLE will only provide the first failing *type* in the error
>>> message. Especially for wide tables it would be much easier if all
>>> non-supported column *names* would be printed.
>>>
>>>
>>> Best regards
>>> Michael
>>>
>>>
>>
>>
>
>

Mime
View raw message