hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pengcheng Xiong <pxi...@apache.org>
Subject Re: column statistics for non-primitive types
Date Tue, 14 Jun 2016 21:19:36 GMT
Hi Michael,

    Thanks for your reply.

    (1) We have considered your first point before as well. However, join
reordering is now done by Calcite in Hive optimizer. Calcite can not deal
with complex/structure types. Thus, we may need to first make improvement
in Calcite and then make the column stats for complex/structure type
useful. Your second point to use stats directly is valid. But we would like
to know how many use cases are there and how popular it is.

    (2) AFAIK, it only supports insert overwrite and insert into. I will
investigate the support of CTAS.

    (3) The new feature is in Hive 2.1. If possible, could u upgrade and
try?

Thanks.

Best
Pengcheng

On Tue, Jun 14, 2016 at 1:42 PM, Michael Häusler <michael@akatose.de> wrote:

> Hi Pengcheng,
>
> (1)
> statistics on non-primitive columns can be just as useful as on primitive
> columns, e.g.,
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (id BIGINT, someArray ARRAY<BIGINT>, someStruct
> STRUCT<key:STRING,value:STRING>);
>
> a) query optimization
> Let foo be a huge table that needs to be joined with another huge table
> bar like this
>
> SELECT
> f.id
> FROM
> foo f
> JOIN
> bar b
> ON
> f.id = b.id
> WHERE
> f.someArray IS NOT NULL
>
> If statistics tell us that #nulls in someArray is small, we could apply a
> different join strategy (e.g., map-side join, bar main table, filtered foo
> as hash table)
>
> b) answer query directly
>
> SELECT
> COUNT(DISTINCT someStruct)
> FROM
> foo;
>
> Such a query can easily be answered directly from stats.
>
>
>
> (2)
>
> Do you happen to know, whether HIVE-11160 also works for CTAS?
> Because a quick test of the configuration property did not work for me:
>
> hive> SET hive.stats.fetch.column.stats=true;
> hive> DROP TABLE IF EXISTS foo;
> OK
> Time taken: 6.585 seconds
> hive> CREATE TABLE foo AS
>     > SELECT
>     >     1 AS foo;
> Query ID = haeusler_20160614203002_7a47459d-349b-4012-ac7f-b2cc867b87ef
> Total jobs = 1
> Launching Job 1 out of 1
>
>
> Status: Running (Executing on YARN cluster with App id
> application_1465334589772_15920)
>
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
>  KILLED
>
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED      1          1        0        0       0
>     0
>
> --------------------------------------------------------------------------------
> VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 2.89 s
>
> --------------------------------------------------------------------------------
> Moving data to: hdfs://invcluster/user/hive/warehouse/haeusler.db/foo
> Table haeusler.foo stats: [numFiles=1, numRows=1, totalSize=194,
> rawDataSize=4]
> OK
> Time taken: 8.088 seconds
> hive> DESCRIBE FORMATTED foo.foo;
> OK
> # col_name              data_type               min
> max                     num_nulls               distinct_count
>  avg_col_len             max_col_len             num_trues
> num_falses              comment
>
> foo                     int
>
>
>               from deserializer
> Time taken: 0.197 seconds, Fetched: 3 row(s)
>
> ^^^ the table creation works, but I don't get any column stats.
>
>
> Best regards
> Michael
>
>
>
> On 2016-06-14, at 22:23, Pengcheng Xiong <pxiong@apache.org> wrote:
>
> Hi Michael,
>
>     (1) We collect columns stats for the following purpose (a) Query
> optimization, esp. join reordering and big/small table size estimation.
> More recently, we also use it to remove filters. You can refer to Calcite
> rules. (b) Answer query directly through metaStore. You can refer to the
> configuration of HIVEOPTIMIZEMETADATAQUERIES(
> "hive.compute.query.using.stats").
>
>     We can do stats for non-primitive columns, but we need to know the
> motivation to do so before we do it. If you can, could you please list some?
>
>    (2) There is a configuration "hive.stats.fetch.column.stats". If you
> set it to true, it will automatically collect column stats for you when you
> insert into/overwrite a new table. You can refer to HIVE-11160 for more
> details.
>
>    Hope my answers help.
>
> Thanks
>
> Best.
> Pengcheng
>
>
> On Tue, Jun 14, 2016 at 1:03 PM, Michael Häusler <michael@akatose.de>
> wrote:
>
>> Hi there,
>>
>> there might be two topics here:
>>
>> 1) feasibility of stats for non-primitive columns
>> 2) ease of use
>>
>>
>> 1) feasibility of stats for non-primitive columns:
>>
>> Hive currently collects different kind of statistics for different kind
>> of types:
>> numeric values: min, max, #nulls, #distincts
>> boolean values: #nulls, #trues, #falses
>> string values: #nulls, #distincts, avgLength, maxLength
>>
>> So, it seems quite possible to also collect at least partial stats for
>> top-level non-primitive columns, e.g.:
>> array values: #nulls, #distincts, avgLength, maxLength
>> map values: #nulls, #distincts, avgLength, maxLength
>> struct values: #nulls, #distincts
>> union values: #nulls, #distincts
>>
>>
>> 2) ease of use
>>
>> The presence of a single non-primitive column currently breaks the use of
>> the convenience shorthand to gather statistics for all columns (ANALYZE
>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>> of column statistics for hive users.
>>
>> Best regards
>> Michael
>>
>>
>>
>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>> Hi Michael,
>>
>> Statistics for columns in Hive are kept in Hive metadata table
>> tab_col_stats.
>>
>> When I am looking at this table in Oracle, I only see statistics for
>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>> will have to be broken into its primitive columns.  I don't think Hive has
>> the means to do that.
>>
>> desc tab_col_stats;
>>  Name
>> Null?    Type
>>  ------------------------------------------------------------------------
>> -------- -------------------------------------------------
>>  CS_ID
>> NOT NULL NUMBER
>>  DB_NAME
>> NOT NULL VARCHAR2(128)
>>  TABLE_NAME
>> NOT NULL VARCHAR2(128)
>>  COLUMN_NAME
>> NOT NULL VARCHAR2(1000)
>>  COLUMN_TYPE
>> NOT NULL VARCHAR2(128)
>>  TBL_ID
>> NOT NULL NUMBER
>>  LONG_LOW_VALUE
>> NUMBER
>>  LONG_HIGH_VALUE
>> NUMBER
>>  DOUBLE_LOW_VALUE
>> NUMBER
>>  DOUBLE_HIGH_VALUE
>> NUMBER
>>  BIG_DECIMAL_LOW_VALUE
>> VARCHAR2(4000)
>>  BIG_DECIMAL_HIGH_VALUE
>> VARCHAR2(4000)
>>  NUM_NULLS
>> NOT NULL NUMBER
>>  NUM_DISTINCTS
>> NUMBER
>>  AVG_COL_LEN
>> NUMBER
>>  MAX_COL_LEN
>> NUMBER
>>  NUM_TRUES
>> NUMBER
>>  NUM_FALSES
>> NUMBER
>>  LAST_ANALYZED
>> NOT NULL NUMBER
>>
>>
>>
>>  So in summary although column type STRUCT do exit, I don't think Hive
>> can cater for their statistics. Actually I don't think Oracle itself does
>> it.
>>
>> HTH
>>
>> P.S. I am on Hive 2 and it does not.
>>
>> hive> analyze table foo compute statistics for columns;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>> accepted but array<bigint> is passed.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de> wrote:
>>
>>> Hi there,
>>>
>>> you can reproduce the messages below with Hive 1.2.1.
>>>
>>> Best regards
>>> Michael
>>>
>>>
>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>>> wrote:
>>>
>>> which version of Hive are you using?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de> wrote:
>>>
>>>> Hi there,
>>>>
>>>>
>>>> when testing column statistics I stumbled upon the following error
>>>> message:
>>>>
>>>> DROP TABLE IF EXISTS foo;
>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>> STRUCT<key:STRING,value:STRING>);
>>>>
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>> accepted but array<bigint> is passed.
>>>>
>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>> accepted but struct<key:string,value:string> is passed.
>>>>
>>>>
>>>> 1) Basically, it seems that column statistics don't work for
>>>> non-primitive types. Are there any workarounds or any plans to change this?
>>>>
>>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>>> columns does not work as soon as there is a non-supported column. Are there
>>>> any plans to change this, so it is easier to compute statistics for all
>>>> supported columns?
>>>>
>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>> error message. Especially for wide tables it would be much easier if all
>>>> non-supported column *names* would be printed.
>>>>
>>>>
>>>> Best regards
>>>> Michael
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Mime
View raw message