hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pengcheng Xiong <pxi...@apache.org>
Subject Re: column statistics for non-primitive types
Date Tue, 14 Jun 2016 21:05:17 GMT
Hi Mich,

    I agree with you that column stats gathering in Hive is not cheap and
comes with overheads. This is due to the large volume of data that Hive has
to process. However, this is the price you have to pay anyway even with
current "analyze table" solution.

   The new feature not only provides a way to make users have column stats
automatically, but also saves overhead for the "insert into" case. In this
case, the new column stats are generated incrementally, i.e., by merging
with the existing stats. Without this feature, you have to scan the whole
table and compute stats.

   In conclusion, this new feature should not have any more overhead than
the current solution.

Best
Pengcheng

On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> hi,
>
> (2) There is a configuration "hive.stats.fetch.column.stats". If you set
> it to true, it will automatically collect column stats for you when you
> insert into/overwrite a new table. You can refer to HIVE-11160 for more
> details.
>
> Not without its overheads.
>
> Automatic gather stats is not new. Has been around for a good time in
> RDBMS and can impact the performance of other queries running. So I am not
> sure it can be considered as blessing.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 21:25, Pengcheng Xiong <pxiong@apache.org> wrote:
>
>> Exactly, "the useful or meaningful these statistics is going to be" (The
>> motivation behind).
>>
>> Best
>> Pengcheng
>>
>>
>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> My point was we are where we are and in this juncture there is no
>>> collection of statistics for complex columns. That may be a future
>>> enhancement.
>>>
>>> But then the obvious question is how useful or meaningful these
>>> statistics is going to be?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 June 2016 at 21:03, Michael Häusler <michael@akatose.de> wrote:
>>>
>>>> Hi there,
>>>>
>>>> there might be two topics here:
>>>>
>>>> 1) feasibility of stats for non-primitive columns
>>>> 2) ease of use
>>>>
>>>>
>>>> 1) feasibility of stats for non-primitive columns:
>>>>
>>>> Hive currently collects different kind of statistics for different kind
>>>> of types:
>>>> numeric values: min, max, #nulls, #distincts
>>>> boolean values: #nulls, #trues, #falses
>>>> string values: #nulls, #distincts, avgLength, maxLength
>>>>
>>>> So, it seems quite possible to also collect at least partial stats for
>>>> top-level non-primitive columns, e.g.:
>>>> array values: #nulls, #distincts, avgLength, maxLength
>>>> map values: #nulls, #distincts, avgLength, maxLength
>>>> struct values: #nulls, #distincts
>>>> union values: #nulls, #distincts
>>>>
>>>>
>>>> 2) ease of use
>>>>
>>>> The presence of a single non-primitive column currently breaks the use
>>>> of the convenience shorthand to gather statistics for all columns (ANALYZE
>>>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>>>> of column statistics for hive users.
>>>>
>>>> Best regards
>>>> Michael
>>>>
>>>>
>>>>
>>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> Statistics for columns in Hive are kept in Hive metadata table
>>>> tab_col_stats.
>>>>
>>>> When I am looking at this table in Oracle, I only see statistics for
>>>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>>>> will have to be broken into its primitive columns.  I don't think Hive has
>>>> the means to do that.
>>>>
>>>> desc tab_col_stats;
>>>>  Name
>>>> Null?    Type
>>>>  ------------------------------------------------------------------------
>>>> -------- -------------------------------------------------
>>>>  CS_ID
>>>> NOT NULL NUMBER
>>>>  DB_NAME
>>>> NOT NULL VARCHAR2(128)
>>>>  TABLE_NAME
>>>> NOT NULL VARCHAR2(128)
>>>>  COLUMN_NAME
>>>> NOT NULL VARCHAR2(1000)
>>>>  COLUMN_TYPE
>>>> NOT NULL VARCHAR2(128)
>>>>  TBL_ID
>>>> NOT NULL NUMBER
>>>>  LONG_LOW_VALUE
>>>> NUMBER
>>>>  LONG_HIGH_VALUE
>>>> NUMBER
>>>>  DOUBLE_LOW_VALUE
>>>> NUMBER
>>>>  DOUBLE_HIGH_VALUE
>>>> NUMBER
>>>>  BIG_DECIMAL_LOW_VALUE
>>>> VARCHAR2(4000)
>>>>  BIG_DECIMAL_HIGH_VALUE
>>>> VARCHAR2(4000)
>>>>  NUM_NULLS
>>>> NOT NULL NUMBER
>>>>  NUM_DISTINCTS
>>>> NUMBER
>>>>  AVG_COL_LEN
>>>> NUMBER
>>>>  MAX_COL_LEN
>>>> NUMBER
>>>>  NUM_TRUES
>>>> NUMBER
>>>>  NUM_FALSES
>>>> NUMBER
>>>>  LAST_ANALYZED
>>>> NOT NULL NUMBER
>>>>
>>>>
>>>>
>>>>  So in summary although column type STRUCT do exit, I don't think Hive
>>>> can cater for their statistics. Actually I don't think Oracle itself does
>>>> it.
>>>>
>>>> HTH
>>>>
>>>> P.S. I am on Hive 2 and it does not.
>>>>
>>>> hive> analyze table foo compute statistics for columns;
>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>> accepted but array<bigint> is passed.
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 14 June 2016 at 09:57, Michael Häusler <michael@akatose.de> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> you can reproduce the messages below with Hive 1.2.1.
>>>>>
>>>>> Best regards
>>>>> Michael
>>>>>
>>>>>
>>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>>>>> wrote:
>>>>>
>>>>> which version of Hive are you using?
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 13 June 2016 at 16:00, Michael Häusler <michael@akatose.de>
wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>>
>>>>>> when testing column statistics I stumbled upon the following error
>>>>>> message:
>>>>>>
>>>>>> DROP TABLE IF EXISTS foo;
>>>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>>>> STRUCT<key:STRING,value:STRING>);
>>>>>>
>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>> accepted but array<bigint> is passed.
>>>>>>
>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>> accepted but struct<key:string,value:string> is passed.
>>>>>>
>>>>>>
>>>>>> 1) Basically, it seems that column statistics don't work for
>>>>>> non-primitive types. Are there any workarounds or any plans to change
this?
>>>>>>
>>>>>> 2) Furthermore, the convenience syntax to compute statistics for
all
>>>>>> columns does not work as soon as there is a non-supported column.
Are there
>>>>>> any plans to change this, so it is easier to compute statistics for
all
>>>>>> supported columns?
>>>>>>
>>>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>>>> error message. Especially for wide tables it would be much easier
if all
>>>>>> non-supported column *names* would be printed.
>>>>>>
>>>>>>
>>>>>> Best regards
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Mime
View raw message