Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id A74F1200AF7 for ; Tue, 14 Jun 2016 22:03:26 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id A5E2E160A47; Tue, 14 Jun 2016 20:03:26 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 572B3160A06 for ; Tue, 14 Jun 2016 22:03:25 +0200 (CEST) Received: (qmail 11539 invoked by uid 500); 14 Jun 2016 20:03:23 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 11529 invoked by uid 99); 14 Jun 2016 20:03:23 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 14 Jun 2016 20:03:23 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 8E180C0E86 for ; Tue, 14 Jun 2016 20:03:23 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 3 X-Spam-Level: *** X-Spam-Status: No, score=3 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=2, KAM_LAZY_DOMAIN_SECURITY=1] autolearn=disabled Received: from mx2-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id W39krAX1XBiS for ; Tue, 14 Jun 2016 20:03:20 +0000 (UTC) Received: from mail.kalahimbra.net (mail.kalahimbra.net [138.201.151.60]) by mx2-lw-eu.apache.org (ASF Mail Server at mx2-lw-eu.apache.org) with ESMTPS id DB02E5F3A1 for ; Tue, 14 Jun 2016 20:03:19 +0000 (UTC) Received: by mail.kalahimbra.net (Postfix, from userid 1005) id F21E020293; Tue, 14 Jun 2016 22:03:01 +0200 (CEST) Received: from aditu.fritz.box (pD9E9BCF0.dip0.t-ipconnect.de [217.233.188.240]) by mail.kalahimbra.net (Postfix) with ESMTPSA id AF65920245 for ; Tue, 14 Jun 2016 22:03:01 +0200 (CEST) From: =?utf-8?Q?Michael_H=C3=A4usler?= Content-Type: multipart/alternative; boundary="Apple-Mail=_75A71316-778F-4E10-91DD-397536C1D2A0" Message-Id: <38E7D38A-93FC-49AA-A81C-47A1F29470B5@akatose.de> Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: column statistics for non-primitive types Date: Tue, 14 Jun 2016 22:03:11 +0200 References: <5144D9F8-53A0-4C22-A9AF-3F506F0F79B8@akatose.de> <4DED3455-7041-4D35-8C0F-012B7BE96D8C@akatose.de> To: user@hive.apache.org In-Reply-To: X-Mailer: Apple Mail (2.3124) archived-at: Tue, 14 Jun 2016 20:03:26 -0000 --Apple-Mail=_75A71316-778F-4E10-91DD-397536C1D2A0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 Hi there, there might be two topics here: 1) feasibility of stats for non-primitive columns 2) ease of use 1) feasibility of stats for non-primitive columns: Hive currently collects different kind of statistics for different kind = of types: numeric values: min, max, #nulls, #distincts boolean values: #nulls, #trues, #falses string values: #nulls, #distincts, avgLength, maxLength So, it seems quite possible to also collect at least partial stats for = top-level non-primitive columns, e.g.: array values: #nulls, #distincts, avgLength, maxLength=20 map values: #nulls, #distincts, avgLength, maxLength struct values: #nulls, #distincts union values: #nulls, #distincts 2) ease of use The presence of a single non-primitive column currently breaks the use = of the convenience shorthand to gather statistics for all columns = (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows = down adoption of column statistics for hive users. Best regards Michael > On 2016-06-14, at 12:04, Mich Talebzadeh = wrote: >=20 > Hi Michael, >=20 > Statistics for columns in Hive are kept in Hive metadata table = tab_col_stats. >=20 > When I am looking at this table in Oracle, I only see statistics for = primitives columns here. STRUCT columns do not have it as a STRUCT = column will have to be broken into its primitive columns. I don't think = Hive has the means to do that. >=20 > desc tab_col_stats; > Name = Null? Type > = ------------------------------------------------------------------------ = -------- ------------------------------------------------- > CS_ID = NOT NULL NUMBER > DB_NAME = NOT NULL VARCHAR2(128) > TABLE_NAME = NOT NULL VARCHAR2(128) > COLUMN_NAME = NOT NULL VARCHAR2(1000) > COLUMN_TYPE = NOT NULL VARCHAR2(128) > TBL_ID = NOT NULL NUMBER > LONG_LOW_VALUE = NUMBER > LONG_HIGH_VALUE = NUMBER > DOUBLE_LOW_VALUE = NUMBER > DOUBLE_HIGH_VALUE = NUMBER > BIG_DECIMAL_LOW_VALUE = VARCHAR2(4000) > BIG_DECIMAL_HIGH_VALUE = VARCHAR2(4000) > NUM_NULLS = NOT NULL NUMBER > NUM_DISTINCTS = NUMBER > AVG_COL_LEN = NUMBER > MAX_COL_LEN = NUMBER > NUM_TRUES = NUMBER > NUM_FALSES = NUMBER > LAST_ANALYZED = NOT NULL NUMBER >=20 >=20 >=20 > So in summary although column type STRUCT do exit, I don't think Hive = can cater for their statistics. Actually I don't think Oracle itself = does it. >=20 > HTH >=20 > P.S. I am on Hive 2 and it does not. >=20 > hive> analyze table foo compute statistics for columns; > FAILED: UDFArgumentTypeException Only primitive type arguments are = accepted but array is passed. >=20 >=20 > Dr Mich Talebzadeh > =20 > LinkedIn = https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6zP6AcPCCdO= ABUrV8Pw = > =20 > http://talebzadehmich.wordpress.com = > =20 >=20 > On 14 June 2016 at 09:57, Michael H=C3=A4usler > wrote: > Hi there, >=20 > you can reproduce the messages below with Hive 1.2.1. >=20 > Best regards > Michael >=20 >=20 >> On 2016-06-13, at 22:21, Mich Talebzadeh > wrote: >>=20 >> which version of Hive are you using? >>=20 >> Dr Mich Talebzadeh >> =20 >> LinkedIn = https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6zP6AcPCCdO= ABUrV8Pw = >> =20 >> http://talebzadehmich.wordpress.com = >> =20 >>=20 >> On 13 June 2016 at 16:00, Michael H=C3=A4usler > wrote: >> Hi there, >>=20 >>=20 >> when testing column statistics I stumbled upon the following error = message: >>=20 >> DROP TABLE IF EXISTS foo; >> CREATE TABLE foo (foo BIGINT, bar ARRAY, foobar = STRUCT); >>=20 >> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS; >> FAILED: UDFArgumentTypeException Only primitive type arguments are = accepted but array is passed. >>=20 >> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar; >> FAILED: UDFArgumentTypeException Only primitive type arguments are = accepted but struct is passed. >>=20 >>=20 >> 1) Basically, it seems that column statistics don't work for = non-primitive types. Are there any workarounds or any plans to change = this? >>=20 >> 2) Furthermore, the convenience syntax to compute statistics for all = columns does not work as soon as there is a non-supported column. Are = there any plans to change this, so it is easier to compute statistics = for all supported columns? >>=20 >> 3) ANALYZE TABLE will only provide the first failing *type* in the = error message. Especially for wide tables it would be much easier if all = non-supported column *names* would be printed. >>=20 >>=20 >> Best regards >> Michael >>=20 >>=20 >=20 >=20 --Apple-Mail=_75A71316-778F-4E10-91DD-397536C1D2A0 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8 Hi there,

there might be two topics here:

1) feasibility of stats for = non-primitive columns
2) ease of use


1) feasibility of stats for non-primitive columns:

Hive currently collects = different kind of statistics for different kind of types:
numeric values: min, max, #nulls, = #distincts
boolean values: #nulls, = #trues, #falses
string values: = #nulls, #distincts, avgLength, maxLength

So, it seems quite possible to also = collect at least partial stats for top-level non-primitive columns, = e.g.:
array values: #nulls, #distincts, = avgLength, maxLength 
map values: = #nulls, #distincts, avgLength, maxLength
struct values: #nulls, = #distincts
union values: = #nulls, #distincts


2) ease of use

The presence of a single = non-primitive column currently breaks the use of the convenience = shorthand to gather statistics for all columns (ANALYZE TABLE foo = COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of = column statistics for hive users.

Best regards
Michael



On = 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

Hi Michael,

Statistics for columns in Hive are kept = in Hive metadata table tab_col_stats.

When I am looking at this table in = Oracle, I only see statistics for primitives columns here. STRUCT = columns do not have it as a STRUCT column will have to be broken into = its primitive columns.  I don't think Hive has the means to do = that.

desc = tab_col_stats;
 Name         = ;            &= nbsp;           &nb= sp;            = ;            &= nbsp;          = Null?    Type
 ---------------------------------------------------------= --------------- -------- = -------------------------------------------------
 CS_ID        &nbs= p;            =             &n= bsp;           &nbs= p;            =           NOT NULL = NUMBER
 DB_NAME        &n= bsp;           &nbs= p;            =             &n= bsp;           &nbs= p;        NOT NULL VARCHAR2(128)
 TABLE_NAME        = ;            &= nbsp;           &nb= sp;            = ;            &= nbsp;     NOT NULL VARCHAR2(128)
 COLUMN_NAME       &nbs= p;            =             &n= bsp;           &nbs= p;            =      NOT NULL VARCHAR2(1000)
 COLUMN_TYPE       &nbs= p;            =             &n= bsp;           &nbs= p;            =      NOT NULL VARCHAR2(128)
 TBL_ID        &nb= sp;            = ;            &= nbsp;           &nb= sp;            = ;         NOT NULL NUMBER
 LONG_LOW_VALUE       &= nbsp;           &nb= sp;            = ;            &= nbsp;           &nb= sp;           = NUMBER
 LONG_HIGH_VALUE       =             &n= bsp;           &nbs= p;            =             &n= bsp;          NUMBER
 DOUBLE_LOW_VALUE       = ;            &= nbsp;           &nb= sp;            = ;            &= nbsp;         NUMBER
 DOUBLE_HIGH_VALUE      &nbs= p;            =             &n= bsp;           &nbs= p;            =          NUMBER
 BIG_DECIMAL_LOW_VALUE      =             &n= bsp;           &nbs= p;            =             &n= bsp;     VARCHAR2(4000)
 BIG_DECIMAL_HIGH_VALUE      = ;            &= nbsp;           &nb= sp;            = ;            &= nbsp;    VARCHAR2(4000)
 NUM_NULLS        =             &n= bsp;           &nbs= p;            =             &n= bsp;      NOT NULL NUMBER
 NUM_DISTINCTS       &n= bsp;           &nbs= p;            =             &n= bsp;           &nbs= p;            = NUMBER
 AVG_COL_LEN       &nbs= p;            =             &n= bsp;           &nbs= p;            =             &n= bsp; NUMBER
 MAX_COL_LEN       &nbs= p;            =             &n= bsp;           &nbs= p;            =             &n= bsp; NUMBER
 NUM_TRUES        =             &n= bsp;           &nbs= p;            =             &n= bsp;           &nbs= p;   NUMBER
 NUM_FALSES        = ;            &= nbsp;           &nb= sp;            = ;            &= nbsp;           &nb= sp;  NUMBER
 LAST_ANALYZED       &n= bsp;           &nbs= p;            =             &n= bsp;           &nbs= p;   NOT NULL NUMBER




 So in summary = although column type STRUCT do exit, I don't think Hive can cater for = their statistics. Actually I don't think Oracle itself does = it.

HTH

P.S. I am on Hive 2 and it does not.

hive> analyze table foo = compute statistics for columns;
FAILED: = UDFArgumentTypeException Only primitive type arguments are accepted but = array<bigint> is passed.



On 14 June 2016 at 09:57, = Michael H=C3=A4usler <michael@akatose.de> wrote:
Hi there,

you can reproduce the = messages below with Hive 1.2.1.

Best regards
Michael


On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

which version of = Hive are you using?


On 13 June 2016 at 16:00, = Michael H=C3=A4usler <michael@akatose.de> wrote:
Hi there,


when testing column statistics I stumbled upon the following error = message:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar = STRUCT<key:STRING,value:STRING>);

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
FAILED: UDFArgumentTypeException Only primitive type arguments are = accepted but array<bigint> is passed.

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
FAILED: UDFArgumentTypeException Only primitive type arguments are = accepted but struct<key:string,value:string> is passed.


1) Basically, it seems that column statistics don't work for = non-primitive types. Are there any workarounds or any plans to change = this?

2) Furthermore, the convenience syntax to compute statistics for all = columns does not work as soon as there is a non-supported column. Are = there any plans to change this, so it is easier to compute statistics = for all supported columns?

3) ANALYZE TABLE will only provide the first failing *type* in the error = message. Especially for wide tables it would be much easier if all = non-supported column *names* would be printed.


Best regards
Michael
=




= --Apple-Mail=_75A71316-778F-4E10-91DD-397536C1D2A0--