Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 36F09200C63 for ; Thu, 11 May 2017 21:54:44 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 35825160BC7; Thu, 11 May 2017 19:54:44 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 2E253160BB3 for ; Thu, 11 May 2017 21:54:43 +0200 (CEST) Received: (qmail 66111 invoked by uid 500); 11 May 2017 19:54:42 -0000 Mailing-List: contact user-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@impala.incubator.apache.org Delivered-To: mailing list user@impala.incubator.apache.org Received: (qmail 66100 invoked by uid 99); 11 May 2017 19:54:42 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 11 May 2017 19:54:42 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id E8D14C03A8 for ; Thu, 11 May 2017 19:54:41 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.279 X-Spam-Level: * X-Spam-Status: No, score=1.279 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, KAM_ASCII_DIVIDERS=0.8, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RCVD_IN_SORBS_SPAM=0.5, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=cloudera-com.20150623.gappssmtp.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id 0L_bvutTt1-b for ; Thu, 11 May 2017 19:54:38 +0000 (UTC) Received: from mail-it0-f48.google.com (mail-it0-f48.google.com [209.85.214.48]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 87E085FB4E for ; Thu, 11 May 2017 19:54:37 +0000 (UTC) Received: by mail-it0-f48.google.com with SMTP id o5so50531734ith.1 for ; Thu, 11 May 2017 12:54:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cloudera-com.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-transfer-encoding; bh=aR2uj9xnwV0IsMma3myjaFHbU1iJcBXumYAyu5q3yv8=; b=rmzUpHYt4rTjYdKIyuFlW9rVynfZcONQWSISQk12QvDbv1rwzYrsWvjerpohwnX6cT e6PhM3Z0IVuR21M7jPQkc6fsiXuiaEO+ExSTNAK91ZGO1lDTF2b2GIH07NpsU2Sm7J9x /csyWvz7Lx+BG4ulamfMZKiPAYdFR6ch+8yA+gMB5ecU+ThzOtSeJD0z0uOOkbWkP81C /Ea01M8pPrdbxghBH44bN8q/4znk5Fme4ImdDT8N72OQTDKb7kzelbVLiyUi7B0ZzdLm bLCLcuqvhmKNGkEuf7sqrKp4jQpEwXkeohCy+f/xJPwoOSnZ5Ss1taxNj4gsZJTlXwFJ d7zw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:content-transfer-encoding; bh=aR2uj9xnwV0IsMma3myjaFHbU1iJcBXumYAyu5q3yv8=; b=ooozfWwoFyFKht955djyzVfmCi/1XvQP+NVHAAHr5CbbCBNR3973/2AghO7yTN4/Vl r0EOlBLMY0fQJnkWof49M8OFV50X55m2l3UkDgk9MqfL7Aug4Mh644PxCkRb/iGQ8Q1F XTvvpS0POPBOpZanEc3Q1MbZ6J+WGPcS2jq43fUAvdj52+jubtJ+p7qE9vc2vWKspqE+ M9tBy9u9eE76uG8FrdPs94+37RqF1OMqdadlN+LiKBzQPDDtXgSctWNfXg3Dvj0hO4N/ TJoy8+zPeVmhVFGtUKYy42QEWeTECWosoeSJvqWWORPzMoOFT++vR827aoHmSh8JGUMY jewQ== X-Gm-Message-State: AODbwcBAn2LCEPq7qnawcQBXR3G25cwdT4uGKH6XlWtexUUQwZx+NIFk jPsEZFXVbRiBG/lRUjXR7dZ6eObpgwsPR/k= X-Received: by 10.36.92.84 with SMTP id q81mr622210itb.46.1494532474680; Thu, 11 May 2017 12:54:34 -0700 (PDT) MIME-Version: 1.0 Received: by 10.107.143.139 with HTTP; Thu, 11 May 2017 12:54:12 -0700 (PDT) In-Reply-To: References: From: Matthew Jacobs Date: Thu, 11 May 2017 12:54:12 -0700 Message-ID: Subject: Re: Enable Impala-kudu table, column stats. To: user@impala.incubator.apache.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable archived-at: Thu, 11 May 2017 19:54:44 -0000 On Thu, Apr 27, 2017 at 12:38 AM, =EA=B8=B0=EC=A4=80 <0ctopus13prime@gmail.= com> wrote: > Hi Jeszy! > > First of all, thanks for your kind reply. > > It took me some times to read the article you suggested, > and tried to apply my environment. > > Question 1. Why column statistics have inaccurate value? I think Jeszy's answer addressed why there can be inaccuracies. > Question 2. After command `alter table analysis_data set > tblproperties('numRows'=3D'1001000000', > 'STATS_GENERATED_VIA_STATS_TASK'=3D'true');` > Still '# Rows' has -1 value. > > +--------+-----------+----------+--------------------------------------+-= -----------+ > | # Rows | Start Key | Stop Key | Leader Replica > | # Replicas | > | -1 | | 00000001 | ${host1} | 3 | > | -1 | 00000001 | 00000002 | ${host2} | 3 | > | -1 | 00000002 | 00000003 | ${host3} | 3 | > > > And another command `alter table partitioned_data partition(year=3D2009, > month=3D4) set tblproperties ('numRows'=3D'30000', > 'STATS_GENERATED_VIA_STATS_TASK'=3D'true');` > does not applied kudu table. These are indicating that there are no _per-partition_ statistics. You set the table statistics, which is different and should have been applied. There is not yet a way to get per-partition statistics for Kudu -- for now you'll always see -1 there. Take a look at 'describe extended tbl' to see the num_rows property. Also show column stats should work as well. For example: [localhost:21000] > compute stats tdata; Query: compute stats tdata +-----------------------------------------+ | summary | +-----------------------------------------+ | Updated 1 partition(s) and 9 column(s). | +-----------------------------------------+ Fetched 1 row(s) in 0.51s [localhost:21000] > show column stats tdata; Query: show column stats tdata +--------+-----------+------------------+--------+----------+----------+ | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | +--------+-----------+------------------+--------+----------+----------+ | id | INT | 1494 | -1 | 4 | 4 | | valf | FLOAT | 1 | -1 | 4 | 4 | | vali | BIGINT | 1 | -1 | 8 | 8 | | valv | STRING | 1 | -1 | 3 | 3 | | valb | BOOLEAN | 2 | -1 | 1 | 1 | | valt | TINYINT | 1 | -1 | 1 | 1 | | vals | SMALLINT | 1 | -1 | 2 | 2 | | vald | DOUBLE | 1 | -1 | 8 | 8 | | ts | TIMESTAMP | 1 | -1 | 16 | 16 | +--------+-----------+------------------+--------+----------+----------+ [localhost:21000] > describe extended tdata; Query: describe extended tdata +------------------------------+-------------------------------------------= ------+-------------------------------------------+ | name | type | comment | +------------------------------+-------------------------------------------= ------+-------------------------------------------+ | # col_name | data_type | comment | | | NULL | NULL | ... | | NULL | NULL | | # Detailed Table Information | NULL | NULL | | Database: | default | NULL | ... | Table Parameters: | NULL | NULL | | | DO_NOT_UPDATE_STATS | true | | | STATS_GENERATED_VIA_STATS_TASK | true | | | kudu.master_addresses | 127.0.0.1 | | | kudu.table_name | impala::default.tdata | | =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D> | numRows = | 1500 | ... > > Is there any command set rows information into kudu's stats? You did. You set the number of rows on the table. As I said there is no way to set partition stats, nor would it make any difference to the planner yet. You can reference IMPALA-2830 to track work related to Impala-Kudu statistics in the future. > > Thanks! Have a nice day :) > > > > > > 2017-04-25 20:41 GMT+09:00 Jeszy : >> Hey, >> >> The difference in the distinct values is expected, the estimation that >> the NDV function >> (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_= ndv.html) >> gives is good enough, and the execution is much faster. You can set >> both the table and the column stats manually as described here: >> https://www.cloudera.com/documentation/enterprise/latest/topics/impala_p= erf_stats.html#perf_table_stats_manual. >> >> From your question it seems you expected Impala to return the column >> statistics-stored value for the distinct count. This is not possible >> currently, the intent with these statistics is to help planning by >> allowing Impala to come up with a more educated guess on join ordering >> and cardinality - if you look at a summary like the one you pasted, >> the estimated values (both #rows and peak mem) is influenced by stats. >> There is no way to tell whether the stats are stale (only whether they >> are computed or not). >> >> Does this answer your questions? >> >> 2017-04-25 13:26 GMT+02:00 =EA=B8=B0=EC=A4=80 <0ctopus13prime@gmail.com>= : >>> Hi! >>> >>> I'm using impala-kudu currently. >>> >>> impala's version is v2.7.0 >>> kudu's version is 1.3 >>> >>> I found out table statistics hint few days ago. >>> >>> So i tried compute statistics using command `compute stats`. >>> >>> After short time no errors shown my screen, but all the rows was -1. >>> >>> So i searched about this, then i could find this one. >>> https://issues.apache.org/jira/browse/IMPALA-2830 >>> >>> Question 1. Can i manually set rows? >>> >>> And i found column statistics computed with wrong value. >>> >>> For example, some column's actual distinct value was 5092153, >>> but command `show column stats ${table}` shows 5405440. >>> (Similar other columns too) >>> >>> Question 2. Why this difference happens? Also, can i set value manually= ? >>> >>> And after all i'm not clear impala use this information during query >>> processing. >>> >>> For example, >>> Issued `SELECT COUNT(DISTINCT ${column}) FROM ${table}`, >>> and i found impala scan from kudu using `summary` command. >>> >>> +--------------+--------+----------+----------+---------+------------+-= ----------+---------------+---------------+ >>> | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | = Peak >>> Mem | Est. Peak Mem | Detail | >>> +--------------+--------+----------+----------+---------+------------+-= ----------+---------------+---------------+ >>> | 06:AGGREGATE | 1 | 121.09us | 121.09us | 1 | 1 | = 64.00 >>> KB | -1 B | FINALIZE | >>> | 05:EXCHANGE | 1 | 61.82us | 61.82us | 12 | 1 | = 0 B >>> | -1 B | UNPARTITIONED | >>> | 02:AGGREGATE | 12 | 3.71ms | 5.53ms | 12 | 1 | = 16.00 >>> KB | 10.00 MB | | >>> | 04:AGGREGATE | 12 | 171.00ms | 181.15ms | 5.09M | 5.41M | >>> 154.58 MB | 11.57 MB | | >>> | 03:EXCHANGE | 12 | 12.85ms | 14.27ms | 7.93M | 5.41M | = 0 B >>> | 0 B | HASH(c) | >>> | 01:AGGREGATE | 12 | 2.72s | 4.80s | 7.93M | 5.41M | >>> 170.08 MB | 138.88 MB | STREAMING | >>> | 00:SCAN KUDU | 12 | 991.95ms | 5.38s | 963.00M | 963.00M | = 2.30 >>> MB | 0 B | | >>> +--------------+--------+----------+----------+---------+------------+-= ----------+---------------+---------------+ >>> >>> Why impala did not used column statistics information? >>> >>> Question 3. If i can set statistics value manually, can impala understa= nds >>> that? >>> it seems impala do not use computed statistics information. >>> >>> I working on this, but it's hard to know more. >>> >>> Thanks! Have a nice day. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>>