On Thu, Apr 27, 2017 at 12:38 AM, 기준 <0ctopus13prime@gmail.com> wrote:
> Hi Jeszy!
>
> First of all, thanks for your kind reply.
>
> It took me some times to read the article you suggested,
> and tried to apply my environment.
>
> Question 1. Why column statistics have inaccurate value?
I think Jeszy's answer addressed why there can be inaccuracies.
> Question 2. After command `alter table analysis_data set
> tblproperties('numRows'='1001000000',
> 'STATS_GENERATED_VIA_STATS_TASK'='true');`
> Still '# Rows' has 1 value.
>
> ++++++
>  # Rows  Start Key  Stop Key  Leader Replica
>  # Replicas 
>  1   00000001  ${host1}  3 
>  1  00000001  00000002  ${host2}  3 
>  1  00000002  00000003  ${host3}  3 
>
>
> And another command `alter table partitioned_data partition(year=2009,
> month=4) set tblproperties ('numRows'='30000',
> 'STATS_GENERATED_VIA_STATS_TASK'='true');`
> does not applied kudu table.
These are indicating that there are no _perpartition_ statistics. You
set the table statistics, which is different and should have been
applied. There is not yet a way to get perpartition statistics for
Kudu  for now you'll always see 1 there.
Take a look at 'describe extended tbl' to see the num_rows property.
Also show column stats should work as well.
For example:
[localhost:21000] > compute stats tdata;
Query: compute stats tdata
++
 summary 
++
 Updated 1 partition(s) and 9 column(s). 
++
Fetched 1 row(s) in 0.51s
[localhost:21000] > show column stats tdata;
Query: show column stats tdata
+++++++
 Column  Type  #Distinct Values  #Nulls  Max Size  Avg Size 
+++++++
 id  INT  1494  1  4  4 
 valf  FLOAT  1  1  4  4 
 vali  BIGINT  1  1  8  8 
 valv  STRING  1  1  3  3 
 valb  BOOLEAN  2  1  1  1 
 valt  TINYINT  1  1  1  1 
 vals  SMALLINT  1  1  2  2 
 vald  DOUBLE  1  1  8  8 
 ts  TIMESTAMP  1  1  16  16 
+++++++
[localhost:21000] > describe extended tdata;
Query: describe extended tdata
++++
 name  type
 comment 
++++
 # col_name  data_type
 comment 
  NULL
 NULL 
...
  NULL
 NULL 
 # Detailed Table Information  NULL
 NULL 
 Database:  default
 NULL 
...
 Table Parameters:  NULL
 NULL 
  DO_NOT_UPDATE_STATS
 true 
  STATS_GENERATED_VIA_STATS_TASK
 true 
  kudu.master_addresses
 127.0.0.1 
  kudu.table_name
 impala::default.tdata 
 ==========>  numRows 
1500 
...
>
> Is there any command set rows information into kudu's stats?
You did. You set the number of rows on the table. As I said there is
no way to set partition stats, nor would it make any difference to the
planner yet. You can reference IMPALA2830 to track work related to
ImpalaKudu statistics in the future.
>
> Thanks! Have a nice day :)
>
>
>
>
>
> 20170425 20:41 GMT+09:00 Jeszy <jeszyb@gmail.com>:
>> Hey,
>>
>> The difference in the distinct values is expected, the estimation that
>> the NDV function
>> (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_ndv.html)
>> gives is good enough, and the execution is much faster. You can set
>> both the table and the column stats manually as described here:
>> https://www.cloudera.com/documentation/enterprise/latest/topics/impala_perf_stats.html#perf_table_stats_manual.
>>
>> From your question it seems you expected Impala to return the column
>> statisticsstored value for the distinct count. This is not possible
>> currently, the intent with these statistics is to help planning by
>> allowing Impala to come up with a more educated guess on join ordering
>> and cardinality  if you look at a summary like the one you pasted,
>> the estimated values (both #rows and peak mem) is influenced by stats.
>> There is no way to tell whether the stats are stale (only whether they
>> are computed or not).
>>
>> Does this answer your questions?
>>
>> 20170425 13:26 GMT+02:00 기준 <0ctopus13prime@gmail.com>:
>>> Hi!
>>>
>>> I'm using impalakudu currently.
>>>
>>> impala's version is v2.7.0
>>> kudu's version is 1.3
>>>
>>> I found out table statistics hint few days ago.
>>>
>>> So i tried compute statistics using command `compute stats`.
>>>
>>> After short time no errors shown my screen, but all the rows was 1.
>>>
>>> So i searched about this, then i could find this one.
>>> https://issues.apache.org/jira/browse/IMPALA2830
>>>
>>> Question 1. Can i manually set rows?
>>>
>>> And i found column statistics computed with wrong value.
>>>
>>> For example, some column's actual distinct value was 5092153,
>>> but command `show column stats ${table}` shows 5405440.
>>> (Similar other columns too)
>>>
>>> Question 2. Why this difference happens? Also, can i set value manually?
>>>
>>> And after all i'm not clear impala use this information during query
>>> processing.
>>>
>>> For example,
>>> Issued `SELECT COUNT(DISTINCT ${column}) FROM ${table}`,
>>> and i found impala scan from kudu using `summary` command.
>>>
>>> ++++++++++
>>>  Operator  #Hosts  Avg Time  Max Time  #Rows  Est. #Rows  Peak
>>> Mem  Est. Peak Mem  Detail 
>>> ++++++++++
>>>  06:AGGREGATE  1  121.09us  121.09us  1  1  64.00
>>> KB  1 B  FINALIZE 
>>>  05:EXCHANGE  1  61.82us  61.82us  12  1  0 B
>>>  1 B  UNPARTITIONED 
>>>  02:AGGREGATE  12  3.71ms  5.53ms  12  1  16.00
>>> KB  10.00 MB  
>>>  04:AGGREGATE  12  171.00ms  181.15ms  5.09M  5.41M 
>>> 154.58 MB  11.57 MB  
>>>  03:EXCHANGE  12  12.85ms  14.27ms  7.93M  5.41M  0 B
>>>  0 B  HASH(c) 
>>>  01:AGGREGATE  12  2.72s  4.80s  7.93M  5.41M 
>>> 170.08 MB  138.88 MB  STREAMING 
>>>  00:SCAN KUDU  12  991.95ms  5.38s  963.00M  963.00M  2.30
>>> MB  0 B  
>>> ++++++++++
>>>
>>> Why impala did not used column statistics information?
>>>
>>> Question 3. If i can set statistics value manually, can impala understands
>>> that?
>>> it seems impala do not use computed statistics information.
>>>
>>> I working on this, but it's hard to know more.
>>>
>>> Thanks! Have a nice day.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
