hive-dev mailing list archives

From "Prasanth J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5936) analyze command failing to collect stats with counter mechanism
Date Thu, 05 Dec 2013 01:59:36 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839684#comment-13839684 ]

Prasanth J commented on HIVE-5936:
----------------------------------

Even ROW_COUNT and RAW_DATA_SIZE are not reliable. The following sequence of operations
illustrates this:
{code}
hive> create table test (key string, value string);
OK
Time taken: 0.069 seconds
hive> load data local inpath '/work/hive/trunk/hive-git/data/files/kv1.txt' into table test;
Copying data from file:/work/hive/trunk/hive-git/data/files/kv1.txt
Copying file: file:/work/hive/trunk/hive-git/data/files/kv1.txt
Loading data to table default.test
Table default.test stats: [numFiles, numRows, totalSize, rawDataSize]
OK
Time taken: 0.231 seconds
hive> desc formatted test;
OK
# col_name            	data_type           	comment             
	 	 
key                 	string              	None                
value               	string              	None                
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	pjayachandran       	 
CreateTime:         	Wed Dec 04 17:31:32 PST 2013	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	file:/tmp/warehouse/test	 
Table Type:         	MANAGED_TABLE       	 
Table Parameters:	 	 
	COLUMN_STATS_ACCURATE	true                
	numFiles            	1                   
	numRows             	0                   
	rawDataSize         	0                   
	totalSize           	5812                
	transient_lastDdlTime	1386207121          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	serialization.format	1                   
Time taken: 0.094 seconds, Fetched: 32 row(s)
hive> drop table test;
OK
Time taken: 0.423 seconds
hive> set hive.stats.autogather=false;
hive> create table test (key string, value string);
OK
Time taken: 0.03 seconds
hive> load data local inpath '/work/hive/trunk/hive-git/data/files/kv1.txt' into table test;
Copying data from file:/work/hive/trunk/hive-git/data/files/kv1.txt
Copying file: file:/work/hive/trunk/hive-git/data/files/kv1.txt
Loading data to table default.test
OK
Time taken: 0.097 seconds
hive> desc formatted test;
OK
# col_name            	data_type           	comment             
	 	 
key                 	string              	None                
value               	string              	None                
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	pjayachandran       	 
CreateTime:         	Wed Dec 04 17:32:29 PST 2013	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	file:/tmp/warehouse/test	 
Table Type:         	MANAGED_TABLE       	 
Table Parameters:	 	 
	COLUMN_STATS_ACCURATE	false               
	numFiles            	1                   
	numRows             	-1                  
	rawDataSize         	-1                  
	totalSize           	5812                
	transient_lastDdlTime	1386207152          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	serialization.format	1                   
Time taken: 0.061 seconds, Fetched: 32 row(s)
hive> set hive.stats.collect.rawdatasize=false;
hive> analyze table test compute statistics;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
Listening for transport dt_socket at address: 65378
2013-12-04 17:35:55.379 java[81428:1003] Unable to load realm info from SCDynamicStore
Execution log at: /var/folders/2w/4x52xg597k50_bt27x3_k9tw0000gn/T//pjayachandran/pjayachandran_20131204173535_82f7e5c3-0016-4a63-a89c-e07b6ed07ab4.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-12-04 17:35:57,347 null map = 0%,  reduce = 0%
2013-12-04 17:36:14,366 null map = 100%,  reduce = 0%
Ended Job = job_local124477567_0001
Execution completed successfully
MapredLocal task succeeded
Table default.test stats: [numFiles, numRows, totalSize, rawDataSize]
OK
Time taken: 36.769 seconds
hive> desc formatted test;                     
OK
# col_name            	data_type           	comment             
	 	 
key                 	string              	None                
value               	string              	None                
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	pjayachandran       	 
CreateTime:         	Wed Dec 04 17:32:29 PST 2013	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	file:/tmp/warehouse/test	 
Table Type:         	MANAGED_TABLE       	 
Table Parameters:	 	 
	COLUMN_STATS_ACCURATE	true                
	numFiles            	1                   
	numRows             	500                 
	rawDataSize         	0                   
	totalSize           	5812                
	transient_lastDdlTime	1386207374          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	serialization.format	1                   
Time taken: 0.064 seconds, Fetched: 32 row(s)
hive> 
{code}

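As seen above, the statistics differ depending on whether hive.stats.autogather is enabled.
With it enabled, LOAD DATA leaves numRows and rawDataSize at 0 while COLUMN_STATS_ACCURATE
is true. With it disabled, both are -1 and COLUMN_STATS_ACCURATE is false. Only after ANALYZE
does numRows become 500, and rawDataSize stays 0 with hive.stats.collect.rawdatasize set to
false. Also, not all SerDes support RAW_DATA_SIZE. AFAIK, only LazySimpleSerDe and ORC support
it: LazySimpleSerDe during both INSERT and ANALYZE, but ORC only during INSERT. Since there
are multiple code paths through which stats can be updated, I do not think RAW_DATA_SIZE and
ROW_COUNT are always reliable.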

The following code segment was removed in HIVE-5921:
{code}
if (nr < 0) {
  nr = 0;
}
{code}
Instead, if ROW_COUNT is <= 0, the number of rows is estimated from the average row
size computed from the schema:
{code}
if (nr <= 0) {
  nr = 0;
  int avgRowSize = estimateRowSizeFromSchema(conf, schema, neededColumns);
  if (avgRowSize > 0) {
    nr = ds / avgRowSize;
  }
}
{code}
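
To make the fallback concrete, here is a minimal standalone sketch of the same idea. It is not Hive's actual code: the class name, the method name, and the 12-byte average row size are assumptions made up for illustration, and the real schema-based estimate comes from estimateRowSizeFromSchema as shown above.

```java
// Hypothetical, self-contained illustration of the fallback described above.
// None of these names exist in Hive; they only mirror the logic of the snippet.
final class RowCountEstimate {

    // numRows:    row count as stored in the metastore (may be 0 or -1)
    // dataSize:   size of the data in bytes
    // avgRowSize: average row size in bytes (assumed to be derived from the schema)
    static long estimateRowCount(long numRows, long dataSize, int avgRowSize) {
        if (numRows > 0) {
            return numRows;               // trust a positive metastore value
        }
        if (avgRowSize > 0) {
            return dataSize / avgRowSize; // estimate from data size
        }
        return 0;                         // nothing to estimate from
    }

    public static void main(String[] args) {
        // 5812 is the totalSize from the session above; 12 is an assumed row size.
        System.out.println(estimateRowCount(-1, 5812, 12));
        System.out.println(estimateRowCount(500, 5812, 12));
    }
}
```

Note that this treats numRows = 0 (autogather enabled, after LOAD DATA) and numRows = -1 (autogather disabled) the same way, which is why the estimate does not depend on which code path last touched the stats.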

Another subtask, HIVE-5949, will add a flag indicating whether the statistics are accurate
(all statistics come from the metastore) or estimated.

> analyze command failing to collect stats with counter mechanism
> ---------------------------------------------------------------
>
>                 Key: HIVE-5936
>                 URL: https://issues.apache.org/jira/browse/HIVE-5936
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>    Affects Versions: 0.13.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Navis
>         Attachments: HIVE-5936.1.patch.txt, HIVE-5936.2.patch.txt
>
>
> With counter mechanism, MR job is successful, but StatsTask on client fails with NPE.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
