hive-dev mailing list archives

From "Mostafa Mokhtar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-8292) Reading from partitioned bucketed tables has high overhead in MapOperator.cleanUpInputFileChangedOp
Date Mon, 29 Sep 2014 18:53:34 GMT

     [ https://issues.apache.org/jira/browse/HIVE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mostafa Mokhtar updated HIVE-8292:
----------------------------------
    Description: 
Reading from bucketed, partitioned tables has significantly higher overhead compared to non-bucketed, non-partitioned tables.


20% of the profile is spent in MapOperator.cleanUpInputFileChangedOp

5% of the CPU is spent in
{code}
 Path onepath = normalizePath(onefile);
{code}

and another 15% in
{code}
 onepath.toUri().relativize(fpath.toUri()).equals(fpath.toUri());
{code}

From the profiler:
{code}
Stack Trace	Sample Count	Percentage(%)
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(Object)	978	28.613
   org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(Writable)	978	28.613
      org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged()	866	25.336
         org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp()	866	25.336
            java.net.URI.relativize(URI)	655	19.163
               java.net.URI.relativize(URI, URI)	655	19.163
                  java.net.URI.normalize(String)	517	15.126
                     java.net.URI.needsNormalization(String)	372	10.884
                        java.lang.String.charAt(int)	235	6.875
                  java.net.URI.equal(String, String)	27	0.79
                  java.lang.StringBuilder.toString()	1	0.029
                  java.lang.StringBuilder.<init>()	1	0.029
                  java.lang.StringBuilder.append(String)	1	0.029
            org.apache.hadoop.hive.ql.exec.MapOperator.normalizePath(String)	167	4.886
               org.apache.hadoop.fs.Path.<init>(String)	162	4.74
                  org.apache.hadoop.fs.Path.initialize(String, String, String, String)	162	4.74
                     org.apache.hadoop.fs.Path.normalizePath(String, String)	97	2.838
                        org.apache.commons.lang.StringUtils.replace(String, String, String)	97	2.838
                           org.apache.commons.lang.StringUtils.replace(String, String, String, int)	97	2.838
                              java.lang.String.indexOf(String, int)	97	2.838
                     java.net.URI.<init>(String, String, String, String, String)	65	1.902
{code}
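
For illustration only, here is a self-contained sketch of the kind of check cleanUpInputFileChangedOp runs for every registered path whenever the input file changes (the paths below are hypothetical, not taken from this issue). java.net.URI.relativize() re-normalizes both URIs on every call, which is where the needsNormalization/charAt samples above come from; with many partitions and buckets the loop runs often and the cost adds up.

{code}
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone example, not Hive code: match the current input file
// against each candidate directory the way the hot loop above does.
public class RelativizeCheckDemo {
  public static void main(String[] args) {
    // Hypothetical partition directory URIs to check against.
    List<URI> partitionDirs = Arrays.asList(
        URI.create("hdfs://nn/warehouse/t/ds=2014-09-01/"),
        URI.create("hdfs://nn/warehouse/t/ds=2014-09-02/"));

    // Hypothetical current input file (one bucket file inside a partition).
    URI fpath = URI.create("hdfs://nn/warehouse/t/ds=2014-09-02/000001_0");

    for (URI onepath : partitionDirs) {
      // URI.relativize() returns its argument unchanged when fpath is NOT under
      // onepath, so "result equals fpath" means "no match". Each call normalizes
      // both URIs character by character.
      boolean noMatch = onepath.relativize(fpath).equals(fpath);
      System.out.println(onepath + " -> match: " + !noMatch);
    }
  }
}
{code}

A possible mitigation (a sketch only, not necessarily the fix that will land) would be to compute and cache the normalized path URIs once, instead of re-normalizing every registered path on every input file change.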


  was:
Reading from bucketed, partitioned tables has significantly higher overhead compared to non-bucketed, non-partitioned tables.


50% of the time is spent in these two lines of code in OrcInputFormat.getReader():
{code}
    String txnString = conf.get(ValidTxnList.VALID_TXNS_KEY,
                                Long.MAX_VALUE + ":");
    ValidTxnList validTxnList = new ValidTxnListImpl(txnString);
{code}

{code}
Stack Trace	Sample Count	Percentage(%)
hive.ql.exec.tez.MapRecordSource.pushRecord()	2,981	87.215
   org.apache.tez.mapreduce.lib.MRReaderMapred.next()	2,002	58.572
      mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(Object, Object)	2,002	58.572
         mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader()	1,984	58.046
            hive.ql.io.HiveInputFormat.getRecordReader(InputSplit, JobConf, Reporter)	1,983	58.016
               hive.ql.io.orc.OrcInputFormat.getRecordReader(InputSplit, JobConf, Reporter)	1,891	55.325
                  hive.ql.io.orc.OrcInputFormat.getReader(InputSplit, AcidInputFormat$Options)	1,723	50.41
                     hive.common.ValidTxnListImpl.<init>(String)	934	27.326
                     conf.Configuration.get(String, String)	621	18.169
 {code}
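
One possible mitigation, sketched here only for illustration (the caching helper below is hypothetical and not existing Hive code), is to parse the transaction-list string once per distinct value rather than constructing a new ValidTxnListImpl for every split:

{code}
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.common.ValidTxnList;
import org.apache.hadoop.hive.common.ValidTxnListImpl;

// Hypothetical helper: cache parsed transaction lists keyed by their string
// form so the parse happens once per distinct value instead of once per
// getReader() call.
public final class TxnListCache {
  private static final ConcurrentHashMap<String, ValidTxnList> CACHE =
      new ConcurrentHashMap<String, ValidTxnList>();

  public static ValidTxnList get(Configuration conf) {
    String txnString = conf.get(ValidTxnList.VALID_TXNS_KEY, Long.MAX_VALUE + ":");
    ValidTxnList txnList = CACHE.get(txnString);
    if (txnList == null) {
      txnList = new ValidTxnListImpl(txnString);  // same parse as the snippet above
      CACHE.putIfAbsent(txnString, txnList);
    }
    return txnList;
  }

  private TxnListCache() {}
}
{code}

Whether a shared ValidTxnList instance can safely be reused across readers would need to be verified; the sketch is only meant to show where the repeated cost comes from.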

Another 20% of the profile is spent in MapOperator.cleanUpInputFileChangedOp

5% of the CPU is spent in
{code}
 Path onepath = normalizePath(onefile);
{code}

and another 15% in
{code}
 onepath.toUri().relativize(fpath.toUri()).equals(fpath.toUri());
{code}

From the profiler:
{code}
Stack Trace	Sample Count	Percentage(%)
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(Object)	978	28.613
   org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(Writable)	978	28.613
      org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged()	866	25.336
         org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp()	866	25.336
            java.net.URI.relativize(URI)	655	19.163
               java.net.URI.relativize(URI, URI)	655	19.163
                  java.net.URI.normalize(String)	517	15.126
                     java.net.URI.needsNormalization(String)	372	10.884
                        java.lang.String.charAt(int)	235	6.875
                  java.net.URI.equal(String, String)	27	0.79
                  java.lang.StringBuilder.toString()	1	0.029
                  java.lang.StringBuilder.<init>()	1	0.029
                  java.lang.StringBuilder.append(String)	1	0.029
            org.apache.hadoop.hive.ql.exec.MapOperator.normalizePath(String)	167	4.886
               org.apache.hadoop.fs.Path.<init>(String)	162	4.74
                  org.apache.hadoop.fs.Path.initialize(String, String, String, String)	162	4.74
                     org.apache.hadoop.fs.Path.normalizePath(String, String)	97	2.838
                        org.apache.commons.lang.StringUtils.replace(String, String, String)	97	2.838
                           org.apache.commons.lang.StringUtils.replace(String, String, String, int)	97	2.838
                              java.lang.String.indexOf(String, int)	97	2.838
                     java.net.URI.<init>(String, String, String, String, String)	65	1.902
{code}



> Reading from partitioned bucketed tables has high overhead in MapOperator.cleanUpInputFileChangedOp
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8292
>                 URL: https://issues.apache.org/jira/browse/HIVE-8292
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.14.0
>         Environment: cn105
>            Reporter: Mostafa Mokhtar
>            Assignee: Prasanth J
>             Fix For: 0.14.0
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
