hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5009) Fix minor optimization issues
Date Mon, 12 Aug 2013 07:42:47 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736667#comment-13736667
] 

Thejas M Nair commented on HIVE-5009:
-------------------------------------

Based on the comments in HIVE-3739, it might be related to your jdk version. If you are using
jdk7, you might want to check if jdk6 helps. Please let us know on jira if you find a way
out.
                
> Fix minor optimization issues
> -----------------------------
>
>                 Key: HIVE-5009
>                 URL: https://issues.apache.org/jira/browse/HIVE-5009
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Benjamin Jakobus
>            Assignee: Benjamin Jakobus
>            Priority: Minor
>             Fix For: 0.12.0
>
>         Attachments: AbstractBucketJoinProc.java
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> I have found some minor optimization issues in the codebase, which I would like to rectify
and contribute. Specifically, these are:
> The optimizations that could be applied to Hive's code base are as follows:
> 1. Use StringBuffer when appending strings - In 184 instances, the concatination operator
(+=) was used when appending strings. This is inherintly inefficient - instead Java's StringBuffer
or StringBuilder class should be used. 12 instances of this optimization can be applied to
the GenMRSkewJoinProcessor class and another three to the optimizer. CliDriver uses the +
operator inside a loop, so does the column projection utilities class (ColumnProjectionUtils)
and the aforementioned skew-join processor. Tests showed that using the StringBuilder when
appending strings is 57\% faster than using the + operator (using the StringBuffer took 122
milliseconds whilst the + operator took 284 milliseconds). The reason as to why using the
StringBuffer class is preferred over using the + operator, is because
> String third = first + second;
> gets compiled to:
> StringBuilder builder = new StringBuilder( first );
> builder.append( second );
> third = builder.toString();
> Therefore, when building complex strings, that, for example involve loops, require many
instantiations (and as discussed below, creating new objects inside loops is inefficient).
> 2. Use arrays instead of List - Java's java.util.Arrays class asList method is a more
efficient at creating  creating lists from arrays than using loops to manually iterate over
the elements (using asList is computationally very cheap, O(1), as it merely creates a wrapper
object around the array; looping through the list however has a complexity of O(n) since a
new list is created and every element in the array is added to this new list). As confirmed
by the experiment detailed in Appendix D, the Java compiler does not automatically optimize
and replace tight-loop copying with asList: the loop-copying of 1,000,000 items took 15 milliseconds
whilst using asList is instant. 
> Four instances of this optimization can be applied to Hive's codebase (two of these should
be applied to the Map-Join container - MapJoinRowContainer) - lines 92 to 98:
>  for (obj = other.first(); obj != null; obj = other.next()) {
>       ArrayList<Object> ele = new ArrayList(obj.length);
>       for (int i = 0; i < obj.length; i++) {
>         ele.add(obj[i]);
>       }
>       list.add((Row) ele);
>     }
> 3. Unnecessary wrapper object creation - In 31 cases, wrapper object creation could be
avoided by simply using the provided static conversion methods. As noted in the PMD documentation,
"using these avoids the cost of creating objects that also need to be garbage-collected later."
> For example, line 587 of the SemanticAnalyzer class, could be replaced by the more efficient
parseDouble method call:
> // Inefficient:
> Double percent = Double.valueOf(value).doubleValue();
> // To be replaced by:
> Double percent = Double.parseDouble(value);
> Our test case in Appendix D confirms this: converting 10,000 strings into integers using
Integer.parseInt(gen.nextSessionId()) (i.e. creating an unnecessary wrapper object) took 119
on average; using parseInt() took only 38. Therefore creating even just one unnecessary wrapper
object can make your code up to 68% slower.
> 4. Converting literals to strings using + "" - Converting literals to strings using +
"" is quite inefficient (see Appendix D) and should be done by calling the toString() method
instead: converting 1,000,000 integers to strings using + "" took, on average, 1340 milliseconds
whilst using the toString() method only required 1183 milliseconds (hence adding empty strings
takes nearly 12% more time). 
> 89 instances of this using + "" when converting literals were found in Hive's codebase
- one of these are found in the JoinUtil.
> 5. Avoid manual copying of arrays - Instead of copying arrays as is done in GroupByOperator
on line 1040 (see below), the more efficient System.arraycopy can be used (arraycopy is a
native method meaning that the entire memory block is copied using memcpy or mmove).
> // Line 1040 of the GroupByOperator
> for (int i = 0; i < keys.length; i++) {
> 	forwardCache[i] = keys[i];
> }   
> Using System.arraycopy on an array of 10,000 strings was (close to) instant whilst the
manual copy took 6 milliseconds.
> 11 instances of this optimization should be applied to the Hive codebase.
> 6. Avoiding instantiation inside loops - As noted in the PMD documentation, "new objects
created within loops should be checked to see if they can created outside them and reused.".

> Declaring variables inside a loop (i from 0 to 10,000) took 300 milliseconds
> whilst declaring them outside took only 88 milliseconds (this can be explained by the
fact that when declaring a variable outside the loop, its reference will be re-used for each
iteration. However when declaring variables inside a loop, new references will be created
for each iteration. In our case, 10,000 references will be created by the time that this loop
finishes, meaning lots of work in terms of memory allocation and garbage collection). 1623
instances of this optimization can be applied.
> To summarize, I propose to modify the code to address issue 1 and issue 6 (remaining
issues (2 - 5) will be addressed later). Details are specified as sub-tasks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message