hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jimmy Xiang" <jxi...@cloudera.com>
Subject Re: Review Request 30739: HIVE-9574 Lazy computing in HiveBaseFunctionResultList may hurt performance [Spark Branch]
Date Sun, 08 Feb 2015 18:32:00 GMT


> On Feb. 7, 2015, 6:38 p.m., Xuefu Zhang wrote:
> > High-level comments:
> > 1. File name convention needs to take care of multiple threads doing the same thing
in a single JVM.
> > 2. Avoid creating Tuple2 objects. There is no need to have this object, but we create
them just to wrap key/value, likely increase GC pressure.
> > 3. Disk writing can be more performing. We should be able to put keys/values as
bytes in a large byte[] (as the buffer). The we need to spill, we write the whole byte[] to
disk in one shot.
> > 4. Related to #3, but I'm wondering why we need kryo at all.
> > 5. We need to take care of file closing in case of exceptions. The call usually
goes into a finally block.
> > 6. It's better to allocate the buffer in the beginning and fail right way in case
of OOM. We don't want the job to fail when only the last row needs to spill and incur OOM
then.

1. As in the javadoc, HiveKVResultCache is not thread safe, each HiveBaseFunctionResultList
instance should have its own cache. Any suggestion as how to change the file name?
2. The next() returns a Tuple2. We need to create a Tuple2 before returning the pair. Are
you saying we puth the pair in buffer without creating a Tuple2?
3. Output has a buffer itself. With the current way, we don't need to create another buffer.
4. Kryo is not used here. We just used Input and Output. The reason is that they both have
buffers inside. Input has a way to tell if we are done with the file before trying to read
from it.
5. Good point, will fix that.
6. The buffer iteself doesn't need much memory. Yes, we can pre-allocate it. Will fix that.


- Jimmy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30739/#review71546
-----------------------------------------------------------


On Feb. 7, 2015, 3:09 a.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30739/
> -----------------------------------------------------------
> 
> (Updated Feb. 7, 2015, 3:09 a.m.)
> 
> 
> Review request for hive, Rui Li and Xuefu Zhang.
> 
> 
> Bugs: HIVE-9574
>     https://issues.apache.org/jira/browse/HIVE-9574
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Result KV cache doesn't use RowContainer any more since it has logic we don't need, which
is some overhead. We don't do lazy computing right away, instead we wait a little till the
cache is close to spill.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java 78ab680

>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java 8ead0cb 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveMapFunction.java 7a09b4d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveMapFunctionResultList.java e92e299

>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 070ea4d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java
d4ff37c 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/KryoSerializer.java 286816b 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java 0df4598

> 
> Diff: https://reviews.apache.org/r/30739/diff/
> 
> 
> Testing
> -------
> 
> Unit test, test on cluster
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message