hive-dev mailing list archives

From "chengxiang li" <chengxiang...@intel.com>
Subject Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]
Date Fri, 22 May 2015 02:38:46 GMT


> On May 20, 2015, 9:12 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java, line 41
> > <https://reviews.apache.org/r/34455/diff/1/?file=964754#file964754line41>
> >
> >     Currently the storage level is memory+disk. Any reason to change it to memory_only?

Caching data to disk means that the data needs to be serialized and deserialized. That is
costly and may sometimes outweigh the gain from caching, and it is hard to measure
programmatically: reading from the source file only requires deserialization, while caching
on disk requires an additional serialization. Instead of adding an optimizer that may or may
not improve performance for users, I think it is better to narrow the optimizer's scope a
little, to make sure that this optimizer does improve performance.
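
The trade-off can be made concrete with a toy cost model. This is only a sketch and is not
part of the patch: the class name, the method names, and every unit cost below are
hypothetical, and the model assumes that the first consumer uses the freshly computed data
directly while each later consumer reads the cached copy.

```java
// Toy cost model for reusing one input across several consumers.
// All names and unit costs are hypothetical; nothing here is measured or taken from Hive.
final class CacheCostModel {

    // No cache: every consumer reads the source file and deserializes it.
    static long noCache(int consumers, long read, long deser) {
        return consumers * (read + deser);
    }

    // Memory-only cache: the data is read and deserialized once, then reused as objects.
    // The cost of reading objects from memory is assumed to be negligible.
    static long memoryCache(long read, long deser) {
        return read + deser;
    }

    // Disk cache: read and deserialize once, serialize and write the cached copy once,
    // then each later consumer reads the cached copy and deserializes it again.
    static long diskCache(int consumers, long read, long deser,
                          long ser, long write, long cacheRead) {
        return (read + deser) + (ser + write) + (consumers - 1) * (cacheRead + deser);
    }

    public static void main(String[] args) {
        int consumers = 2;
        long read = 4, deser = 6, ser = 6, write = 4, cacheRead = 4; // made-up unit costs
        System.out.println("no cache:     " + noCache(consumers, read, deser));
        System.out.println("memory cache: " + memoryCache(read, deser));
        System.out.println("disk cache:   "
            + diskCache(consumers, read, deser, ser, write, cacheRead));
    }
}
```

With these made-up numbers the disk cache costs more than not caching at all, which is the
case the narrower memory-only scope is meant to rule out. Real costs depend on the data and
the cluster.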


> On May 20, 2015, 9:12 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java, line 63
> > <https://reviews.apache.org/r/34455/diff/1/?file=964756#file964756line63>
> >
> >     Can we keep the old code around. I understand it's not currently used.

Of course we can. It just makes the code a little messy, you know, for others who want to
read the cache-related code.


> On May 20, 2015, 9:12 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java, line 25
> > <https://reviews.apache.org/r/34455/diff/1/?file=964757#file964757line25>
> >
> >     I cannot construct a case where a MapTran would need caching. Do you have an example?

For any query that contains a SparkWork like this:

    MapWork --> ReduceWork
           \--> ReduceWork

For example:

    from person_orc
    insert overwrite table p1 select city, count(*) as s group by city order by s
    insert overwrite table p2 select city, avg(age) as g group by city order by g;
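
The shape above, one work whose output feeds more than one child, can be detected with a
simple pass over the work graph. The following is a minimal sketch only: it is not the
SparkRddCachingResolver from the patch, and the class name, the method name, and the work
labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: finds works whose output is consumed by more than one child work.
// Such works are the candidates for caching in the plan shape discussed above.
final class SharedWorkFinder {

    // 'children' maps each work name to the names of the works that consume its output.
    static List<String> worksWithMultipleChildren(Map<String, List<String>> children) {
        List<String> shared = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : children.entrySet()) {
            if (entry.getValue().size() > 1) {
                shared.add(entry.getKey());
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        // One MapWork feeding two ReduceWorks, as in the multi-insert query.
        Map<String, List<String>> children = new LinkedHashMap<>();
        children.put("MapWork", Arrays.asList("ReduceWork-A", "ReduceWork-B"));
        children.put("ReduceWork-A", new ArrayList<String>());
        children.put("ReduceWork-B", new ArrayList<String>());
        System.out.println(worksWithMultipleChildren(children));
    }
}
```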


> On May 20, 2015, 9:12 p.m., Xuefu Zhang wrote:
> > spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java, line 419
> > <https://reviews.apache.org/r/34455/diff/1/?file=964774#file964774line419>
> >
> >     Do you think it makes sense for us to release the cache as soon as the job is completed, as it's done here?

Theoretically we do not need to; I mean that it would not lead to any extra memory leak. The
only benefit of unpersisting the cache manually that I can imagine is that it reduces GC
effort, since Hive releases the cache programmatically instead of letting GC collect it.
The reason I removed it is that it adds extra complexity to the code, and it cannot be
extended to share cached RDDs across Spark jobs.


- chengxiang


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review84572
-----------------------------------------------------------


On May 20, 2015, 2:37 a.m., chengxiang li wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
> 
> (Updated May 20, 2015, 2:37 a.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
> 
> 
> Bugs: HIVE-10550
>     https://issues.apache.org/jira/browse/HIVE-10550
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> see jira description
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 19d3fee
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 26cfebd
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 8b15099
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/LocalSparkJobStatus.java 5d62596
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 8e56263
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSkewJoinProcFactory.java 5990d17
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SplitSparkWorkResolver.java fb20080
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
>   spark-client/src/main/java/org/apache/hive/spark/client/JobContext.java af6332e
>   spark-client/src/main/java/org/apache/hive/spark/client/JobContextImpl.java beed8a3
>   spark-client/src/main/java/org/apache/hive/spark/client/MonitorCallback.java e1e899e
>   spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java b77c9e8
>   spark-client/src/test/java/org/apache/hive/spark/client/TestSparkClient.java d33ad7e
> 
> Diff: https://reviews.apache.org/r/34455/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> chengxiang li
> 
>

