hive-dev mailing list archives

From "Na Yang" <ny...@maprtech.com>
Subject Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]
Date Fri, 07 Nov 2014 06:34:04 GMT


> On Nov. 7, 2014, 3:32 a.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java, line 220
> > <https://reviews.apache.org/r/27719/diff/1/?file=754282#file754282line220>
> >
> >     Could you please remove the trailing spaces?

Sure.


> On Nov. 7, 2014, 3:32 a.m., Xuefu Zhang wrote:
> > ql/src/test/results/clientpositive/spark/stats1.q.out, line 182
> > <https://reviews.apache.org/r/27719/diff/1/?file=754283#file754283line182>
> >
> >     This seems slightly different from MR's output. I'm wondering if this is expected.

Xuefu, thank you for doing the code review. The Spark output is missing the stats data for
one FileSinkOperator. I need to fix that.


On Nov. 7, 2014, 3:32 a.m., Na Yang wrote:
> > The original code is pretty much cloned from Tez, I'm wondering if Tez suffers the same problem.

In Spark, we modified the union-removal code to remove the newly cloned FileSinkOperators
from the fileSinkSet, to avoid generating duplicate merge tasks. However, this left the
stats flag unset on the cloned FileSinkOperators, which are the ones actually used in the
SparkWork. My current patch adds the stats flag to only one of the cloned FileSinkOperators,
not all of them, and that causes the wrong output. I will reconsider the fix and update the
patch accordingly. Thank you, Xuefu, for the code review!


- Na


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review60294
-----------------------------------------------------------


On Nov. 7, 2014, 2:35 a.m., Na Yang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27719/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 2:35 a.m.)
> 
> 
> Review request for hive, Brock Noland, Szehon Ho, and Xuefu Zhang.
> 
> 
> Bugs: Hive-8756
>     https://issues.apache.org/jira/browse/Hive-8756
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> numRows and rawDataSize are not collected by the Spark stats. This happens because the
> stats config is not set on the FileSinkOperator in the ReduceWork. In
> GenSparkUtils.removeUnionOperators, the operator tree is cloned, and a new FileSinkOperator
> is generated and set on the reduce work. However, during processFileSink,
> GenMapRedUtils.addStatsTask sets the collectStats flag on the original FileSinkOperator,
> not on the new FileSinkOperator that is used in the ReduceWork.
> 
> 
> Diffs
> -----
> 
>   itests/src/test/resources/testconfiguration.properties 79a0132 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 8290568 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java e8e18a7 
>   ql/src/test/results/clientpositive/spark/stats1.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/27719/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Na Yang
> 
>

