hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jesús Camacho Rodríguez <jcamachorodrig...@hortonworks.com>
Subject Re: Review Request 47554: HIVE-13750
Date Thu, 19 May 2016 16:16:42 GMT


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
line 556
> > <https://reviews.apache.org/r/47554/diff/1/?file=1387390#file1387390line556>
> >
> >     Can you also add comment why this is valid only for RS added by SPDO and not
in general?
> 
> Jesús Camacho Rodríguez wrote:
>     I was thinking further about this, and indeed I made many assumptions thinking that
SPDO will not add the sort columns at the end of the additional RS keys, so then we need to
make sure the ordering is correct... but this is not true, as it will add them.
>     
>     I think this optimization can be done general, as ordering before ordering again
is a noop. I have changed the code accordingly, I would like your opinion. Am I missing something?
Let's see what QA returns...
> 
> Ashutosh Chauhan wrote:
>     yeah.. thats what I think.. this is not specific to SPDO at logical layer. 
>     Thing I am not 100% sure about is in FileSink the way SPDO is implemented I *think*
it assumes all rows for a particular key come in sorted and in one batch. Now, when we merge
and sort by two keys and if there is a case that FS was expecting all rows sorted on second
key only, now will get them sorted on 2 keys and it may have to close ORC writer and then
reopen again, as oppose to write it once and never opening it again.

I am trying to understand this part... The way the optimization works, we will keep the second
RS with its keys, we just remove the first RS... Thus, this should not be a problem? Or maybe
I am not understanding it correctly


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
line 584
> > <https://reviews.apache.org/r/47554/diff/1/?file=1387390#file1387390line584>
> >
> >     Currently, _buceket_number_ comes in as constant when it really its not. Will
that break assumption about ignoring constant?
> 
> Jesús Camacho Rodríguez wrote:
>     No, precisely that is the reason why we ignore constants. Also, because ordering
by a constant will not change the order of the records.
> 
> Ashutosh Chauhan wrote:
>     Problem is although type of _bucket_number_ is ExprNodeConstantDesc, its *not* a
constant. Its created as a placeholder and value of this column is calculated on a per-row
basis in ReduceSinkOperator::computeBucketNumber() which means at run time its not a constant.
Correct fix for this is create _bucket_number_ as ExprNodeColumnDesc in SPDO.

OK, I understand... This has changed in the new patch, so I do not think this is an issue.
Nevertheless, I think the correct way of implementing it is as you just explained (feels kind
of a hack to use ExprNodeConstantDesc for that).


- Jesús


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47554/#review133844
-----------------------------------------------------------


On May 19, 2016, 10:49 a.m., Jesús Camacho Rodríguez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47554/
> -----------------------------------------------------------
> 
> (Updated May 19, 2016, 10:49 a.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-13750
>     https://issues.apache.org/jira/browse/HIVE-13750
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-13750
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java 010c89ed978296709b052cc7bc80256a27658e2b

>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
733620b84657a21829248afe72ab16ad9692f37e 
>   ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out d03bfe422743d9a5a6b85f9a6198e1e27024f129

>   ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out dec872ab0eef54bd92d5c2bc068e2805cc14e272

>   ql/src/test/results/clientpositive/dynpart_sort_optimization_acid.q.out 832580325873dee741ba86239ee571873994a808

>   ql/src/test/results/clientpositive/reducesink_dedup.q.out b89df52f965385b85894757896eee487b29c52ae

>   ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out a90e3f63b4646cf0ade9785a501ebd1a6b2a3406

>   ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out 723e8192f2735059005fc3c5c96732a2c4be49c1

> 
> Diff: https://reviews.apache.org/r/47554/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jesús Camacho Rodríguez
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message