hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Navis Ryu" <navis....@nexr.com>
Subject Re: Review Request 24688: parallel order by clause on a string column fails with IOException: Split points are out of order
Date Tue, 26 Aug 2014 02:31:23 GMT


> On Aug. 25, 2014, 7:08 p.m., Szehon Ho wrote:
> > Looks like an important bug to fix, but I dont know too much about this code, can
you explain what is the bug in the getPartitionKey algorithm, and what is the fix?  Like why
we need to alter the stepSize as we iterate.  Is there a test we can add for this as well
to illustrate and validate the fix?
> > 
> > Also my confusion is if the other fixes on the patch are related?
> > 
> > 1.  Adding setConf on the HiveTotalOrderPartitioner is related to the bug?
> > 2.  What is the use of the new HiveConf "..min.reducer"?  My guess is you found
the algorithm not generating enough partitionKey sometimes, can you explain?

If sampled data is like "10 10 10 10 10 20" with 3 reducers, stepSize=2 (6/3), which makes
datum[2], datum[4] as partition key. But beacuse these two are same value(10), TotalOrderPartitioner
throws exception. I'll try to make a proper test case for this.

The code part 
```
while (last >= k && C.compare(sorted[last], sorted[k]) == 0) {
  ++;
}
```      
Intended to increase index if previous key is same with current key to make a difference,
but it does not work because "last >= k" is always true when i > 1.

bq. Adding setConf on the HiveTotalOrderPartitioner is related to the bug?

As commented, it's temporary fix for TEZ-1403, which is a bug not providing JobConf instance
to Partitioners.

bq. What is the use of the new HiveConf "..min.reducer"?

If sampled data does not contain more unique values than the number of reducer, it's possible
to return smaller number of partitions than expected (in current implementation as you thought).
It can be bad if someone wants 100 reducer running but it's running only 4 reducer.


> On Aug. 25, 2014, 7:08 p.m., Szehon Ho wrote:
> > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 1038
> > <https://reviews.apache.org/r/24688/diff/1/?file=660082#file660082line1038>
> >
> >     If this needs to be exposed, should be worded better.  Something like:
> >     
> >     name = "hive.optimize.sampling.orderby.min.reducer.ratio"
> >     
> >     "If sampling is enabled, this is the minimum ratio allowed of reducers calculated
by sampling to expected number of reducers".
> >     
> >     Its might be confusing to user in my opinion, as the user has little control
of what the expected reducer is, right?

Good!


- Navis


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24688/#review51426
-----------------------------------------------------------


On Aug. 14, 2014, 2:29 a.m., Navis Ryu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24688/
> -----------------------------------------------------------
> 
> (Updated Aug. 14, 2014, 2:29 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-7669
>     https://issues.apache.org/jira/browse/HIVE-7669
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The source table has 600 Million rows and it has a String column "l_shipinstruct" which
has 4 unique values. (Ie. these 4 values are repeated across the 600 million rows)
> 
> We are sorting it based on this string column "l_shipinstruct" as shown in the below
HiveQL with the following parameters. 
> {code:sql}
> set hive.optimize.sampling.orderby=true;
> set hive.optimize.sampling.orderby.number=10000000;
> set hive.optimize.sampling.orderby.percent=0.1f;
> 
> insert overwrite table lineitem_temp_report 
> select 
>   l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount,
l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct,
l_shipmode, l_comment
> from 
>   lineitem
> order by l_shipinstruct;
> {code}
> Stack Trace
> Diagnostic Messages for this Task:
> {noformat}
> Error: java.lang.RuntimeException: Error in configuring object
>         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>         at org.apache.hadoop.mapred.MapTask$OldOutputCollector.<init>(MapTask.java:569)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
>         ... 10 more
> Caused by: java.lang.IllegalArgumentException: Can't read partitions file
>         at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:116)
>         at org.apache.hadoop.mapred.lib.TotalOrderPartitioner.configure(TotalOrderPartitioner.java:42)
>         at org.apache.hadoop.hive.ql.exec.HiveTotalOrderPartitioner.configure(HiveTotalOrderPartitioner.java:37)
>         ... 15 more
> Caused by: java.io.IOException: Split points are out of order
>         at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:96)
>         ... 17 more
> {noformat}
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java af9e198 
>   common/src/java/org/apache/hadoop/hive/conf/Validator.java cea9c41 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/HiveTotalOrderPartitioner.java 6c22362 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionKeySampler.java 166461a 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecDriver.java ef72039 
> 
> Diff: https://reviews.apache.org/r/24688/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Navis Ryu
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message