hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From java8964 <java8...@hotmail.com>
Subject RE: Vectorization with UDFs returns incorrect results
Date Fri, 30 May 2014 14:52:48 GMT
When you turn "vectorized" on, does the following query consistently return 10000 in the output?
select ten_thousand() from testTabOrc
Yong

Date: Fri, 30 May 2014 08:24:43 -0400
Subject: Vectorization with UDFs returns incorrect results
From: bbowman410@gmail.com
To: user@hive.apache.org

Hive 0.13 & Hadoop 2.4
I am having an issue when using the combination of vectorized query execution, BETWEEN, and
a custom UDF.  When I have vectorization on, my query returns an empty set.  When I then turn
vectorization off, my query returns the correct results. 

Example Query:  SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) and UDF_1
My UDFs seem to be working for everything else except this specific circumstance.  Is this
a issue in the hive software or am I writing my UDFs in such a way that they do not work with
vectorization?  If the latter, what is the correct way?

I created a test scenario where I was able to reproduce this problem I am seeing:
TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 10000):  
package com.test;

import org.apache.hadoop.hive.ql.exec.Description;import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;import org.apache.hadoop.io.Text;import java.lang.String;
import java.lang.*;
public class tenThousand extends UDF {

  private final LongWritable result = new LongWritable();

  public LongWritable evaluate() {    result.set(10000);    return result;
  }}
TEST DATA (test.input):1|CBCABC|12
2|DBCABC|133|EBCABC|1440000|ABCABC|1550000|BBCABC|16
60000|CBCABC|17
CREATING ORC TABLE:
0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, second varchar(20),
third int) partitioned by (range int) clustered by (first) sorted by (first) into 8 buckets
stored as orc tblproperties ("orc.compress" = "SNAPPY", "orc.index" = "true");

CREATE LOADING TABLE:0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint,
second varchar(20), third int) partitioned by (range int) row format delimited fields terminated
by '|' stored as textfile;

COPY IN DATA:[root@server]#  hadoop fs -copyFromLocal /tmp/test.input /db/loading/.

ORC DATA:[root@server]#  beeline -u jdbc:hive2://server:10002/db -n root --hiveconf hive.exec.dynamic.partition.mode=nonstrict
--hiveconf hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) select
* from loadingDir;"

LOAD TEST FUNCTION:0: jdbc:hive2://server:10002/db>  add jar /opt/hadoop/lib/testFunction.jar

0: jdbc:hive2://server:10002/db>  create temporary function ten_thousand as 'com.test.tenThousand';


TURN OFF VECTORIZATION:0: jdbc:hive2://server:10002/db>  set hive.vectorized.execution.enabled=false;

QUERY (RESULTS AS EXPECTED):0: jdbc:hive2://server:10002/db> select first from testTabOrc
where first between ten_thousand()-10000 and ten_thousand()-9995;
+--------+| first  |+--------+
| 1      || 2      || 3      |+--------+
3 rows selected (15.286 seconds)
TURN ON VECTORIZATION:
0: jdbc:hive2://server:10002/db>  set hive.vectorized.execution.enabled=true;
QUERY AGAIN (WRONG RESULTS):
0: jdbc:hive2://server:10002/db> select first from testTabOrc where first between ten_thousand()-10000
and ten_thousand()-9995;
+--------+| first  |+--------++--------+
No rows selected (17.763 seconds)


 		 	   		  
Mime
View raw message