hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bennie Leo <tben...@hotmail.com>
Subject Hive indexing optimization
Date Fri, 26 Jun 2015 00:48:11 GMT




Hi,
 
I am attempting to optimize a query using indexing. My current query converts an ipv4 address
to a country using a geolocation table. However, the geolocation table is fairly large and
the query takes an impractical amount of time. I have created indexes and set the binary search
parameter to true (default), but the query is not faster. 
 
Here is how I set up indexing:


DROP INDEX IF EXISTS ipv4indexes ON ipv4geotable;
CREATE INDEX ipv4indexes 
ON TABLE ipv4geotable (StartIp, EndIp)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES  ('hive.index.compact.binary.search'='true');
 
ALTER INDEX ipv4indexes ON ipv4geotable REBUILD;
 
And here is my query:
 
DROP TABLE IF EXISTS ipv4table;
CREATE TABLE ipv4table AS
SELECT logon.IP, ipv4.Country
FROM 
(SELECT * FROM logontable WHERE isIpv4(IP)) logon
LEFT OUTER JOIN
(SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP) 
WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp;
 
What the query is doing is extracting an IP from logontable and finding in which range it
lies within the geolocation table (which is sorted). When a range is found, the corresponding
country is returned. The problem is that Hive goes through the whole table row by row rather
than performing a smart search (ex: binary search). 
 
Any suggestions on how to speed things up? 
 
Thank you,
B
 		 	   		   		 	   		  
Mime
View raw message