hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Connell, Chuck" <>
Subject RE: Hive 0.9 and Indexing
Date Thu, 26 Jul 2012 14:52:49 GMT
I do not have answers to any of your questions, but I appreciate you raising them. My team
is very interested in Hive indexing as well, so I look forward to this discussion.

Chuck Connell
Nuance R&D Data Team
Burlington, MA

From: John Omernik []
Sent: Thursday, July 26, 2012 10:40 AM
Subject: Hive 0.9 and Indexing

I am playing with Hive indexing and a little discouraged by the gap between the potential
seen and the amount of documentation around indexing. I am running Hive 0.9 and started playing
with indexing as follows:

I have a table logs that has a bunch of fields but for this, lets say three. sessionutc, srcip,
dstip and partitioned by DAY.

CREATE TABLE logs(sessionutc STRING, srcip STRING, dstip STRING)

The field I am hoping to index is srcip, so I created this:

CREATE INDEX idx_srcip ON TABLE logs(srcip) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'

ALTER INDEX idx_srcip ON pcaps REBUILD;

This alter index ran overnight.... Also be warned, the data in my table is 522GB.  The buidling
of index took all night and wrote a ton of data (49 GB) to the hive history file in the CLI...
that just aint right if you ask me. :) Can we limit that somehow?

The actual index table ended up being around 1.8 GB on 522GB of data.  That wasn't too bad
I guess.

Then time for queries... I thought Hive0.9 just supported indexes on queries, the old school
DB guy just ran a query

SELECT sessionutc, srcip, dstip FROM logs WHERE srcip=''

Started running 1028 Map Tasks... obviously not usiung an index, takes forever, hmmm..

So I google, and find this:

INSERT OVERWRITE DIRECTORY "/tmp/index_result" SELECT `_bucketname` ,  `_offsets` FROM web__logs_idx_srcip__
WHERE srcip = '';
SET hive.index.compact.file=/tmp/index_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip=''

43 Map task, done quickly. Indexes WORK! WOOOOOOO.

So Do Indexes work when you have a non-indexed field in your where clause.

Hypothesis: If you "and" your indexed field, it will work (like partition pruning)

SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='' and sessionutc like '2012-04-04%'
43 Map Tasks: Quick Hits Index Hypothesis: True

Hypothesis If you "or" your indexed field it will not work (like partition pruning)
SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='' or sessionutc like '2012-04-04%'
43 Map Tasks: Quick Hits Index Hypothesis: False

What the heck? HOW DOES THAT WORK? That's impossible. The data returned is truly that data
without the srcip = It's from all partitions.  I am so confused there.

Ok: Other questions.

How are indexed updated?  How does INSERT OVERWRITE or INSERT APPEND affect currently built
indexes? Are indexes rebuilt automatically or do you have to have separate job to rebuild
the indexes on partitions that you've added data too? If index updating is automagic, does
it slow INSERT times? Need to do some testing here.

Also: Is there a better way to enable Index hitting automagically without having to do the
separate index query and the two SET commands prior to my query I want to hit the index? I'd
like to utilize the indices to help our operations staff write better queries (read use less
cluster resources) but teaching that process may be difficult.

Any other good non-googlable sources of information on indexes, The reason I posted my results
and my questions here is the general lack of information around this topic.

View raw message