hive-user mailing list archives

From Mark Grover <>
Subject Re: help with compression and index
Date Tue, 21 Feb 2012 22:03:20 GMT
Hi Robert,
Hive 0.8 also introduces automatic use of indexes.
That might come in handy too!
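
For illustration, automatic index use in Hive 0.8 is switched on per session with settings along these lines. This is a sketch, not taken from the original message: the table `detail` and column `key_col` are hypothetical, and the minsize value is only an example chosen so that small test tables also qualify.

```sql
-- Let the optimizer rewrite filtered queries to use a compact index (off by default).
SET hive.optimize.index.filter=true;
-- Minimum input size in bytes before an index is considered; 0 is an example value for testing.
SET hive.optimize.index.filter.compact.minsize=0;

-- Hypothetical table and column; with an index built on key_col, no manual
-- bucket/offset extraction step should be needed for a query like this.
SELECT * FROM detail WHERE key_col = 'some_key';
```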


Mark Grover, Business Intelligence Analyst
OANDA Corporation 


"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 

----- Original Message -----
From: "Bejoy Ks" <>
Sent: Tuesday, February 21, 2012 11:47:56 AM
Subject: Re: help with compression and index

Hi Hamilton 
When you build the index (generate the index files), is compression enabled? If so, you are
running into a known issue that is fixed in Hive 0.8. An upgrade should get it rolling for
you and is recommended.


From: "Hamilton, Robert (Austin)" <> 
To: "" <> 
Sent: Tuesday, February 21, 2012 8:48 PM 
Subject: help with compression and index 

Hi all. I sent this to common-user@hadoop hoping there was an easy answer but got no response.

I have a couple of users who basically have no use case other than the need to extract specific
rows based on some predetermined set of keys, so I would like to be able to just provide them
with an index and show them how to join to the detail table using the index. So I'm looking
for a reliable compression+index method with Hive. To give an idea of the data size, my files
add up to about 80 TB uncompressed but are currently gzipped to only 10 TB. I need to keep it
smallish until I can get more disk space, so it has to stay compressed.

I don't mind recompressing to LZO or bzip but need to prove that it would actually work first.

I've done my testing on LZO and uncompressed test samples. If I use uncompressed files the
indexed select works OK. If I use LZO it returns only a fraction of the rows I expect. I gather
that files compressed with other compression methods cannot be indexed at all with Hive 0.7.1?

I'm following the prescription to select buckets/offsets into a temporary file, set hive.index.compact.file
to the temp file, set hive.input.format to HiveCompactIndexInputFormat and run my select.
That doesn't let me do subselects but I don't mind as it is only a very limited use case that
I need to support. 
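
The manual procedure described above can be sketched roughly as follows. The table `detail`, column `key_col`, index name `detail_key_idx`, and the path `/tmp/index_result` are all hypothetical, and the index table name assumes the usual `<db>__<table>_<index>__` naming; check the actual name with SHOW TABLES before relying on it.

```sql
-- Build a compact index on the lookup key (names are hypothetical).
CREATE INDEX detail_key_idx ON TABLE detail (key_col)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX detail_key_idx ON detail REBUILD;

-- Step 1: write the matching file names and offsets to a temporary directory.
INSERT OVERWRITE DIRECTORY '/tmp/index_result'
SELECT `_bucketname`, `_offsets`
FROM default__detail_detail_key_idx__
WHERE key_col = 'some_key';

-- Step 2: point Hive at that result and switch to the index-aware input format.
SET hive.index.compact.file=/tmp/index_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

-- Step 3: run the select against the detail table with the same predicate.
SELECT * FROM detail WHERE key_col = 'some_key';
```

Because the two SET statements stay in effect for the rest of the session, hive.input.format would need to be set back to its previous value before running unrelated queries.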

This is the only method I could find documented on the net. Is there a better way to do this?
I don't mind upgrading Hive (currently on 0.7.1) or Hadoop (currently 0.20.2).
