hive-user mailing list archives

From Bejoy Ks <bejoy...@yahoo.com>
Subject Re: Efficiently Store data in Hive
Date Thu, 02 Aug 2012 06:48:13 GMT
Hi Techy

LZO is not splittable on its own; it becomes splittable only once indexed. That is, if you want
your LZO-compressed files to be splittable, you need to index them with the LZO indexer after
compressing them. This is mandatory for splittability if you use text files.
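
As a rough sketch of the text-file case, a table over LZO-compressed text could be declared as
below. It assumes the third-party hadoop-lzo library is installed on the cluster (it supplies the
input format class named here). The table name quality_lzo and its location are made up for
illustration. The .lzo files still have to be indexed outside Hive with the hadoop-lzo indexer
before they become splittable.

```sql
-- Assumes hadoop-lzo is installed; quality_lzo and its location are hypothetical.
CREATE EXTERNAL TABLE quality_lzo (
  id    BIGINT,
  total BIGINT,
  error BIGINT
)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  -- Input format provided by hadoop-lzo; it uses the .index files when present.
  INPUTFORMAT  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/uname/quality_lzo';
```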

Sequence files, on the other hand, are splittable on their own. So even if you don't index
the LZO-compressed data, it is still splittable, because splittability is a property of the
Sequence file format itself.

The same applies to Snappy compression: Snappy is not natively splittable, but Snappy
on Sequence files is splittable.
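
A minimal sketch of the Sequence file approach, assuming the Hadoop 1.x-era property names
and that the Snappy codec is available on the cluster. The table name quality_seq is
hypothetical; the columns and the SELECT are taken from the question quoted below. For LZO,
the codec line would instead name com.hadoop.compression.lzo.LzoCodec from hadoop-lzo.

```sql
-- Compress the files written by the final job of the query.
SET hive.exec.compress.output=true;
-- BLOCK compression groups records into compressed blocks inside the Sequence file.
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- quality_seq is a hypothetical name for a Sequence file copy of the quality table.
CREATE TABLE quality_seq (
  id    BIGINT,
  total BIGINT,
  error BIGINT
)
PARTITIONED BY (ds STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE quality_seq PARTITION (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1 FROM Table1;
```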

AFAIK, output compression is mainly useful in the storage layer. While processing the data, there
is some minor extra work involved in uncompressing it, so I don't see a performance benefit for
MapReduce jobs on compressed data.

Whereas if you use intermediate compression and the data shuffled is large, it saves on
network transfer, thereby increasing MapReduce performance to some extent.
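
Intermediate compression can be switched on per session along these lines. This is only a
sketch, assuming the Hadoop 1.x-era property names and an available Snappy codec; any codec
installed on the cluster could be substituted.

```sql
-- Compress data passed between the MapReduce stages of a multi-job Hive query.
SET hive.exec.compress.intermediate=true;
-- Compress map output before it is shuffled to the reducers.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```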

 

Regards,
Bejoy KS


________________________________
 From: Techy Teck <comptechgeeky@gmail.com>
To: user@hive.apache.org 
Sent: Thursday, August 2, 2012 12:18 AM
Subject: Efficiently Store data in Hive
 

How can I efficiently store data in Hive and also store and
retrieve compressed data in hive?
Currently I am storing it as a TextFile.
I was going through Bejoy's article (http://kickstarthadoop.blogspot.com/2011/10/how-to-efficiently-store-data-in-hive.html)
and found that LZO compression is good for storing the files and is also
splittable.
 
I have one HiveQL SELECT query that generates some output, and I
am storing that output so that one of my Hive tables (quality) can use
the data and I can then query that table.
 
Below is the quality table, which I load from the SELECT query
below by overwriting a partition of the table.
 
create table quality
(id bigint,
 total bigint,
 error bigint
)
partitioned by (ds string)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/uname/quality'
;
 
insert overwrite table quality partition (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1 FROM Table1;
 
 
So currently I am storing it as a TextFile. Should I
make this a Sequence file and start storing the data in LZO-compressed
format, or will a text file be fine here as well? The SELECT query produces
some GB of data, which needs to be loaded into the quality table on a daily
basis.


So which way is best? Should I store the output in TextFile or SequenceFile format (with LZO
compression) so that querying the Hive quality table is faster?