hive-user mailing list archives

From Abhishek <abhishek.dod...@gmail.com>
Subject Re: best way to load millions of gzip files in hdfs to one table in hive?
Date Tue, 02 Oct 2012 23:31:24 GMT
Hi Edward,

I am kind of interested in this. For crush to work, do we need to install anything?

How can it be used in a cluster?

Regards
Abhi

Sent from my iPhone

On Oct 2, 2012, at 5:45 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

> You may want to use:
> 
> https://github.com/edwardcapriolo/filecrush
> 
> We use this to deal with pathological cases, although the best idea is
> to avoid piling up huge numbers of small files in the first place.
> 
> Edward
> 
> On Tue, Oct 2, 2012 at 4:16 PM, Alexander Pivovarov
> <apivovarov@gmail.com> wrote:
>> Options
>> 1. create a table and put the files under the table dir
>> 
>> 2. create an external table and point it at the files dir (see the HiveQL
>> sketch after this list)
>> 
>> 3. if the files are small, I recommend creating a new set of files with a
>> simple MR program, specifying the number of reduce tasks. The goal is to make
>> the file size larger than the HDFS block size (it saves NN memory and reads
>> will be faster). A Hive-side sketch of this consolidation also follows after
>> the list.
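A minimal HiveQL sketch of options 1 and 2 above: an external table declared
directly over the existing directory of gzipped text files. The column names,
delimiter, and HDFS path are placeholders rather than anything from this
thread; Hive decompresses .gz text files transparently at read time, so no
separate load step is needed.

-- Hedged sketch of option 2: external table over the existing gzip directory.
-- Columns, delimiter, and location are illustrative placeholders.
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  user_id    STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/incoming/gzip_files';  -- placeholder HDFS path

For option 1 (a managed table), the same statement without the EXTERNAL
keyword and LOCATION clause would apply, with the files then placed under the
table's warehouse directory.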
>> 
>> 
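For option 3, instead of a custom MR program, one Hive-side way to get a
similar consolidation effect is to rewrite the external table's rows into a
managed table and let Hive's file-merge settings produce fewer, larger files.
This is only a sketch under assumptions: the table and column names are the
placeholders from the sketch above, and which merge properties take effect
depends on the Hive version.

-- Hedged Hive-side alternative to a custom consolidation MR job.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;  -- aim for roughly an HDFS block or more per file

CREATE TABLE events_consolidated LIKE raw_events;

INSERT OVERWRITE TABLE events_consolidated
SELECT * FROM raw_events;

Because gzip files are not splittable, the read side of this job still opens a
task per input file (or per combined split), so with millions of files the
dedicated MR consolidation Alexander describes may remain the more practical
route.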
>> On Tue, Oct 2, 2012 at 3:53 PM, zuohua zhang <zuohua@gmail.com> wrote:
>>> 
>>> I have millions of gzip files in HDFS (all with the same fields) and would
>>> like to load them into one table in Hive with a specified schema.
>>> What is the most efficient way to do that?
>>> Given that my data is already in HDFS, and already gzipped, does that mean
>>> I could simply set up the table in a way that bypasses some of the
>>> unnecessary overhead of the typical load approach?
>>> 
>>> Thanks!
>> 
>> 
