hive-user mailing list archives

From Steven Wong <sw...@netflix.com>
Subject RE: What is best way to load data into hive tables/hadoop file system
Date Wed, 02 Nov 2011 02:27:17 GMT
Run multiple concurrent LOAD DATAs, one per file.

Alternatively, if your TT nodes have access to the source file system, use a map-only Hadoop
job, such as distcp.
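The two suggestions above can be sketched in shell. The table name `mytable`, the source directory, and the dry-run `echo` are illustrative assumptions, not from the thread; drop the `echo` to actually run the loads, and the commented `distcp` line shows the map-only copy alternative.

```shell
# Sketch of "multiple concurrent LOAD DATAs, one per file".
# mytable, $SRC, and the echo dry-run are assumptions for illustration.
SRC=${SRC:-/tmp/loaddemo}
mkdir -p "$SRC"
touch "$SRC/part1.gz" "$SRC/part2.gz"   # stand-in input files

for f in "$SRC"/*.gz; do
  # One Hive session per file; remove "echo" to actually run the load.
  echo hive -e "LOAD DATA LOCAL INPATH '$f' INTO TABLE mytable" &
done
wait   # block until every backgrounded load has finished

# Map-only copy alternative, if the TT nodes can reach the source filesystem:
# hadoop distcp file:///data/in hdfs:///user/hive/warehouse/mytable
```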


From: Shantian Purkad [mailto:shantian_purkad@yahoo.com]
Sent: Monday, October 31, 2011 4:34 PM
To: common-user@hadoop.apache.org; user@hive.apache.org
Subject: What is best way to load data into hive tables/hadoop file system

Hello,

We have multiple terabytes of data, currently in gz format at approximately 2 GB per file. What is the best way to load that data into Hadoop?

We have seen that loading a gz file (especially with Hive's LOAD DATA LOCAL INPATH ...) takes around 12 seconds, while loading the same file decompressed (around 4-5 GB) takes 8 minutes.

We want these files to be processed by multiple mappers in Hadoop, not by a single one.

What would be the best way to load these files into Hive/HDFS so that loading takes less time and multiple mappers can process the files?
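One detail behind the multiple-mappers question: gzip is not a splittable codec, so Hadoop assigns each .gz file to exactly one mapper regardless of its size. A workaround (an assumption on my part, not something the thread proposes) is to recompress to a splittable codec such as bzip2 before loading; a minimal sketch on a demo file:

```shell
# Gzip is not splittable: one .gz file = one mapper.
# Hypothetical workaround: recompress to bzip2, which is splittable.
cd "$(mktemp -d)"
printf 'line1\nline2\n' > demo.txt
gzip -k demo.txt                               # demo.txt.gz, as the files arrive
gunzip -c demo.txt.gz | bzip2 > demo.txt.bz2   # splittable replacement
```

The trade-off is CPU time spent recompressing; keeping many moderate-size gz files, one per mapper, as described above, avoids that cost.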


Thanks and Regards,
Shantian
