hive-user mailing list archives

From Raj Hadoop <hadoop...@yahoo.com>
Subject Re: Oracle to HDFS through Sqoop and a Hive External Table
Date Sun, 03 Nov 2013 16:57:08 GMT
Manish,

Thanks for the reply.


1. Load to HDFS, but beware of Sqoop error handling; since it is a MapReduce-based framework,
if one mapper fails you can end up with partial data.
So are you saying that, if I handle errors in Sqoop, going with 100 HDFS folders/files is OK?
(See the per-segment import sketch below.)

2. Create partitions based on date and hour, if the customer table has a date or timestamp column.
I cannot rely on a date or timestamp column, so can I go with Customer ID instead? (See the
partitioning sketch below.)

3. Think about the file format as well, as that will affect load and query time.
Can you please suggest a file format I should use? (See the file-format sketch below.)

4. Think about compression beforehand as well, as that will govern how the data is split and
the performance of your queries.
Does compression increase or reduce performance? Isn't the main advantage of compression the
storage savings? (See the compression sketch below.)
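
On point 1, here is a minimal sketch of what one of the 100 per-segment imports could look like.
The connection string, table and column names, segment size, and directory layout are all
assumptions for illustration, not details from this thread. The idea is that each segment is an
independent Sqoop job, so if a mapper fails only that segment's partial output has to be thrown
away and re-run:

#!/bin/bash
# Import one Customer ID segment (segment 7 of 100) into its own HDFS directory.
SEG=7
LOW=$((SEG * 600000))        # assumes ~60M IDs spread evenly over 100 segments
HIGH=$(((SEG + 1) * 600000))
TARGET=/data/customer/seg=${SEG}

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table CUSTOMER \
  --where "CUSTOMER_ID >= ${LOW} AND CUSTOMER_ID < ${HIGH}" \
  --target-dir "${TARGET}" \
  --num-mappers 4

# Sqoop exits non-zero if any mapper fails; remove the partial output and
# re-run just this segment rather than the whole 60M-row load.
if [ $? -ne 0 ]; then
  hadoop fs -rm -r -skipTrash "${TARGET}"
  echo "Segment ${SEG} failed; re-run it." >&2
  exit 1
fi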
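
On point 2, partitioning directly on Customer ID would mean one partition per customer, which
Hive handles poorly; a common compromise is to partition on the 100 ID segments the import
already produces. A rough sketch, with made-up column names standing in for the real Oracle
columns:

# Define an external table over the per-segment directories written by Sqoop
# (Sqoop's default text output is comma-delimited), then register one
# partition per imported directory. Repeat the ALTER TABLE for seg=0..99.
hive -e "
CREATE EXTERNAL TABLE customer (
  customer_id BIGINT,
  name        STRING,
  created_dt  STRING
)
PARTITIONED BY (seg INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/customer';

ALTER TABLE customer ADD PARTITION (seg=7) LOCATION '/data/customer/seg=7';
"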
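
On point 3, the thread does not settle on a format, so treat this as one common pattern rather
than a recommendation from anyone here: land the Sqoop output as delimited text, then rewrite it
into a columnar format such as ORC from Hive (ORC needs Hive 0.11+), which tends to help when
the queries are not known in advance:

# Make a columnar (ORC) copy of the external 'customer' table from the
# previous sketch, one segment at a time.
hive -e "
CREATE TABLE customer_orc (
  customer_id BIGINT,
  name        STRING,
  created_dt  STRING
)
PARTITIONED BY (seg INT)
STORED AS ORC;

INSERT OVERWRITE TABLE customer_orc PARTITION (seg=7)
SELECT customer_id, name, created_dt FROM customer WHERE seg = 7;
"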
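
On point 4, compression trades a little CPU for less disk and network I/O, so scan-heavy queries
usually get faster rather than slower; the storage saving is a bonus. The caveat behind the
comment about splits is that a compressed plain-text file cannot be split across map tasks,
which is why compression is normally paired with a container or columnar format (ORC, for
example, applies its own block-level compression). A hedged Sqoop-side example, again with
made-up connection details:

# Per-segment import with Snappy-compressed output. Compressed text files are
# not splittable, so each output file will be read by a single map task later;
# a container format (SequenceFile/Avro) or a later ORC rewrite avoids that.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table CUSTOMER \
  --where "CUSTOMER_ID >= 4200000 AND CUSTOMER_ID < 4800000" \
  --target-dir /data/customer/seg=7 \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec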

- Raj



On Sunday, November 3, 2013 11:03 AM, manish.hadoop.work <manish.hadoop.work@gmail.com>
wrote:
 
1. Load to HDFS, but beware of Sqoop error handling; since it is a MapReduce-based framework,
if one mapper fails you can end up with partial data.

2. Create partitions based on date and hour, if the customer table has a date or timestamp column.

3. Think about the file format as well, as that will affect load and query time.

4. Think about compression beforehand as well, as that will govern how the data is split and
the performance of your queries.

Regards,
Manish



Sent from my T-Mobile 4G LTE Device


-------- Original message --------
From: Raj Hadoop <hadoopraj@yahoo.com> 
Date: 11/03/2013  7:39 AM  (GMT-08:00) 
To: Hive <user@hive.apache.org>, Sqoop <user@sqoop.apache.org>, User <user@hadoop.apache.org>

Subject: Oracle to HDFS through Sqoop and a Hive External Table 



Hi,

I am sending this to the Hadoop, Hive, and Sqoop distribution lists, as this question is closely
related to all three areas.

I have this requirement.

I have a big table in Oracle (about 60 million rows, primary key Customer ID). I want to bring
this into HDFS and then create a Hive external table over it. My requirement is to run queries
on this Hive table (at this time I do not know which queries I will be running).

Is the following a good design for the above problem? What are the pros and cons of this approach?


1) Load the table into HDFS using Sqoop, splitting it across multiple folders (divide the
Customer IDs into 100 segments).
2) Create a Hive external partitioned table over the above 100 HDFS directories.


Thanks,
Raj