hive-user mailing list archives

From Bhavesh Shah <bhavesh25s...@gmail.com>
Subject Re: Doubts related to Amazon EMR
Date Tue, 24 Apr 2012 05:04:01 GMT
Thanks, all, for your answers.
I want to ask one more thing:
1) I have written a program (my task) that contains Hive JDBC code and
Sqoop commands for importing and exporting the tables.
    If I create a JAR of my program and put it on EMR, do I need to do
anything extra, like writing mappers/reducers, for the program to run?
    Or can I just create the JAR and run it?
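
For context, the Hive JDBC part of my program looks roughly like this
(simplified sketch; the connection URL and table name are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer1-era JDBC driver.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Assumes the Hive Thrift server is listening on localhost:10000.
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // "my_table" is a placeholder; the real program runs my algorithm.
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_table");
        while (rs.next()) {
            System.out.println("row count: " + rs.getLong(1));
        }
        con.close();
    }
}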



-- 
Regards,
Bhavesh Shah


On Tue, Apr 24, 2012 at 7:20 AM, Mark Grover <mgrover@oanda.com> wrote:

> Hi Bhavesh,
>
> To answer your questions:
>
> 1) S3 terminology uses the word "object", and I am sure they have good
> reasons as to why, but for us Hive'ers, an S3 object is the same as a file
> stored on S3. The complete path to the file is what Amazon calls the S3
> "key", and the corresponding value is the contents of the file. E.g.,
> s3://my_bucket/tables/log.txt would be the key, and the actual content of
> the file would be the S3 object. You can use the AWS web console to create
> a bucket and use tools like s3cmd (http://s3tools.org/s3cmd) to put data
> onto S3.
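>
> If you'd rather upload from Java than use s3cmd, the AWS SDK for Java can
> do it in a couple of lines (untested sketch; the credentials, bucket, and
> key below are just examples):
>
> import java.io.File;
> import com.amazonaws.auth.BasicAWSCredentials;
> import com.amazonaws.services.s3.AmazonS3;
> import com.amazonaws.services.s3.AmazonS3Client;
>
> public class S3UploadSketch {
>     public static void main(String[] args) {
>         // Placeholder credentials; substitute your own keys.
>         AmazonS3 s3 = new AmazonS3Client(
>                 new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
>
>         // Uploads log.txt as the object s3://my_bucket/tables/log.txt.
>         s3.putObject("my_bucket", "tables/log.txt", new File("log.txt"));
>     }
> }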
>
> However, like Kyle said, you don't necessarily need to use S3. S3 is
> typically only used when you want persistent storage of data.
> Most people would store their input logs/files on S3 for Hive processing
> and also store the final aggregations and results on S3 for future
> retrieval. If you are just temporarily loading some data into Hive,
> processing it and exporting it out, you don't have to worry about S3. The
> nodes that form your cluster have ephemeral storage that forms the HDFS.
> You can just use that. The only side effect is that you will lose all your
> data in HDFS once you terminate the cluster. If that's ok, don't worry
> about S3.
>
> EMR instances are basically EC2 instances with some additional setup done
> on them. Transferring data between EC2 and EMR instances should be simple,
> I'd think. If your data is present in EBS volumes, you could look into
> adding an EMR bootstrap action that mounts that same EBS volume onto your
> EMR instances. It might be easier if you can do it without all the fancy
> mounting business though.
>
> Also, keep in mind that there might be costs for data transfers across
> Amazon data centers, so you would want to keep your S3 buckets, EMR
> cluster, and EC2 instances in the same region, if at all possible. Within
> the same region, there shouldn't be any extra transfer costs.
>
> 2) Yeah, EMR supports custom jars. You can specify them at the time you
> create your cluster. This should require minimal porting changes to the
> jar itself, since the Hadoop and Hive versions on EMR are the same as
> (well, close enough to) what you installed on your local cluster.
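>
> If you end up scripting cluster creation, the AWS SDK for Java can submit
> a custom-jar step programmatically. Something like this should be close
> (untested sketch; jar path, instance types, and counts are placeholders):
>
> import com.amazonaws.auth.BasicAWSCredentials;
> import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
> import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
> import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
> import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
> import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
> import com.amazonaws.services.elasticmapreduce.model.StepConfig;
>
> public class EmrCustomJarSketch {
>     public static void main(String[] args) {
>         AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
>                 new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
>
>         // The jar must be uploaded to S3 first; the path is an example.
>         HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
>                 .withJar("s3://my_bucket/jars/my-task.jar")
>                 .withArgs("arg1", "arg2");
>
>         StepConfig step = new StepConfig("my-custom-jar-step", jarStep)
>                 .withActionOnFailure("TERMINATE_JOB_FLOW");
>
>         RunJobFlowRequest request = new RunJobFlowRequest()
>                 .withName("my-job-flow")
>                 .withLogUri("s3://my_bucket/logs/")
>                 .withSteps(step)
>                 .withInstances(new JobFlowInstancesConfig()
>                         .withInstanceCount(3)
>                         .withMasterInstanceType("m1.small")
>                         .withSlaveInstanceType("m1.small"));
>
>         RunJobFlowResult result = emr.runJobFlow(request);
>         System.out.println("Started job flow: " + result.getJobFlowId());
>     }
> }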
>
> 3) Like Kyle said, Sqoop with EMR should be OK.
>
> Good luck!
> Mark
>
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgrover@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Kyle Mulka" <kyle.mulka@gmail.com>
> To: user@hive.apache.org
> Cc: user@hive.apache.org, dev@hive.apache.org
> Sent: Monday, April 23, 2012 10:55:36 AM
> Subject: Re: Doubts related to Amazon EMR
>
>
> It is possible to install Sqoop on AWS EMR. I've got some scripts I can
> publish later. You are not required to use S3 to store files and can use
> the local (temporary) HDFS instead. After you have Sqoop installed, you
> can use it to import your data into HDFS, run your calculations there,
> then export your data back out using Sqoop again.
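>
> Since your program is in Java, you can also drive Sqoop from code instead
> of shelling out. Roughly (untested sketch against Sqoop 1.x; the JDBC
> URL, credentials, and table names are placeholders):
>
> import com.cloudera.sqoop.Sqoop;
>
> public class SqoopRoundTripSketch {
>     public static void main(String[] args) {
>         // Import a table from a JDBC source (e.g. MySQL on EC2) into HDFS.
>         int rc = Sqoop.runTool(new String[] {
>                 "import",
>                 "--connect", "jdbc:mysql://my-ec2-host:3306/mydb",
>                 "--username", "user", "--password", "pass",
>                 "--table", "input_table",
>                 "--target-dir", "/user/hadoop/input_table"
>         });
>
>         // ... run the Hive processing here ...
>
>         // Export the Hive result table's files back to the JDBC source.
>         rc = Sqoop.runTool(new String[] {
>                 "export",
>                 "--connect", "jdbc:mysql://my-ec2-host:3306/mydb",
>                 "--username", "user", "--password", "pass",
>                 "--table", "result_table",
>                 "--export-dir", "/user/hive/warehouse/result_table"
>         });
>         System.exit(rc);
>     }
> }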
>
> --
> Kyle Mulka
> http://www.kylemulka.com
>
> On Apr 23, 2012, at 8:42 AM, Bhavesh Shah <bhavesh25shah@gmail.com>
> wrote:
>
>
> Hello all,
>
>
> I want to deploy my task on Amazon EMR. But as I am new to Amazon Web
> Services, I am confused about the concepts.
>
> My Use Case:
>
> I want to import large data from EC2 into Hive through Sqoop. The
> imported data will be processed in Hive by applying some algorithm, which
> will generate a result (in table form, in Hive only). The generated
> result will then be exported back to EC2, again through Sqoop only.
>
> I am new to Amazon Web Services and want to implement this use case with
> the help of AWS EMR. I have already implemented it on a local machine.
>
> I have read some links related to AWS EMR about launching instances, what
> EMR is, how it works, etc. I still have some doubts about EMR:
>
> 1) EMR uses S3 buckets, which hold the input and output data of the
> Hadoop processing (in the form of objects). ---> I don't get how to store
> the data in the form of objects on S3 (my data will be files).
>
> 2) As already said, I have implemented the task for my use case in Java.
> So suppose I create a JAR of my program and create the job flow with a
> custom JAR. Will it be possible to implement it like this, or do I need
> to do something extra for that?
>
> 3) As I said in my use case, I want to export my result back to EC2 with
> the help of Sqoop. Does EMR support Sqoop?
>
> If you have any kind of idea related to AWS, please reply with your
> answer as soon as possible. I want to do this as early as possible.
>
> Many thanks.
>
>
> --
> Regards,
> Bhavesh Shah
>
