hadoop-common-user mailing list archives

From David Parks <davidpark...@yahoo.com>
Subject Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
Date Fri, 29 Mar 2013 14:34:06 GMT
CDH4 can be based on either the 1.x or 2.x Hadoop line; are you using the 2.x line? I've used it primarily with
1.0.3, which is what AWS uses, so I presume that's what it's tested against.
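
If you're not sure which line your CDH4 build is on, running this on any node prints the exact Hadoop version and build it came from:

       hadoop version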

Himanish Kushary <himanish@gmail.com> wrote:

>Thanks Dave.
>
>
>I had already tried using the s3distcp jar, but got stuck on the error below, which made me think that this is something specific to the Amazon Hadoop distribution.
>
>
>Exception in thread "Thread-28" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream 
>
>
>Also, I noticed that the Amazon EMR hadoop-core.jar has this class, but it is not present in the CDH4 (my local env) Hadoop jars.
>
>
>Could you suggest how I could get around this issue? One option could be using the Amazon-specific jars, but then I would probably need to pull in all of their jars (otherwise it could cause version mismatch errors for HDFS - NoSuchMethodError etc.).
>
>
>Appreciate your help regarding this.
>
>
>- Himanish
>
>
>
>
>On Fri, Mar 29, 2013 at 1:41 AM, David Parks <davidparks21@yahoo.com> wrote:
>
>None of that complexity is needed; they distribute the jar publicly (not the source, but the jar).
You can just add this to your libjars: s3n://region.elasticmapreduce/libs/s3distcp/latest/s3distcp.jar
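>
>A rough sketch of what that looks like when launching your own Tool with the generic -libjars option (the driver jar and class names below are just placeholders, and it assumes your cluster has S3 credentials configured so it can fetch the jar):
>
>       hadoop jar my-driver.jar com.example.MyDriver \
>              -libjars s3n://region.elasticmapreduce/libs/s3distcp/latest/s3distcp.jar \
>              <your job arguments>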
>
> 
>
>No VPN or anything, if you can access the internet you can get to S3. 
>
> 
>
>Follow their docs here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> 
>
>Doesn’t matter where your Hadoop instance is running.
>
> 
>
>Here’s an example of the code/parameters I used to run it from within another Tool. S3DistCp is itself a Tool, so it’s actually designed to be run from the Hadoop command line normally.
>
> 
>
>       ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>              "--src",        "/frugg/image-cache-stage2/",
>              "--srcPattern", ".*part.*",
>              "--dest",       "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
>              "--s3Endpoint", "s3.amazonaws.com"
>       });
>
> 
>
>Watch the “srcPattern”: make sure you have that leading `.*`; that one threw me for a loop once.
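>
>For reference, since it’s a Tool you can also launch it straight from the command line; roughly something like this, assuming you’ve pulled the jar down locally (the <env> and <date> pieces stand in for the values my code builds, so swap in your own bucket/paths):
>
>       hadoop jar s3distcp.jar \
>              --src /frugg/image-cache-stage2/ \
>              --srcPattern '.*part.*' \
>              --dest s3n://fruggmapreduce/results-<env>/<date>/output/itemtable/ \
>              --s3Endpoint s3.amazonaws.com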
>
> 
>
>Dave
>
> 
>
> 
>
>From: Himanish Kushary [mailto:himanish@gmail.com] 
>Sent: Thursday, March 28, 2013 5:51 PM
>To: user@hadoop.apache.org
>Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> 
>
>Hi Dave,
>
> 
>
>Thanks for your reply. Our Hadoop instance is inside our corporate LAN. Could you please provide some details on how I could use s3distcp from Amazon to transfer data from our on-premises Hadoop to Amazon S3? Wouldn't some kind of VPN be needed between the Amazon EMR instance and our on-premises Hadoop instance? Did you mean using the jar from Amazon on our local server?
>
> 
>
>Thanks
>
>On Thu, Mar 28, 2013 at 3:56 AM, David Parks <davidparks21@yahoo.com> wrote:
>
>Have you tried using s3distcp from Amazon? I have used it many times to transfer 1.5 TB between S3 and Hadoop instances. The process took 45 minutes, well over the 10-minute timeout period you’re running into problems with.
>
> 
>
>Dave
>
> 
>
> 
>
>From: Himanish Kushary [mailto:himanish@gmail.com] 
>Sent: Thursday, March 28, 2013 10:54 AM
>To: user@hadoop.apache.org
>Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> 
>
>Hello,
>
> 
>
>I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using the distcp utility. There are around 2200 files distributed over 15 directories. The maximum individual file size is approximately 50 MB.
>
> 
>
>The distcp MapReduce job keeps failing with this error:
>
> 
>
>"Task attempt_201303211242_0260_m_000005_0 failed to report status for 600 seconds. Killing!"
 
>
> 
>
>and in the task attempt logs I can see a lot of INFO messages like
>
> 
>
>"INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception (java.io.IOException)
caught when processing request: Resetting to invalid mark"
>
> 
>
>As a workaround I am thinking of either transferring individual folders instead of the entire 70 GB, or alternatively increasing the "mapred.task.timeout" parameter to something like 6-7 hours (as the average rate of transfer to S3 seems to be about 5 MB/s). Is there any better option to increase the throughput for transferring bulk data from HDFS to S3? Looking forward to suggestions.
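>
>For example, I guess something along these lines would bump the timeout for the distcp job (the value is in milliseconds, so 25200000 is roughly 7 hours; the namenode path and bucket below are just placeholders, and I'm assuming the S3 credentials are already set in the configuration):
>
>       hadoop distcp -Dmapred.task.timeout=25200000 \
>              hdfs://<namenode>/path/to/source \
>              s3n://<bucket>/path/to/dest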
>
> 
>
> 
>
>-- 
>Thanks & Regards
>Himanish 
>
>
>
> 
>
>-- 
>Thanks & Regards
>Himanish 
>
>
>
>
>-- 
>Thanks & Regards
>Himanish 
>