hadoop-common-user mailing list archives

From Himanshu Vijay <himansh...@gmail.com>
Subject Re: Splitting data input to Distcp
Date Thu, 03 May 2012 22:47:11 GMT
Pedro,

Thanks for the response. Unfortunately, I am running this on an in-house
cluster, and I need to upload to S3 from there.

-Himanshu

On Wed, May 2, 2012 at 2:03 PM, Pedro Figueiredo <pfig@89clouds.com> wrote:

>
> On 2 May 2012, at 18:29, Himanshu Vijay wrote:
>
> > Hi,
> >
> > I have 100 files, each ~3 GB. I need to distcp them to S3, but the copy
> > fails because of the large file size. The files are not gzipped, so they
> > are splittable. Is there a way or property to tell Distcp to first split
> > the input files into, let's say, 200 MB or N lines each before copying
> > to the destination?
> >
>
> Assuming you're using EMR, use s3distcp:
>
>
> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> In any case, that's strange because S3's limit is 5GB per PUT request;
> again if you're running on EMR, try starting your cluster with
>
> --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
>  --args "-c,fs.s3n.multipart.uploads.enabled=true,-c,fs.s3n.multipart.uploads.split.size=524288000"
>
> (or add those to whatever parameters you currently use).
>
> Going back to plain distcp, I'm not sure about what the -sizelimit option
> does, as I've never used it.
>
> If push comes to shove, seeing as you have a Hadoop cluster, running a job
> to write the files to S3 with compression enabled is always an option :)
>
> Cheers,
>
> Pedro Figueiredo
> Skype: pfig.89clouds
> http://89clouds.com/ - Big Data Consulting
>
>
>
>
>


-- 
-Himanshu Vijay
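The splitting asked about in the original question (pieces of roughly 200 MB, cut only at line boundaries) can be sketched in Python as below. This is a minimal illustration, not a Distcp feature: the function name split_on_lines, the ".part-NNNNN" naming scheme, and the assumption that each file is available on a local filesystem are all hypothetical and do not come from the thread.

```python
import os


def split_on_lines(src_path, out_dir, max_bytes=200 * 1024 * 1024):
    """Split src_path into part files of at most max_bytes each.

    Parts are cut only at line ends, so every part is a whole number of
    lines. A single line longer than max_bytes gets a part of its own.
    Returns the list of part file paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(src_path)
    parts = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new part for the first line, or when adding this
            # line would push the current part over the size limit.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                # Hypothetical naming scheme: <original name>.part-00000, ...
                part_path = os.path.join(out_dir, "%s.part-%05d" % (base, len(parts)))
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Concatenating the parts in order reproduces the original file byte for byte. For data that lives only in HDFS, the same idea would need to run against a local copy or be adapted into a job on the cluster, which this sketch does not attempt.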
