hadoop-common-user mailing list archives

From Pedro Figueiredo <p...@89clouds.com>
Subject Re: Splitting data input to Distcp
Date Wed, 02 May 2012 21:03:52 GMT

On 2 May 2012, at 18:29, Himanshu Vijay wrote:

> Hi,
> 
> I have 100 files, each ~3 GB. I need to distcp them to S3, but the copy
> fails because of the files' large size. The files are not gzipped, so they
> are splittable. Is there a way or property to tell DistCp to first split
> the input files into, say, 200 MB or N lines each before copying to the
> destination?
> 

Assuming you're using EMR, use s3distcp:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
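
Something along these lines should get you split output (the jobflow ID, paths, and --groupBy pattern below are placeholders; per the docs, --targetSize is in MiB and applies to the groups formed by --groupBy):

elastic-mapreduce --jobflow j-XXXXXXXXXXXXX \
  --jar s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '--src,hdfs:///data/input,--dest,s3://my-bucket/output/,--groupBy,.*(part-[0-9]+).*,--targetSize,200'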

In any case, that's strange, because S3's limit is 5 GB per single PUT request, so 3 GB files should fit. If you're running on EMR, try starting your cluster with

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-c,fs.s3n.multipart.uploads.enabled=true,-c,fs.s3n.multipart.uploads.split.size=524288000"

(or add those to whatever parameters you currently use).
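
As far as I know, the configure-hadoop bootstrap action just writes those -c values into core-site.xml, so for reference the resulting properties look like this (524288000 bytes = 500 MB per upload part):

<property>
  <name>fs.s3n.multipart.uploads.enabled</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3n.multipart.uploads.split.size</name>
  <value>524288000</value>
</property>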

Going back to plain distcp, I'm not sure what the -sizelimit option does, as I've never used it.

If push comes to shove, seeing as you have a Hadoop cluster, running a job to write the files
to S3 with compression enabled is always an option :)
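
A minimal sketch of that with Hadoop streaming, as an identity map-only job with gzip output (the paths and bucket are made up, and you'd need your S3 credentials configured, e.g. fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -D mapred.reduce.tasks=0 \
  -input hdfs:///data/input \
  -output s3n://my-bucket/compressed/ \
  -mapper /bin/cat

Bear in mind that gzipped output won't be splittable on the way back in; a splittable codec like LZO would avoid that, at the cost of some setup.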

Cheers,

Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting
