hadoop-mapreduce-user mailing list archives

From Christoph Schmitz <Christoph.Schm...@1und1.de>
Subject Re: How to split a big file in HDFS by size
Date Mon, 20 Jun 2011 06:28:26 GMT
JJ,

uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will be slow. If possible,
try to get the files in smaller chunks where they are created, and upload them in parallel
with a simple MapReduce job that only passes the data through (i.e. uses the standard Mapper
and Reducer classes). This job should read from your local input directory and write its
output into HDFS.
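
Roughly, such a pass-through job could look like the sketch below (untested; the paths,
class names and record types are only placeholders, and it assumes the local input
directory is reachable from the task nodes, e.g. as a file:// URI on a shared mount):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughUpload {

    // Emits each input line unchanged; NullWritable suppresses the byte-offset
    // key so the output files are a plain copy of the input lines.
    public static class CopyMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), line);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "pass-through upload");
        job.setJarByClass(PassThroughUpload.class);
        job.setMapperClass(CopyMapper.class);
        job.setNumReduceTasks(0);                 // map-only: no shuffle, just copy
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. file:///local/chunks
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/jj/uploaded
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}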

If you cannot split the 60 GB where they are created, IMHO there is not much you can do. If
you have a file format with, say, fixed-length records, you could try to create your own InputFormat
that splits the file logically, without creating the actual splits locally (which would be
too costly, I assume).
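
As a rough sketch of that idea (again untested; the record length, split size and class
name are my assumptions, and the RecordReader that would actually read fixed-length
records out of a split is only stubbed out here):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FixedRecordInputFormat
        extends FileInputFormat<LongWritable, BytesWritable> {

    private static final long RECORD_BYTES = 1024;      // assumed fixed record size
    private static final long SPLIT_BYTES  = 1L << 30;  // ~1 GB logical splits

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return true;  // fixed-length records can be cut at any record boundary
    }

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        // Round the split size down to a multiple of the record length so
        // no record straddles two splits.
        long splitSize = (SPLIT_BYTES / RECORD_BYTES) * RECORD_BYTES;
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus file : listStatus(job)) {
            long remaining = file.getLen();
            long offset = 0;
            while (remaining > 0) {
                long length = Math.min(splitSize, remaining);
                splits.add(new FileSplit(file.getPath(), offset, length, new String[0]));
                offset += length;
                remaining -= length;
            }
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Not shown: a reader that returns RECORD_BYTES-sized records
        // from its portion of the file.
        throw new UnsupportedOperationException("record reader not sketched");
    }
}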

The performance of reading in parallel, though, will depend to a large extent on the nature
of your local storage. If you have a single hard drive, reading in parallel might actually
be slower than reading serially because it means a lot of random disk accesses.

Regards,
Christoph

-----Original Message-----
From: Mapred Learn [mailto:mapred.learn@gmail.com]
Sent: Monday, June 20, 2011 06:02
To: mapreduce-user@hadoop.apache.org; cdh-user@cloudera.org
Subject: How to split a big file in HDFS by size

Hi,
I am trying to upload text files of size 60 GB or more.
I want to split these files into smaller files of, say, 1 GB each so that I can run further
map-reduce jobs on them.

Does anybody have an idea how I can do this?
Thanks a lot in advance! Any ideas are greatly appreciated!

-JJ
