hadoop-mapreduce-user mailing list archives

From Mapred Learn <mapred.le...@gmail.com>
Subject Re: AW: How to split a big file in HDFS by size
Date Mon, 20 Jun 2011 14:13:49 GMT
But this file is a gzipped text file, so it will go to only 1 mapper. If it were split into
60 1-GB files instead, the map-red job would finish earlier than with one 60-GB file, since
it would have 60 mappers running in parallel. Isn't that so?
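
(Illustration only: a minimal sketch of how the input format decides this, by asking the codec
factory whether the file's compression codec can be split. Gzip cannot, so the whole file becomes
a single split; bzip2 can. The class name SplitCheck is made up, and SplittableCompressionCodec
is assumed to be available in your Hadoop version.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplitCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]); // e.g. a .gz or .bz2 file

    // Same lookup TextInputFormat does: pick the codec from the file suffix.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);

    if (codec == null) {
      System.out.println("Uncompressed: one mapper per HDFS block");
    } else if (codec instanceof SplittableCompressionCodec) {
      System.out.println("Splittable codec (e.g. bzip2): several mappers possible");
    } else {
      System.out.println("Non-splittable codec (e.g. gzip): one mapper for the whole file");
    }
  }
}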

Sent from my iPhone

On Jun 20, 2011, at 12:59 AM, Christoph Schmitz <Christoph.Schmitz@1und1.de> wrote:

> Simple answer: don't. The Hadoop framework will take care of that for you and split the
file. The logical 60 GB file you see in the HDFS actually *is* split into smaller chunks (default
size is 64 MB) and physically distributed across the cluster.
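
(To illustrate the point above, a small sketch that prints how many HDFS blocks a file already
occupies; the path argument is a placeholder. The same information can be seen with
hadoop fsck <path> -files -blocks.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path(args[0])); // e.g. /user/jj/bigfile.txt

    // One entry per HDFS block; with the default 64 MB block size a 60 GB
    // file is already stored as roughly a thousand blocks.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    System.out.println(blocks.length + " blocks, block size "
        + status.getBlockSize() + " bytes");
  }
}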
> 
> Regards,
> Christoph
> 
> -----Original Message-----
> From: Mapred Learn [mailto:mapred.learn@gmail.com]
> Sent: Monday, June 20, 2011 08:36
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: How to split a big file in HDFS by size
> 
> Hi Christoph,
> If I get all 60 GB onto HDFS, can I then split it into 60 1-GB files and then run a map-red
job on those 60 fixed-length text files? If yes, do you have any idea how to do this?
> 
> On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz <Christoph.Schmitz@1und1.de>
wrote:
> 
> 
>    JJ,
>    
>    uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will be slow.
If possible, try to get the files in smaller chunks where they are created, and upload them
in parallel with a simple MapReduce job that only passes the data through (i.e. uses the standard
Mapper and Reducer classes). This job should read from your local input directory and output
into the HDFS.
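
(A minimal sketch of such a pass-through job, using the stock Mapper and Reducer, which are
identity implementations. The paths are placeholders, and the local input directory must be
readable from wherever the map tasks run.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThrough {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "pass-through");
    job.setJarByClass(PassThrough.class);

    // The stock Mapper and Reducer simply emit their input unchanged.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Local input, HDFS output (placeholder paths).
    FileInputFormat.addInputPath(job, new Path("file:///local/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/jj/input"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that with the default TextInputFormat and TextOutputFormat the byte-offset key is written
in front of each output line, so the copy is not byte-identical; a one-line custom Mapper that
emits a NullWritable key would avoid that.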
>    
>    If you cannot split the 60 GB where they are created, IMHO there is not much you can
do. If you have a file format with, say, fixed length records, you could try to create your
own InputFormat that splits the file logically without creating the actual splits locally
(which would be too costly, I assume).
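
(A rough sketch of such an InputFormat, assuming a made-up record length of 100 bytes: it only
computes offsets and lengths, so nothing is copied or rewritten locally. The RecordReader is left
out for brevity; newer Hadoop releases also include a ready-made FixedLengthInputFormat for this
case.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Splits a file of fixed-length records into ~1 GB logical splits aligned to
// record boundaries, without creating any physical copies of the data.
public class FixedRecordInputFormat extends FileInputFormat<LongWritable, BytesWritable> {
  private static final int RECORD_LEN = 100;                    // hypothetical record size
  private static final long TARGET_SPLIT = 1024L * 1024 * 1024; // ~1 GB per mapper

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    // Round the split size down to a whole number of records.
    long splitSize = (TARGET_SPLIT / RECORD_LEN) * RECORD_LEN;
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file : listStatus(job)) {
      Path path = file.getPath();
      long len = file.getLen();
      for (long offset = 0; offset < len; offset += splitSize) {
        splits.add(new FileSplit(path, offset, Math.min(splitSize, len - offset),
            new String[0]));
      }
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // A real implementation would return a reader that reads RECORD_LEN bytes
    // at a time from its split; omitted here to keep the sketch short.
    throw new UnsupportedOperationException("RecordReader not shown in this sketch");
  }
}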
>    
>    The performance of reading in parallel, though, will depend to a large extent on the
nature of your local storage. If you have a single hard drive, reading in parallel might actually
be slower than reading serially because it means a lot of random disk accesses.
>    
>    Regards,
>    Christoph
>    
>    -----Original Message-----
>    From: Mapred Learn [mailto:mapred.learn@gmail.com]
>    Sent: Monday, June 20, 2011 06:02
>    To: mapreduce-user@hadoop.apache.org; cdh-user@cloudera.org
>    Subject: How to split a big file in HDFS by size
>    
> 
>    Hi,
>    I am trying to upload text files of size 60 GB or more.
>    I want to split these files into smaller files of say 1 GB each so that I can run
further map-red jobs on them.
>    
>    Does anybody have any idea how I can do this?
>    Thanks a lot in advance! Any ideas are greatly appreciated!
>    
>    -JJ
>    
> 
> 
