hadoop-mapreduce-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: AW: How to split a big file in HDFS by size
Date Tue, 21 Jun 2011 20:03:33 GMT
Hi,

On Tue, Jun 21, 2011 at 16:14, Mapred Learn <mapred.learn@gmail.com> wrote:
> The problem is when 1 text file goes on HDFS as a 60 GB file, one mapper takes
> more than an hour to convert it to a sequence file and finally fails.
>
> I was thinking how to split it from the client box before uploading to HDFS.

Have a look at this:

http://stackoverflow.com/questions/3960651/splitting-gzipped-logfiles-without-storing-the-ungzipped-splits-on-disk
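The approach described there can be sketched roughly like this (a minimal sketch, assuming GNU coreutils `split` with `--filter`; `big.txt.gz` and the sizes are stand-ins, not your real file):

```shell
# Minimal sketch of the streaming split (assumes GNU coreutils `split`
# with `--filter`; big.txt.gz is a small stand-in for the real 60 GB file).
seq 1 100000 | gzip > big.txt.gz
# Decompress as a stream and cut it into fixed-size pieces, re-gzipping
# each piece on the fly, so the uncompressed splits never touch the disk.
# For the real file you would use something like `-b 500m`.
zcat big.txt.gz | split -b 100k -d --filter='gzip > "$FILE.gz"' - part_
ls part_*.gz
```

Each `part_NN.gz` is a valid gzip file on its own, so the parts can then be uploaded with `hadoop fs -put` and processed by separate mappers.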


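On the block-compression point in Evert's mail quoted below: what makes block compression (as in LZO) parallel-friendly is that each chunk decompresses independently. A small illustration of that property, using gzip on separate chunks as a stand-in codec (no LZO install assumed; the filenames are made up):

```shell
# Illustration with gzip standing in for a block-compressed codec such
# as LZO (hypothetical filenames; needs only GNU coreutils and gzip).
seq 1 50000 > data.txt
# Compress fixed-size chunks separately instead of the file as a whole.
split -b 64k -d --filter='gzip > "$FILE.gz"' data.txt chunk_
# Any single chunk decompresses on its own -- this independence is what
# allows one mapper per chunk instead of one mapper for the whole file.
zcat chunk_01.gz > /dev/null && echo "chunk_01 is independently readable"
```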
> If I read the file and split it with FileStream.Read() based on size, it takes
> 2 hours to process one 60 GB file and upload it to HDFS as 120 500 MB files.
> Sent from my iPhone
> On Jun 21, 2011, at 2:57 AM, Evert Lammerts <Evert.Lammerts@sara.nl> wrote:
>
> What we did was on non-Hadoop hardware. We streamed the file from a storage
> cluster to a single machine and cut it up while streaming the pieces back to
> the storage cluster. That will probably not work for you, unless you have
> the hardware for it. But even then it's inefficient.
>
> You should be able to unzip your file in a MR job. If you still want to use
> compression you can install LZO and rezip the file from within the same job.
> (LZO uses block-compression, which allows Hadoop to process all blocks in
> parallel.) Note that you’ll need enough storage capacity. I don’t have
> example code, but I’m guessing Google can help.
>
> From: Mapred Learn [mailto:mapred.learn@gmail.com]
> Sent: maandag 20 juni 2011 18:09
> To: Niels Basjes; Evert Lammerts
> Subject: Re: AW: How to split a big file in HDFS by size
>
> Thanks for sharing.
>
> Could you guys share how you are dividing your 2.7 TB into 10 GB files each
> on HDFS? That would be helpful for me!
>
> On Mon, Jun 20, 2011 at 8:39 AM, Marcos Ortiz <mlortiz@uci.cu> wrote:
>
> Evert Lammerts at Sara.nl did something similar to your problem, splitting a
> big 2.7 TB file into chunks of 10 GB.
> This work was presented at the BioAssist Programmers' Day in January of this
> year under the name
> "Large-Scale Data Storage and Processing for Scientist in The Netherlands"
>
> http://www.slideshare.net/evertlammerts
>
> P.S.: I sent the message with a copy to him
>
> On 6/20/2011 10:38 AM, Niels Basjes wrote:
>
> Hi,
>
> On Mon, Jun 20, 2011 at 16:13, Mapred Learn <mapred.learn@gmail.com> wrote:
>
>
> But this file is a gzipped text file. In this case, it will only go to 1
> mapper, unlike the case where it is
> split into 60 1 GB files, which would make the map-reduce job finish earlier
> than one 60 GB file, as it would
> have 60 mappers running in parallel. Isn't it so?
>
>
> Yes, that is very true.
>
> --
>
> Marcos Luís Ortíz Valmaseda
>  Software Engineer (UCI)
>  http://marcosluis2186.posterous.com
>  http://twitter.com/marcosluis2186
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
