hadoop-common-user mailing list archives

From Alex McLintock <alex.mclint...@gmail.com>
Subject Re: Cluster gets overloaded processing large files via streaming
Date Mon, 21 Sep 2009 19:00:54 GMT
I think the default chunk size you are referring to is the HDFS block size, which defaults to 64 MB.
That value was chosen to be roughly what a disk can deliver in a single sequential read.

I for one am a big Perl fan, but I am not happy about 64 MB (let alone
300 GB) of text being read into a Perl hash. Hashes trade memory for
speed, so they are wasteful of memory.
So my verdict is to rethink your Perl algorithm. Can you process some
of the data and then throw it away in the Perl script, rather than
holding everything in the hash at once?

(I'm happy to discuss optimising Perl, but let's do it off-list if it
isn't relevant to Hadoop.)
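The key observation is that Hadoop Streaming hands the reducer its input
already sorted by key, so all records for one key arrive contiguously on
stdin and each key can be aggregated and discarded before the next one
starts. Here is a minimal sketch of that incremental pattern, in Python
rather than Perl for brevity (the counting aggregation is a placeholder;
substitute whatever per-key work your reducer actually does):

```python
#!/usr/bin/env python
# Hypothetical sketch of an incremental Hadoop Streaming reducer.
# Streaming delivers reducer input sorted by key, so we can fold
# each key's records as they arrive instead of building one giant
# in-memory hash over the whole input.
import sys
from itertools import groupby


def parse(lines):
    """Split each tab-delimited streaming record into (key, value)."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def reduce_stream(lines, out=sys.stdout):
    # groupby works because equal keys are adjacent in sorted input;
    # each group is consumed and forgotten before the next key begins.
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        count = sum(1 for _ in group)  # placeholder aggregation
        out.write("%s\t%d\n" % (key, count))


if __name__ == "__main__":
    reduce_stream(sys.stdin)
```

The same shape works in Perl: keep a "current key" scalar, accumulate
while the incoming key matches it, and emit-and-reset whenever the key
changes, so memory stays proportional to one key's data rather than the
whole 300 GB.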


2009/9/21 Leo Alekseyev <dnquark@gmail.com>:
> Hi all,
> I have a streaming job running on ~300 GB of ASCII data in 3 large
> files, where the mapper and reducer are Perl scripts.  Mapper does
> trivial data cleanup, and reducer builds a hash then iterates over
> this hash writing output.  Hash key is the first field in the data,
> i.e. the same as the streaming map/reduce key.  However, the nodes
> become bogged down to the point of being unusable -- it looks like too
> much data is being read into memory.  I am relatively new to hadoop,
> so it's not clear to me how to ensure that the reduce tasks don't run
> out of memory...
> Thanks for any help!
