hadoop-common-user mailing list archives

From Leo Alekseyev <dnqu...@gmail.com>
Subject Cluster gets overloaded processing large files via streaming
Date Mon, 21 Sep 2009 18:45:13 GMT
Hi all,
I have a streaming job running on ~300 GB of ASCII data in 3 large
files, where the mapper and reducer are Perl scripts.  The mapper does
trivial data cleanup, and the reducer builds a hash and then iterates
over it writing output.  The hash key is the first field in the data,
i.e. the same as the streaming map/reduce key.  However, the nodes
become bogged down to the point of being unusable -- it looks like too
much data is being read into memory.  I am relatively new to Hadoop,
so it's not clear to me how to ensure that the reduce tasks don't run
out of memory...
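For what it's worth, since Hadoop streaming delivers reducer input already sorted by key, a reducer can process one key at a time and flush its result before moving to the next, instead of accumulating a hash over the whole partition.  The original scripts are Perl; below is only a minimal Python sketch of that pattern, with a hypothetical per-key record count standing in for whatever the real reducer computes per key:

```python
#!/usr/bin/env python3
# Sketch of a streaming reducer that keeps only ONE key's state in memory.
# Hadoop streaming feeds the reducer tab-separated "key\tvalue" lines,
# grouped and sorted by key, so a key change means the previous key is done.
import sys


def emit(key, count):
    # Write one output record per key; the count here is a placeholder
    # aggregation -- substitute the real per-key computation.
    print(f"{key}\t{count}")


def reduce_stream(lines):
    current_key, count = None, 0
    for line in lines:
        key, _sep, _value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                emit(current_key, count)  # flush the finished key
            current_key, count = key, 0
        count += 1  # per-key state stays O(1), not O(partition size)
    if current_key is not None:
        emit(current_key, count)  # flush the final key


if __name__ == "__main__":
    reduce_stream(sys.stdin)
```

Because state is dropped as soon as a key is finished, memory use no longer grows with the size of the reduce partition.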
Thanks for any help!
