hadoop-mapreduce-user mailing list archives

From Mario M <maqueo.ma...@gmail.com>
Subject Remote connection bottleneck?
Date Sat, 25 Sep 2010 11:55:41 GMT
I am having a problem that might be expected behaviour. I am using a Hadoop
cloud remotely through ssh. My program runs for about a minute: it processes
a 200 MB file using NLineInputFormat, and the user chooses the number of
lines per split. However, before the map-reduce phase starts, the part of
the program that divides the input runs locally on my computer. With a
100 Mbps connection to the cloud this isn't much of a problem, but from my
house, with a 1 Mbps connection, the program takes 30 minutes or more to
process this input. Apparently it is downloading the full 200 MB, scanning
it to decide the byte offsets for dividing the file, and sending those
offsets to the cloud.
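
To make the suspected cause concrete, here is a minimal sketch in Python. It is not Hadoop's actual code. The helper names line_split_offsets and transfer_seconds are hypothetical, and the estimate assumes 1 MB = 8 megabits with no protocol overhead. It shows why a split every N lines requires reading every byte of the file, and what pulling 200 MB over each link would cost in the ideal case.

```python
import io

def line_split_offsets(stream, lines_per_split):
    """Return (start_offset, length) pairs, one per group of N lines.

    Hypothetical helper: line boundaries are only known after scanning
    for newlines, so the whole stream has to be read.
    """
    splits = []
    start = 0   # byte offset where the current split begins
    pos = 0     # byte offset of the next unread byte
    count = 0   # lines accumulated in the current split
    for line in stream:  # iterating a binary stream yields one line at a time
        pos += len(line)
        count += 1
        if count == lines_per_split:
            splits.append((start, pos - start))
            start, count = pos, 0
    if count:  # trailing group with fewer than N lines
        splits.append((start, pos - start))
    return splits

def transfer_seconds(size_megabytes, link_megabits_per_s):
    """Ideal transfer time; ignores protocol overhead and latency (assumption)."""
    return size_megabytes * 8 / link_megabits_per_s

sample = io.BytesIO(b"a\nbb\nccc\ndddd\neeeee\n")
print(line_split_offsets(sample, 2))   # three splits covering all 20 bytes
print(transfer_seconds(200, 1) / 60)   # minutes for 200 MB at 1 Mbps
print(transfer_seconds(200, 100))      # seconds for 200 MB at 100 Mbps
```

Under these assumptions the 1 Mbps case comes to roughly 27 minutes and the 100 Mbps case to 16 seconds, which is in line with the startup times described above.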

This 30-minute startup time kills all the advantages of using mapreduce for
us. My question is: is this expected behaviour? Is the InputFormat phase of
the program supposed to run locally and not in the cloud? Or am I doing
something wrong? By contrast, I ran the terasort Hadoop example on 100
GB; it took 3-4 minutes of startup and then started the map phase, which
clearly shows that it isn't downloading all the data. Terasort
doesn't use NLineInputFormat, but doesn't it still have to read the files to
divide them?
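
One possible explanation for the difference, sketched in Python under an unverified assumption: Terasort may divide its input by byte size rather than by line count. Size-based splits can be computed from the file length alone, so the client would not need to read the data. The function name size_split_offsets and the 64 MB split size are illustrative choices, not taken from Hadoop.

```python
def size_split_offsets(file_length, split_size):
    """Return (start_offset, length) pairs using only the file length.

    Hypothetical helper: no file contents are read. A split boundary may
    fall in the middle of a record; handling that is left to whatever
    reads each split later.
    """
    splits = []
    start = 0
    while start < file_length:
        length = min(split_size, file_length - start)
        splits.append((start, length))
        start += length
    return splits

MB = 1024 * 1024
splits = size_split_offsets(200 * MB, 64 * MB)  # 64 MB is an assumed split size
print(len(splits))   # number of splits for a 200 MB file
print(splits[-1])    # the last split holds the remainder
```

If this is what happens, the client only needs file metadata (names and lengths) from the cluster, which would be cheap even over a slow link.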

Thank you in advance for your time. :)

Mario Maqueo
