hadoop-mapreduce-user mailing list archives

From Mario M <maqueo.ma...@gmail.com>
Subject Re: Remote connection bottleneck?
Date Sun, 26 Sep 2010 01:53:44 GMT
Hi,
what I did was this:

I am working with Cygwin in Windows 7.

- I copied my jar file ITESMCEMdebug.jar to the cluster, into the directory
/home/mariom. (I then connected via ssh and confirmed that it is there.)

- I left the ssh window open and opened another cygwin shell.

- In the new shell, I went to the hadoop/bin directory on my computer and
ran:

"bash hadoop jar /home/mariom/ITESMCEMdebug.jar"

(I omitted the arguments just to test; my program outputs the usage
instructions when called without arguments.)

- I got this:

Exception in thread "main" java.io.IOException: Error opening job jar:
/home/mariom/ITESMCEMdebug.jar
        at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
Caused by: java.io.FileNotFoundException: \home\mariom\ITESMCEMdebug.jar (El
sistema no puede encontrar la ruta especificada)
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:114)
        at java.util.jar.JarFile.<init>(JarFile.java:133)
        at java.util.jar.JarFile.<init>(JarFile.java:70)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:88)

- If I run my local jar file with "bash hadoop jar ITESMCEMdebug.jar", it
works fine (it outputs the usage instructions).
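For what it's worth, the backslashes in the FileNotFoundException (\home\mariom\...) suggest the Windows-side JVM tried to open the cluster path on the local machine. `hadoop jar <path>` opens <path> on the machine where the command runs: it is a local filesystem path, not an HDFS path and not a path on a remote cluster. A sketch of what appears to be going on, with user@cluster as a hypothetical host name:

```shell
# The job has to be launched from inside the ssh session, where the
# path /home/mariom/ITESMCEMdebug.jar actually exists:
#   ssh user@cluster
#   hadoop jar /home/mariom/ITESMCEMdebug.jar <MainClass> <args>
#
# Local illustration of the rule: a path that exists in one environment
# need not exist in the other.
tmp=$(mktemp -d)
touch "$tmp/job.jar"
test -f "$tmp/job.jar" && echo "local jar: found"
test -f /home/mariom/ITESMCEMdebug.jar || echo "cluster-only path: not found here"
rm -r "$tmp"
```

In other words, the command that fails from Cygwin should work when typed in the open ssh window on the cluster itself.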

Also, is it OK that I have to write "bash" every time? The examples I have
seen just use "hadoop jar etc."; I guess this is Cygwin-specific, because
otherwise it says "bash: hadoop: command not found".
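A guess about the "bash" prefix: "bash: hadoop: command not found" usually means the `hadoop` script is either not on PATH (the current directory is not searched by default) or lacks the execute bit; prefixing `bash` sidesteps both, because bash then reads the script as an ordinary file argument. A small local demonstration of the mechanism, using a stand-in script (`mytool` is hypothetical, standing in for the real hadoop launcher):

```shell
demo=$(mktemp -d)
printf '#!/bin/sh\necho demo ran\n' > "$demo/mytool"
sh "$demo/mytool"            # works without the execute bit, like "bash hadoop"
chmod +x "$demo/mytool"      # grant execute permission
export PATH="$demo:$PATH"    # put its directory on the search path
mytool                       # now found by bare name
rm -rf "$demo"
```

So with hadoop/bin added to PATH and the script executable, plain `hadoop jar ...` should work; from inside the bin directory, `./hadoop jar ...` also works.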

Thanks again :) for your time.

Mario Maqueo
ITESM-CEM



PS: "El sistema no puede encontrar la ruta especificada" = "The system
cannot find the path specified", in case the Spanish text might confuse you.


2010/9/25 Ted Yu <yuzhihong@gmail.com>

> Mario:
> Can you show us the error when you run the following ?
> "hadoop jar <route where I placed the file with the ssh connection>
> <arguments>"
>
>
>
>>> Hello,
>>> please excuse my ignorance, but how can I run it from there?
>>> Up to now I've been running the programs with "hadoop jar <localfile>
>>> <arguments>".
>>>
>>> I tried copying the jar to the HDFS and using "hadoop jar <HDFS route>
>>> <arguments>" but that didn't work (file not found), so I went to the ssh
>>> connection and copied the jar to my directory in there, but now I don't know
>>> how to run it from there.  "hadoop jar <route where I placed the file with
>>> the ssh connection> " didn't work.
>>>
>>> I am not very experienced with ssh, so I am sorry if this is basic stuff.
>>>
>>> Thanks,
>>>
>>> Mario Maqueo
>>> ITESM-CEM
>>>
>>> 2010/9/25 Ted Yu <yuzhihong@gmail.com>
>>>
>>> Mario:
>>>> Please produce a jar, place it on one of the servers in the cloud and
>>>> run from there.
>>>>
>>>>
>>>> On Sat, Sep 25, 2010 at 7:46 AM, Raja Thiruvathuru <
>>>> thiruvathuru@gmail.com> wrote:
>>>>
>>>>> MapReduce doesn't download the actual data, but it reads meta-data
>>>>> before it starts MapReduce job
>>>>>
>>>>>
>>>>> On Sat, Sep 25, 2010 at 7:55 AM, Mario M <maqueo.mario@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>> I am having a problem that might be expected behaviour. I am using a
>>>>>> cloud with Hadoop remotely through ssh. I have a program that runs for
>>>>>> about a minute; it processes a 200 MB file using NLineInputFormat, and
>>>>>> the user decides the number of lines to divide the file. However,
>>>>>> before the map-reduce phase starts, the part of the program that
>>>>>> divides the input runs locally on my computer. If I use a 100 Mbps
>>>>>> connection to access the cloud, it isn't much of a problem, but at my
>>>>>> house with a 1 Mbps connection, the program takes about 30 minutes or
>>>>>> more to process this input. Apparently it is downloading the full
>>>>>> 200 MB, processing it to decide the byte offsets for dividing the
>>>>>> file, and sending that to the cloud.
>>>>>>
>>>>>> This 30-minute startup time kills all the advantages of using
>>>>>> mapreduce for us. My question is: is this expected behaviour? Is the
>>>>>> InputFormat phase of the program supposed to run locally and not in
>>>>>> the cloud? Or am I doing something wrong? As a contrast, I ran the
>>>>>> terasort Hadoop example for 100 GB and it took 3-4 minutes of startup
>>>>>> and then started the map phase, which clearly shows that it isn't
>>>>>> downloading all the information. Terasort doesn't use
>>>>>> NLineInputFormat, but it still has to read the files to divide them,
>>>>>> does it not?
>>>>>>
>>>>>> Thank you in advance for your time. :)
>>>>>>
>>>>>> Mario Maqueo
>>>>>> ITESM-CEM
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Raja Thiruvathuru
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Raja Thiruvathuru
>>
> On Sat, Sep 25, 2010 at 12:27 PM, Mario M <maqueo.mario@gmail.com> wrote:
>
>
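Regarding the split-computation question quoted above: in the 0.20-era API, InputFormat.getSplits() runs in the job client on the submitting machine. NLineInputFormat has to find the byte offset at which each group of N lines starts, which requires scanning the entire input, so a remote client ends up pulling the whole 200 MB before submission; FileInputFormat-based formats, by contrast, can split on HDFS block boundaries using only metadata from the NameNode. A rough local sketch of the N-line offset scan (N = 2, with a tiny stand-in file):

```shell
# Computing where each split of N=2 lines begins requires reading every
# byte of the file -- which is why a remote client downloads the whole
# input when NLineInputFormat computes the splits.
tmp=$(mktemp)
printf 'a\nb\nc\nd\ne\n' > "$tmp"    # stand-in for the 200 MB input
awk 'BEGIN { off = 0 }
     NR % 2 == 1 { print "split starts at byte", off }
     { off += length($0) + 1 }' "$tmp"
rm "$tmp"
```

This is consistent with the advice to submit the job from a machine inside the cluster: the scan still happens, but over the local network rather than a 1 Mbps link.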
