mahout-user mailing list archives

From: nfantone <nfant...@gmail.com>
Subject: Re: Clustering from DB
Date: Mon, 20 Jul 2009 18:58:14 GMT
Update: I tried running the cluster with just two particular nodes and
got the same errors. So I'm thinking it may have something to do with
the connection to that PC (hadoop-slave01, aka 'orco').

Here's what the jobtracker log shows from the master:

2009-07-20 15:46:22,366 INFO org.apache.hadoop.mapred.JobInProgress:
Failed fetch notification #1 for task
attempt_200907201540_0001_m_000001_0
2009-07-20 15:46:28,113 INFO org.apache.hadoop.mapred.TaskInProgress:
Error from attempt_200907201540_0001_r_000002_0: Shuffle Error:
Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
2009-07-20 15:46:28,114 INFO org.apache.hadoop.mapred.JobTracker:
Adding task (cleanup)'attempt_200907201540_0001_r_000002_0' to tip
task_200907201540_0001_r_000002, for tracker
'tracker_orco.3kh.net:localhost/127.0.0.1:59814'
2009-07-20 15:46:31,116 INFO org.apache.hadoop.mapred.TaskInProgress:
Error from attempt_200907201540_0001_r_000000_0: Shuffle Error:
Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Why does it show 'orco.3kh.net:localhost'? I know that name is in
/etc/hosts, but I didn't expect Hadoop to take into account any lines
apart from the ones specifying IPs for the master and slaves. Is it
attempting to connect to itself and failing?
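
If that's the case, I suppose the Debian/Ubuntu-style 127.0.1.1 line is
to blame: as far as I can tell, the TaskTracker advertises a name
derived from resolving the local hostname, and on slave01 'orco'
resolves through the 127.0.1.1 entry, so the other nodes get handed a
localhost address they can't fetch from. Just a guess at what slave01's
/etc/hosts should look like instead (assuming its real LAN IP is
192.168.200.162, as in the files quoted below):

127.0.0.1       localhost
# 127.0.1.1     orco.3kh.net orco localhost.localdomain
192.168.200.20  hadoop-master
192.168.200.90  hadoop-slave00
192.168.200.162 hadoop-slave01 orco.3kh.net orco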


On Mon, Jul 20, 2009 at 1:30 PM, nfantone<nfantone@gmail.com> wrote:
> Ok, here's my failure report:
>
> I can't get more than two nodes working in the cluster. With just a
> master and a slave, everything seems to go smoothly. However, if I add
> a third datanode (the master itself also acting as a datanode), I keep
> getting this error while running the wordcount example, which I'm
> using to test the setup:
>
> 09/07/20 12:51:45 INFO mapred.JobClient:  map 100% reduce 17%
> 09/07/20 12:51:47 INFO mapred.JobClient: Task Id :
> attempt_200907201251_0001_m_000004_0, Status : FAILED
> Too many fetch-failures
> 09/07/20 12:51:48 WARN mapred.JobClient: Error reading task outputNo
> route to host
>
> While the mapping completes, the reduce task gets stuck at around 16%
> every time. I have googled the error message and read some responses
> from this list and other related forums, and it seems to point to a
> firewall issue or ports not being opened; yet that is not my case: the
> firewall has been disabled on every node, and the connections between
> them (to and from) seem to be fine.
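>
> For what it's worth, my understanding is that those fetch failures are
> the reducers failing to pull map output over HTTP from each
> TaskTracker's embedded web server (port 50060, assuming the default
> mapred.task.tracker.http.address), so a quick sanity check might be to
> probe that port from every node:
>
> # prints an HTTP status code per host if reachable, 000 if not
> for h in hadoop-master hadoop-slave00 hadoop-slave01; do
>   printf '%s: ' "$h"
>   curl -s -o /dev/null -w '%{http_code}\n' "http://$h:50060/"
> done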
>
> Here are my /etc/hosts files for each node:
>
> (master)
> 127.0.0.1       localhost
> 127.0.1.1       mauroN-Linux
> 192.168.200.20  hadoop-master
> 192.168.200.90  hadoop-slave00
> 192.168.200.162 hadoop-slave01
>
> (slave00)
> 127.0.0.1       localhost
> 127.0.1.1       tagore
> 192.168.200.20  hadoop-master
> 192.168.200.90  hadoop-slave00
> 192.168.200.162 hadoop-slave01
>
> (slave01)
> 127.0.0.1       localhost
> 127.0.1.1       orco.3kh.net orco localhost.localdomain
> 192.168.200.20  hadoop-master
> 192.168.200.90  hadoop-slave00
> 192.168.200.162 hadoop-slave01
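>
> As a sanity check on name resolution (which seems to be what Hadoop
> uses to name each tracker), something like this on every node shows
> what the local hostname actually maps to:
>
> getent hosts $(hostname)      # on slave01 this presumably returns 127.0.1.1
> getent hosts hadoop-slave01   # should return 192.168.200.162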
>
> And the .xml conf files, which are the same on every node (only the
> relevant lines shown):
>
> (core-site.xml)
> <name>hadoop.tmp.dir</name>
> <value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
>
> <name>fs.default.name</name>
> <value>hdfs://hadoop-master:54310/</value>
> <final>true</final>
>
> (mapred-site.xml)
> <name>mapred.job.tracker</name>
> <value>hdfs://hadoop-master:54311/</value>
> <final>true</final>
>
> <name>mapred.map.tasks</name>
> <value>31</value>
>
> <name>mapred.reduce.tasks</name>
> <value>6</value>
>
> (hdfs-site.xml)
> <name>dfs.replication</name>
> <value>3</value>
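>
> One thing I'm unsure about: the guide I'm following gives
> mapred.job.tracker as a plain host:port pair rather than an hdfs://
> URL, so perhaps that value should simply read:
>
> <name>mapred.job.tracker</name>
> <value>hadoop-master:54311</value>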
>
> I noticed that if I reduce mapred.reduce.tasks to 2 or 3, the error
> does not pop up, but the job takes quite a long time to finish (more
> than it takes a single machine). I have blacklisted IPv6 and enabled
> ip_forward on every node (sudo echo 1 > /proc/sys/net/ipv4/ip_forward).
> Should anyone need some info from the datanode logs, I can post it.
> I'm running out of ideas... and am in need of enlightenment.
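>
> A side note on that ip_forward command: the redirection in "sudo echo
> 1 > /proc/sys/net/ipv4/ip_forward" is performed by the calling shell,
> not by sudo, so it fails with "Permission denied" unless the shell
> itself is root. Two equivalent forms that do run the write as root:
>
> echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
> sudo sysctl -w net.ipv4.ip_forward=1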
>
> On Thu, Jul 16, 2009 at 9:39 AM, nfantone<nfantone@gmail.com> wrote:
>> I really appreciate all your suggestions, but from where I am, and
>> considering the place I work at (a rather small office in Argentina),
>> these things aren't that affordable (monetarily or bureaucratically
>> speaking). That being said, I managed to get my hands on some more
>> equipment, and I may be able to set up a small cluster of three or
>> four nodes, all running Ubuntu on a local network. What I need to
>> learn now is exactly how to configure everything required to create
>> it, as I have virtually no idea of and no experience with this kind
>> of task. Luckily, googling led me to some tutorials and documentation
>> on the subject. I'll be following this guide for now:
>>
>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>>
>> I'll let you know what comes out of this (surely, something on the
>> messy side of things). Any more suggestions/ideas are more than
>> welcome. Many thanks, again.
>>
>
