hadoop-common-user mailing list archives

From Ken Goodhope <kengoodh...@gmail.com>
Subject Re: newbie - job failing at reduce
Date Wed, 30 Jun 2010 16:58:14 GMT
Have you increased your file handle limits?  You can check this with a
'ulimit -n' call.   If you are still at 1024, then you will want to increase
the limit to something quite a bit higher.
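For example, on a typical Linux box (assuming the daemons run as a user
named 'hadoop' -- adjust to your own setup) you can check the current
limit and then raise it persistently:

    $ ulimit -n
    1024

    # add to /etc/security/limits.conf, then log in again
    # (or restart the daemons) so the new limit takes effect:
    hadoop  soft  nofile  16384
    hadoop  hard  nofile  16384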

On Wed, Jun 30, 2010 at 9:40 AM, Siddharth Karandikar <
siddharth.karandikar@gmail.com> wrote:

> Yeah. SSH is working as mentioned in the docs. Even the directory
> mentioned for 'mapred.local.dir' has enough space.
>
> - Siddharth
>
> On Wed, Jun 30, 2010 at 10:01 PM, Chris Collord <ccollord@lanl.gov> wrote:
> > Interesting that the reduce phase makes it that far before failing!
> > Are you able to SSH (without a password) into the failing node?  Any
> > possible folder permissions issues?
> > ~Chris
> >
> > On 06/30/2010 10:26 AM, Siddharth Karandikar wrote:
> >>
> >> Hey Chris,
> >> Thanks for your inputs. I have tried most of the stuff, but will
> >> surely go through the tutorial you have pointed out. Maybe I will get
> >> some hint there.
> >>
> >> Interestingly, while experimenting with it more, I noticed that with a
> >> small input file (~50 MB) the job works perfectly fine. If I give it
> >> bigger input, it starts hanging at the reduce tasks. The map phase
> >> always finishes 100%.
> >>
> >> - Siddharth
> >>
> >>
> >> On Wed, Jun 30, 2010 at 9:11 PM, Chris Collord <ccollord@lanl.gov> wrote:
> >>
> >>>
> >>> Hi Siddharth,
> >>> I'm VERY new to this myself, but here are a few thoughts (since
> >>> nobody else is responding!).
> >>> -You might want to set dfs.replication to 2.  I have read that for
> >>> clusters of fewer than 8 nodes you should have replication set to 2;
> >>> 8+ node clusters use 3 (see the snippet after this list).  This may
> >>> make your cluster work, but it won't fix your problem.
> >>> -Run a "bin/hadoop dfsadmin -report" with the hadoop cluster running
> >>> and see what it shows for your failing node.
> >>> -Check your logs/ folder for "datanode" logs and see if there's
> >>> anything useful in there before the error you're getting.
> >>> -You might try reformatting your hdfs, if you don't have anything
> >>> important in there: "bin/hadoop namenode -format".  (Note: this has
> >>> caused problems for me in the past with namenode IDs; see the bottom
> >>> of the link to Michael Noll's tutorial if that happens.)
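> >>>
> >>> For the dfs.replication suggestion above, the change would just be in
> >>> conf/hdfs-site.xml -- the same property you already have, with the
> >>> value lowered to 2:
> >>>
> >>>  <property>
> >>>    <name>dfs.replication</name>
> >>>    <value>2</value>
> >>>  </property>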
> >>>
> >>> You should check out Michael Noll's tutorial for all the little
> >>> details:
> >>>
> >>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
> >>>
> >>> Let me know if anything helps!
> >>> ~Chris
> >>>
> >>>
> >>>
> >>> On 06/30/2010 04:02 AM, Siddharth Karandikar wrote:
> >>>
> >>>>
> >>>> Anyone?
> >>>>
> >>>>
> >>>> On Tue, Jun 29, 2010 at 8:41 PM, Siddharth Karandikar
> >>>> <siddharth.karandikar@gmail.com>    wrote:
> >>>>
> >>>>
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> I am new to Hadoop, but by reading online docs and other resources I
> >>>>> have moved ahead and am now trying to run a cluster of 3 nodes.
> >>>>> Before doing this, I tried my program in standalone and
> >>>>> pseudo-distributed modes and that's working fine.
> >>>>>
> >>>>> Now the issue that I am facing - the map phase works correctly.
> >>>>> While doing reduce, I am seeing the following error on one of the
> >>>>> nodes -
> >>>>>
> >>>>> 2010-06-29 14:35:01,848 WARN org.apache.hadoop.mapred.TaskTracker:
> >>>>> getMapOutput(attempt_201006291958_0001_m_000008_0,0) failed :
> >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> >>>>> taskTracker/jobcache/job_201006291958_0001/attempt_201006291958_0001_m_000008_0/output/file.out.index
> >>>>> in any of the configured local directories
> >>>>>
> >>>>> Let's say this is on Node1. But there is no such directory named
> >>>>> 'taskTracker/jobcache/job_201006291958_0001/attempt_201006291958_0001_m_000008_0'
> >>>>> under /tmp/mapred/local/taskTracker/ on Node1. Interestingly, this
> >>>>> directory is available on Node2 (or Node3). I tried running the job
> >>>>> multiple times, but it's always failing while reducing. Same error.
> >>>>>
> >>>>> I have configured /tmp/mapred/local on each node from
> >>>>> mapred-site.xml.
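> >>>>>
> >>>>> A quick way to check that the directory actually exists, is writable
> >>>>> and has space on every node (the host names below are just
> >>>>> placeholders for my three machines) is something like:
> >>>>>
> >>>>>   for h in node1 node2 node3; do
> >>>>>     ssh $h 'hostname; ls -ld /tmp/mapred/local; df -h /tmp/mapred/local'
> >>>>>   done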
> >>>>>
> >>>>> I really don't understand why the mappers are misplacing these
> >>>>> files. Or am I missing something in the configuration?
> >>>>>
> >>>>> If someone wants to look at the configurations, I have pasted them below.
> >>>>>
> >>>>> Thanks,
> >>>>> Siddharth
> >>>>>
> >>>>>
> >>>>> Configurations
> >>>>> ==========
> >>>>>
> >>>>> conf/core-site.xml
> >>>>> ---------------------------
> >>>>>
> >>>>> <?xml version="1.0"?>
> >>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >>>>> <configuration>
> >>>>>  <property>
> >>>>>    <name>fs.default.name</name>
> >>>>>    <value>hdfs://192.168.2.115/</value>
> >>>>>  </property>
> >>>>> </configuration>
> >>>>>
> >>>>>
> >>>>> conf/hdfs-site.xml
> >>>>> --------------------------
> >>>>> <?xml version="1.0"?>
> >>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >>>>> <configuration>
> >>>>>  <property>
> >>>>>    <name>fs.default.name</name>
> >>>>>    <value>hdfs://192.168.2.115</value>
> >>>>>  </property>
> >>>>>  <property>
> >>>>>    <name>dfs.data.dir</name>
> >>>>>    <value>/home/siddharth/hdfs/data</value>
> >>>>>  </property>
> >>>>>  <property>
> >>>>>    <name>dfs.name.dir</name>
> >>>>>    <value>/home/siddharth/hdfs/name</value>
> >>>>>  </property>
> >>>>>  <property>
> >>>>>    <name>dfs.replication</name>
> >>>>>    <value>3</value>
> >>>>>  </property>
> >>>>> </configuration>
> >>>>>
> >>>>> conf/mapred-site.xml
> >>>>> ------------------------------
> >>>>> <?xml version="1.0"?>
> >>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >>>>> <configuration>
> >>>>>  <property>
> >>>>>    <name>mapred.job.tracker</name>
> >>>>>    <value>192.168.2.115:8021</value>
> >>>>>  </property>
> >>>>>  <property>
> >>>>>    <name>mapred.local.dir</name>
> >>>>>    <value>/tmp/mapred/local</value>
> >>>>>    <final>true</final>
> >>>>>  </property>
> >>>>>  <property>
> >>>>>    <name>mapred.system.dir</name>
> >>>>>    <value>hdfs://192.168.2.115/maperdsystem</value>
> >>>>>    <final>true</final>
> >>>>>  </property>
> >>>>>  <property>
> >>>>>    <name>mapred.tasktracker.map.tasks.maximum</name>
> >>>>>    <value>4</value>
> >>>>>    <final>true</final>
> >>>>>  </property>
> >>>>>  <property>
> >>>>>    <name>mapred.tasktracker.reduce.tasks.maximum</name>
> >>>>>    <value>4</value>
> >>>>>    <final>true</final>
> >>>>>  </property>
> >>>>>  <property>
> >>>>>    <name>mapred.child.java.opts</name>
> >>>>>    <value>-Xmx512m</value>
> >>>>>    <!-- Not marked as final so jobs can include JVM debugging options -->
> >>>>>  </property>
> >>>>> </configuration>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>> --
> >>> ------------------------------
> >>> Chris Collord, ACS-PO 9/80 A
> >>> ------------------------------
> >>>
> >>>
> >>>
> >
> >
> > --
> > ------------------------------
> > Chris Collord, ACS-PO 9/80 A
> > ------------------------------
> >
> >
>
