hadoop-common-user mailing list archives

From Siddharth Karandikar <siddharth.karandi...@gmail.com>
Subject Re: newbie - job failing at reduce
Date Fri, 02 Jul 2010 11:57:36 GMT
I am running with 10240 now and the jobs look to be working fine. I still
need to confirm this by reverting to 1024 and seeing the jobs fail.  :)
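
For the record, the change itself boils down to something like the following
on each node (how you persist it, limits.conf vs. your init scripts, depends
on the distro; the 10240 value and the "hadoop" user below are just examples):

    # check the current per-process open-file limit
    ulimit -n

    # raise it for the current shell session (raising the hard limit may need root)
    ulimit -n 10240

    # to make it stick across logins, add lines like these to
    # /etc/security/limits.conf and log the Hadoop user out and back in
    hadoop  soft  nofile  10240
    hadoop  hard  nofile  10240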

Thanks!



On Wed, Jun 30, 2010 at 10:59 PM, Siddharth Karandikar
<siddharth.karandikar@gmail.com> wrote:
> Yeah. Looks like it's set to 1024 right now. I'll change that to, say, 10
> times that and run the setup again.
> Thanks Ken!
>
> - Siddharth
>
> On Wed, Jun 30, 2010 at 10:28 PM, Ken Goodhope <kengoodhope@gmail.com> wrote:
>> Have you increased your file handle limits?  You can check this with a
>> 'ulimit -n' call.   If you are still at 1024, then you will want to increase
>> the limit to something quite a bit higher.
>>
>> On Wed, Jun 30, 2010 at 9:40 AM, Siddharth Karandikar <
>> siddharth.karandikar@gmail.com> wrote:
>>
>>> Yeah, SSH is working as described in the docs. The directory
>>> configured for 'mapred.local.dir' also has enough space.
>>>
>>> - Siddharth
>>>
>>> On Wed, Jun 30, 2010 at 10:01 PM, Chris Collord <ccollord@lanl.gov> wrote:
>>> > Interesting that the reduce phase makes it that far before failing!
>>> > Are you able to SSH (without a password) into the failing node?  Any
>>> > possible folder permissions issues?
>>> > ~Chris
>>> >
>>> > On 06/30/2010 10:26 AM, Siddharth Karandikar wrote:
>>> >>
>>> >> Hey Chris,
>>> >> Thanks for your inputs. I have tried most of this already, but I will
>>> >> surely go through the tutorial you pointed out. Maybe I will find some
>>> >> hint there.
>>> >>
>>> >> Interestingly, while experimenting with it more, I noticed that with a
>>> >> small input file (~50 MB) the job works perfectly fine. With a bigger
>>> >> input, it starts hanging at the reduce tasks. The map phase always
>>> >> finishes at 100%.
>>> >>
>>> >> - Siddharth
>>> >>
>>> >>
>>> >> On Wed, Jun 30, 2010 at 9:11 PM, Chris Collord <ccollord@lanl.gov> wrote:
>>> >>
>>> >>>
>>> >>> Hi Siddharth,
>>> >>> I'm VERY new to this myself, but here are a few thoughts (since
>>> >>> nobody else is responding!).
>>> >>> -You might want to set dfs.replication to 2.  I have read that for
>>> >>> clusters of fewer than 8 nodes you should have replication set to 2
>>> >>> machines; 8+ node clusters use 3.  This may make your cluster work,
>>> >>> but it won't fix your problem.
>>> >>> -Run a "bin/hadoop dfsadmin -report" with the hadoop cluster running
>>> >>> and see what it shows for your failing node.
>>> >>> -Check your logs/ folder for "datanode" logs and see if there's
>>> >>> anything useful in there before the error you're getting.
>>> >>> -You might try reformatting your hdfs, if you don't have anything
>>> >>> important in there: "bin/hadoop namenode -format".  (Note: this has
>>> >>> caused problems for me in the past with namenode IDs; see the bottom
>>> >>> of the link to Michael Noll's tutorial if that happens.)
>>> >>>
>>> >>> You should check out Michael Noll's tutorial for all the little
>>> >>> details:
>>> >>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
>>> >>>
>>> >>> Let me know if anything helps!
>>> >>> ~Chris
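
(For anyone finding this thread later: Chris's checklist boils down to roughly
the commands below, run from the Hadoop install directory on the failing node.
The log file names assume the default naming scheme, so adjust for your
install, and note that "namenode -format" destroys everything stored in HDFS.)

    # cluster-wide view of live datanodes, capacity and missing blocks
    bin/hadoop dfsadmin -report

    # look for errors in the datanode and tasktracker logs on the failing node
    tail -n 200 logs/hadoop-*-datanode-*.log
    tail -n 200 logs/hadoop-*-tasktracker-*.log

    # last resort: stop the cluster and reformat HDFS (wipes all HDFS data)
    bin/stop-all.sh
    bin/hadoop namenode -format
    bin/start-all.sh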
>>> >>>
>>> >>>
>>> >>>
>>> >>> On 06/30/2010 04:02 AM, Siddharth Karandikar wrote:
>>> >>>
>>> >>>>
>>> >>>> Anyone?
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Jun 29, 2010 at 8:41 PM, Siddharth Karandikar
>>> >>>> <siddharth.karandikar@gmail.com>    wrote:
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>> Hi All,
>>> >>>>>
>>> >>>>> I am new to Hadoop, but by reading the online docs and other
>>> >>>>> resources I have moved ahead and am now trying to run a cluster of
>>> >>>>> 3 nodes. Before doing this, I tried my program on the standalone and
>>> >>>>> pseudo-distributed setups, and that is working fine.
>>> >>>>>
>>> >>>>> Now the issue that I am facing: the map phase works correctly. While
>>> >>>>> doing the reduce, I am seeing the following error on one of the
>>> >>>>> nodes -
>>> >>>>>
>>> >>>>> 2010-06-29 14:35:01,848 WARN org.apache.hadoop.mapred.TaskTracker:
>>> >>>>> getMapOutput(attempt_201006291958_0001_m_000008_0,0) failed :
>>> >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>> >>>>> taskTracker/jobcache/job_201006291958_0001/attempt_201006291958_0001_m_000008_0/output/file.out.index
>>> >>>>> in any of the configured local directories
>>> >>>>>
>>> >>>>> Let's say this is on Node1. But there is no such directory named
>>> >>>>> 'taskTracker/jobcache/job_201006291958_0001/attempt_201006291958_0001_m_000008_0'
>>> >>>>> under /tmp/mapred/local/taskTracker/ on Node1. Interestingly, this
>>> >>>>> directory is available on Node2 (or Node3). I have tried running the
>>> >>>>> job multiple times, but it always fails while reducing, with the same
>>> >>>>> error.
>>> >>>>>
>>> >>>>> I have configured /tmp/mapred/local on each node via mapred-site.xml.
>>> >>>>>
>>> >>>>> I really don't understand why the mappers are misplacing these files.
>>> >>>>> Or am I missing something in the configuration?
>>> >>>>>
>>> >>>>> If someone wants to look at the configurations, I have pasted them
>>> >>>>> below.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Siddharth
>>> >>>>>
>>> >>>>>
>>> >>>>> Configurations
>>> >>>>> ==========
>>> >>>>>
>>> >>>>> conf/core-site.xml
>>> >>>>> ---------------------------
>>> >>>>>
>>> >>>>> <?xml version="1.0"?>
>>> >>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>> >>>>> <configuration>
>>> >>>>>  <property>
>>> >>>>>    <name>fs.default.name</name>
>>> >>>>>    <value>hdfs://192.168.2.115/</value>
>>> >>>>>  </property>
>>> >>>>> </configuration>
>>> >>>>>
>>> >>>>>
>>> >>>>> conf/hdfs-site.xml
>>> >>>>> --------------------------
>>> >>>>> <?xml version="1.0"?>
>>> >>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>> >>>>> <configuration>
>>> >>>>>  <property>
>>> >>>>>    <name>fs.default.name</name>
>>> >>>>>    <value>hdfs://192.168.2.115</value>
>>> >>>>>  </property>
>>> >>>>>  <property>
>>> >>>>>    <name>dfs.data.dir</name>
>>> >>>>>    <value>/home/siddharth/hdfs/data</value>
>>> >>>>>  </property>
>>> >>>>>  <property>
>>> >>>>>    <name>dfs.name.dir</name>
>>> >>>>>    <value>/home/siddharth/hdfs/name</value>
>>> >>>>>  </property>
>>> >>>>>  <property>
>>> >>>>>    <name>dfs.replication</name>
>>> >>>>>    <value>3</value>
>>> >>>>>  </property>
>>> >>>>> </configuration>
>>> >>>>>
>>> >>>>> conf/mapred-site.xml
>>> >>>>> ------------------------------
>>> >>>>> <?xml version="1.0"?>
>>> >>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>> >>>>> <configuration>
>>> >>>>>  <property>
>>> >>>>>    <name>mapred.job.tracker</name>
>>> >>>>>    <value>192.168.2.115:8021</value>
>>> >>>>>  </property>
>>> >>>>>  <property>
>>> >>>>>    <name>mapred.local.dir</name>
>>> >>>>>    <value>/tmp/mapred/local</value>
>>> >>>>>    <final>true</final>
>>> >>>>>  </property>
>>> >>>>>  <property>
>>> >>>>>    <name>mapred.system.dir</name>
>>> >>>>>    <value>hdfs://192.168.2.115/maperdsystem</value>
>>> >>>>>    <final>true</final>
>>> >>>>>  </property>
>>> >>>>>  <property>
>>> >>>>>    <name>mapred.tasktracker.map.tasks.maximum</name>
>>> >>>>>    <value>4</value>
>>> >>>>>    <final>true</final>
>>> >>>>>  </property>
>>> >>>>>  <property>
>>> >>>>>    <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>> >>>>>    <value>4</value>
>>> >>>>>    <final>true</final>
>>> >>>>>  </property>
>>> >>>>>  <property>
>>> >>>>>    <name>mapred.child.java.opts</name>
>>> >>>>>    <value>-Xmx512m</value>
>>> >>>>>    <!-- Not marked as final so jobs can include JVM debugging options -->
>>> >>>>>  </property>
>>> >>>>> </configuration>
>>> >>>>>
>>> >>>>>
>>> >>>>>
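
(A quick sanity check for the "not find ... in any of the configured local
directories" error is to confirm that mapred.local.dir actually exists, is
writable by the tasktracker user, and has free space on every node; the node
names below are placeholders for whatever is in your slaves file.)

    for host in node1 node2 node3; do
        ssh $host 'ls -ld /tmp/mapred/local && df -h /tmp'
    done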
>>> >>>
>>> >>> --
>>> >>> ------------------------------
>>> >>> Chris Collord, ACS-PO 9/80 A
>>> >>> ------------------------------
>>> >>>
>>> >>>
>>> >>>
>>> >
>>> >
>>> > --
>>> > ------------------------------
>>> > Chris Collord, ACS-PO 9/80 A
>>> > ------------------------------
>>> >
>>> >
>>>
>>
>
