From: Nick Rathke <nick@sci.utah.edu>
Date: Tue, 29 Sep 2009 12:27:04 -0600
To: common-user@hadoop.apache.org
Subject: Re: Running Hadoop on cluster with NFS booted systems

Great. I'll look at this fix.

Here is what I got based on Brian's info. "lsof -p <pid>" gave me:

java  12739  root  50r  CHR  1,8  3335  /dev/random
java  12739  root  51r  CHR  1,9  3325  /dev/urandom
.
.
java  12739  root  66r  CHR  1,8  3335  /dev/random

Both devices exist in /dev, and securerandom.source was already set to
securerandom.source=file:/dev/urandom. I have also checked that the
permissions on those files are the same between NFS nodes and local-OS
nodes.

-Nick

Todd Lipcon wrote:
> Yep, this is a common problem. The fix that Brian outlined helps a lot,
> but if you are *really* strapped for random bits, you'll still block.
> This is because even if you've set the random source, it still uses the
> real /dev/random to grab a seed for the PRNG, at least on my system.
>
> On systems where I know I don't care about true randomness, I also use
> this trick:
>
> http://www.chrissearle.org/blog/technical/increase_entropy_26_kernel_linux_box
>
> It's very handy for boxes running Hudson that start and stop multinode
> pseudo-distributed Hadoop clusters regularly.
>
> -Todd
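For reference, the usual form of that trick (an assumption about the linked
post's content; its exact steps may differ) is to feed the kernel entropy
pool from a pseudo-random source with rngd from the rng-tools package:

    # Sketch, assuming rng-tools is installed (flag spellings vary a bit
    # between rng-tools versions). Feeds /dev/urandom back into the kernel
    # entropy pool so reads from /dev/random stop blocking. This forfeits
    # any true-randomness guarantee, so only do it on dev/test boxes, as
    # Todd says.
    rngd -r /dev/urandom -o /dev/random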
> On Tue, Sep 29, 2009 at 10:16 AM, Brian Bockelman wrote:
>
>> Hey Nick,
>>
>> Strange. It appears that the Jetty server has stalled while trying to
>> read from /dev/random. Is it possible that some part of /dev isn't
>> initialized before the datanode is launched?
>>
>> Can you confirm this using "lsof -p <pid>"?
>>
>> I copy/paste a solution I found in a forum via Google below.
>>
>> Brian
>>
>> Edit $JAVA_HOME/jre/lib/security/java.security and change the property:
>>
>>   securerandom.source=file:/dev/random
>>
>> to:
>>
>>   securerandom.source=file:/dev/urandom
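One caveat that fits both Nick's report above (the property was already set
to file:/dev/urandom, yet the process still blocked) and Todd's observation:
on Sun JDKs of this era the seed generator reportedly special-cases the
exact string "file:/dev/urandom" and still seeds from the blocking device,
while "file:/dev/./urandom" is taken at face value. A sketch of applying
that spelling through the JVM property instead of editing java.security,
using the standard conf/hadoop-env.sh hook:

    # In conf/hadoop-env.sh. The "/dev/./urandom" spelling is deliberate:
    # the plain "file:/dev/urandom" form is reportedly special-cased by
    # the JDK's SeedGenerator and falls back to the blocking /dev/random.
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/./urandom"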
>>
>> On Sep 29, 2009, at 11:26 AM, Nick Rathke wrote:
>>
>>> Thanks. Here it is in all of its glory...
>>>
>>> -Nick
>>>
>>> 2009-09-29 09:15:53
>>> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):
>>>
>>> "263851830@qtp0-1" prio=10 tid=0x00002aaaf846a000 nid=0x226b in Object.wait() [0x0000000041d24000]
>>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>      at java.lang.Object.wait(Native Method)
>>>      - waiting on <0x00002aaade3587f8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>      at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
>>>      - locked <0x00002aaade3587f8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>
>>> "1837007962@qtp0-0" prio=10 tid=0x00002aaaf84d4000 nid=0x226a in Object.wait() [0x0000000041b22000]
>>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>      at java.lang.Object.wait(Native Method)
>>>      - waiting on <0x00002aaade3592b8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>      at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
>>>      - locked <0x00002aaade3592b8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>
>>> "refreshUsed-/tmp/hadoop-root/dfs/data" daemon prio=10 tid=0x00002aaaf8456000 nid=0x2269 waiting on condition [0x0000000041c23000]
>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>      at java.lang.Thread.sleep(Native Method)
>>>      at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:80)
>>>      at java.lang.Thread.run(Thread.java:619)
>>>
>>> "RMI TCP Accept-0" daemon prio=10 tid=0x00002aaaf834d800 nid=0x225a runnable [0x000000004171e000]
>>>    java.lang.Thread.State: RUNNABLE
>>>      at java.net.PlainSocketImpl.socketAccept(Native Method)
>>>      at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
>>>      - locked <0x00002aaade358040> (a java.net.SocksSocketImpl)
>>>      at java.net.ServerSocket.implAccept(ServerSocket.java:453)
>>>      at java.net.ServerSocket.accept(ServerSocket.java:421)
>>>      at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
>>>      at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
>>>      at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
>>>      at java.lang.Thread.run(Thread.java:619)
>>>
>>> "Low Memory Detector" daemon prio=10 tid=0x00000000535f5000 nid=0x2259 runnable [0x0000000000000000]
>>>    java.lang.Thread.State: RUNNABLE
>>>
>>> "CompilerThread1" daemon prio=10 tid=0x00000000535f1800 nid=0x2258 waiting on condition [0x0000000000000000]
>>>    java.lang.Thread.State: RUNNABLE
>>>
>>> "CompilerThread0" daemon prio=10 tid=0x00000000535ef000 nid=0x2257 waiting on condition [0x0000000000000000]
>>>    java.lang.Thread.State: RUNNABLE
>>>
>>> "Signal Dispatcher" daemon prio=10 tid=0x00000000535ec800 nid=0x2256 waiting on condition [0x0000000000000000]
>>>    java.lang.Thread.State: RUNNABLE
>>>
>>> "Finalizer" daemon prio=10 tid=0x00000000535cf800 nid=0x2255 in Object.wait() [0x0000000041219000]
>>>    java.lang.Thread.State: WAITING (on object monitor)
>>>      at java.lang.Object.wait(Native Method)
>>>      - waiting on <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
>>>      at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>>>      - locked <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
>>>      at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>>>      at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>>>
>>> "Reference Handler" daemon prio=10 tid=0x00000000535c8000 nid=0x2254 in Object.wait() [0x0000000041118000]
>>>    java.lang.Thread.State: WAITING (on object monitor)
>>>      at java.lang.Object.wait(Native Method)
>>>      - waiting on <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
>>>      at java.lang.Object.wait(Object.java:485)
>>>      at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>>>      - locked <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
>>>
>>> "main" prio=10 tid=0x0000000053554000 nid=0x2245 runnable [0x0000000040208000]
>>>    java.lang.Thread.State: RUNNABLE
>>>      at java.io.FileInputStream.readBytes(Native Method)
>>>      at java.io.FileInputStream.read(FileInputStream.java:199)
>>>      at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>      at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>      - locked <0x00002aaade1e5870> (a java.io.BufferedInputStream)
>>>      at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>      at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>>      at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>      - locked <0x00002aaade1e29f8> (a java.io.BufferedInputStream)
>>>      at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:453)
>>>      at sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:123)
>>>      at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:118)
>>>      at sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:114)
>>>      at sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:171)
>>>      - locked <0x00002aaade1e2500> (a sun.security.provider.SecureRandom)
>>>      at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
>>>      - locked <0x00002aaade1e2830> (a java.security.SecureRandom)
>>>      at java.security.SecureRandom.next(SecureRandom.java:455)
>>>      at java.util.Random.nextLong(Random.java:284)
>>>      at org.mortbay.jetty.servlet.HashSessionIdManager.doStart(HashSessionIdManager.java:139)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaade1e21c0> (a java.lang.Object)
>>>      at org.mortbay.jetty.servlet.AbstractSessionManager.doStart(AbstractSessionManager.java:168)
>>>      at org.mortbay.jetty.servlet.HashSessionManager.doStart(HashSessionManager.java:67)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaade334c00> (a java.lang.Object)
>>>      at org.mortbay.jetty.servlet.SessionHandler.doStart(SessionHandler.java:115)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaade334b18> (a java.lang.Object)
>>>      at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>>>      at org.mortbay.jetty.handler.ContextHandler.startContext(ContextHandler.java:537)
>>>      at org.mortbay.jetty.servlet.Context.startContext(Context.java:136)
>>>      at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1234)
>>>      at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517)
>>>      at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:460)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaade334ab0> (a java.lang.Object)
>>>      at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>>>      at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaade332c30> (a java.lang.Object)
>>>      at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>>>      at org.mortbay.jetty.Server.doStart(Server.java:222)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaab44191a0> (a java.lang.Object)
>>>      at org.apache.hadoop.http.HttpServer.start(HttpServer.java:460)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:375)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
>>>
>>> "VM Thread" prio=10 tid=0x00000000535c1800 nid=0x2253 runnable
>>>
>>> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000005355e000 nid=0x2246 runnable
>>> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000053560000 nid=0x2247 runnable
>>> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000053561800 nid=0x2248 runnable
>>> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000053563800 nid=0x2249 runnable
>>> "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000053565800 nid=0x224a runnable
>>> "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000053567000 nid=0x224b runnable
>>> "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000053569000 nid=0x224c runnable
>>> "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000005356b000 nid=0x224d runnable
>>> "GC task thread#8 (ParallelGC)" prio=10 tid=0x000000005356c800 nid=0x224e runnable
>>> "GC task thread#9 (ParallelGC)" prio=10 tid=0x000000005356e800 nid=0x224f runnable
>>> "GC task thread#10 (ParallelGC)" prio=10 tid=0x0000000053570800 nid=0x2250 runnable
>>> "GC task thread#11 (ParallelGC)" prio=10 tid=0x0000000053572000 nid=0x2251 runnable
>>> "GC task thread#12 (ParallelGC)" prio=10 tid=0x0000000053574000 nid=0x2252 runnable
>>>
>>> "VM Periodic Task Thread" prio=10 tid=0x00002aaaf835f800 nid=0x225b waiting on condition
>>>
>>> JNI global references: 715
>>>
>>> Heap
>>>  PSYoungGen  total 5312K, used 5185K [0x00002aaaddde0000, 0x00002aaade5a0000, 0x00002aaaf2b30000)
>>>   eden space 4416K, 97% used [0x00002aaaddde0000,0x00002aaade210688,0x00002aaade230000)
>>>   from space 896K, 100% used [0x00002aaade320000,0x00002aaade400000,0x00002aaade400000)
>>>   to   space 960K, 0% used [0x00002aaade230000,0x00002aaade230000,0x00002aaade320000)
>>>  PSOldGen    total 5312K, used 1172K [0x00002aaab4330000, 0x00002aaab4860000, 0x00002aaaddde0000)
>>>   object space 5312K, 22% used [0x00002aaab4330000,0x00002aaab44550b8,0x00002aaab4860000)
>>>  PSPermGen   total 21248K, used 13354K [0x00002aaaaef30000, 0x00002aaab03f0000, 0x00002aaab4330000)
>>>   object space 21248K, 62% used [0x00002aaaaef30000,0x00002aaaafc3a818,0x00002aaab03f0000)
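In that dump the "main" thread is the telling one: it is blocked in
SeedGenerator$URLSeedGenerator.getSeedByte while Jetty's HashSessionIdManager
seeds its SecureRandom, which is exactly the /dev/random starvation Brian
suspected. A quick way to confirm a node's entropy pool is the culprit
(standard Linux paths, nothing Hadoop-specific assumed):

    # Bits currently in the kernel entropy pool; values near zero mean
    # reads from /dev/random will block.
    cat /proc/sys/kernel/random/entropy_avail

    # This read blocks until the pool refills -- a direct reproduction of
    # what the DataNode's "main" thread is stuck doing.
    head -c 16 /dev/random | od -An -tx1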
>>>
>>> Brian Bockelman wrote:
>>>
>>>> Hey Nick,
>>>>
>>>> I believe the mailing list stripped out your attachment.
>>>>
>>>> Brian
>>>>
>>>> On Sep 29, 2009, at 10:22 AM, Nick Rathke wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Here is the dump. I looked it over and unfortunately it is pretty
>>>>> meaningless to me at this point. Any help deciphering it would be
>>>>> greatly appreciated.
>>>>>
>>>>> I have also now disabled the IB interface on my 2 test systems;
>>>>> unfortunately that had no impact.
>>>>>
>>>>> -Nick
>>>>>
>>>>> Todd Lipcon wrote:
>>>>>
>>>>>> Hi Nick,
>>>>>>
>>>>>> Figure out the pid of the DataNode process using either "jps" or
>>>>>> straight "ps auxw | grep DataNode", and then kill -QUIT <pid>. That
>>>>>> should cause the node to dump its stack to its stdout. That'll
>>>>>> either end up in the .out file in your logs directory, or on your
>>>>>> console, depending how you started the daemon.
>>>>>>
>>>>>> -Todd
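Todd's procedure, spelled out as commands (a sketch; jps and jstack ship
with the Sun JDK, and the .out location assumes the stock start scripts):

    # Find the DataNode's pid and ask the JVM for a thread dump.
    pid=$(jps | awk '/DataNode/ {print $1}')
    kill -QUIT "$pid"   # dump goes to the daemon's stdout (logs/*.out)

    # Alternatively, jstack prints the same dump to your terminal:
    jstack "$pid"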
>>>>>> On Mon, Sep 28, 2009 at 9:11 PM, Nick Rathke wrote:
>>>>>>
>>>>>>> Hi Todd,
>>>>>>>
>>>>>>> Unfortunately it never returns. It gives good info on a running node:
>>>>>>>
>>>>>>> -bash-3.2# curl http://127.0.0.1:50075/stacks
>>>>>>>
>>>>>>> If I do a stop-all on the master I get
>>>>>>>
>>>>>>>   curl: (52) Empty reply from server
>>>>>>>
>>>>>>> on the stuck node.
>>>>>>>
>>>>>>> If I do this in a browser I can see that it is *trying* to connect.
>>>>>>> If I kill the java processes I get "Server not found", but as long
>>>>>>> as the java processes are running I just get a blank page.
>>>>>>>
>>>>>>> Should I try a TCP dump and see if I can see packets flowing? Would
>>>>>>> that be of any help?
>>>>>>>
>>>>>>> -Nick
>>>>>>>
>>>>>>> Todd Lipcon wrote:
>>>>>>>
>>>>>>>> Hi Nick,
>>>>>>>>
>>>>>>>> Can you curl http://127.0.0.1:50075/stacks on one of the stuck
>>>>>>>> nodes and paste the result?
>>>>>>>>
>>>>>>>> Sometimes that can give an indication as to where things are
>>>>>>>> getting stuck.
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Mon, Sep 28, 2009 at 7:21 PM, Nick Rathke wrote:
>>>>>>>>
>>>>>>>>> FYI, I get the same hanging behavior if I follow the Hadoop quick
>>>>>>>>> start for a single-node baseline configuration (no modified conf
>>>>>>>>> files).
>>>>>>>>>
>>>>>>>>> -Nick
>>>>>>>>>
>>>>>>>>> Brian Bockelman wrote:
>>>>>>>>>
>>>>>>>>>> Hey Nick,
>>>>>>>>>>
>>>>>>>>>> Do you have any error messages appearing in the log files?
>>>>>>>>>>
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> On Sep 28, 2009, at 2:06 PM, Nick Rathke wrote:
>>>>>>>>>>
>>>>>>>>>>> Ted Dunning wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think that the last time you asked this question, the
>>>>>>>>>>>> suggestion was to look at DNS and make sure that everything is
>>>>>>>>>>>> exactly correct in the net-boot configuration. Hadoop is very
>>>>>>>>>>>> sensitive to network routing and naming details. So,
>>>>>>>>>>>>
>>>>>>>>>>>> a) in your net-boot, how are IP addresses assigned?
>>>>>>>>>>>
>>>>>>>>>>> We assign static IPs based on a node's MAC address via DHCP, so
>>>>>>>>>>> that when a node is netbooted or booted with a local OS it gets
>>>>>>>>>>> the same IP and hostname.
>>>>>>>>>>>
>>>>>>>>>>>> b) how are DNS names propagated?
>>>>>>>>>>>
>>>>>>>>>>> Cluster DNS names are mixed in with our facility DNS servers.
>>>>>>>>>>> All nodes have proper forward and reverse DNS lookups.
>>>>>>>>>>>
>>>>>>>>>>>> c) how have you guaranteed that (a) and (b) are exactly
>>>>>>>>>>>> consistent?
>>>>>>>>>>>
>>>>>>>>>>> Host MAC address. I have also manually confirmed this.
>>>>>>>>>>>
>>>>>>>>>>>> d) how have you guaranteed that every node can talk to every
>>>>>>>>>>>> other node, both by name and IP address?
>>>>>>>>>>>
>>>>>>>>>>> Local cluster DNS/DHCP, plus all nodes have all other nodes'
>>>>>>>>>>> hostnames and IPs in /etc/hosts. I have compared all the config
>>>>>>>>>>> files for DNS, DHCP, and /etc/hosts to make sure all information
>>>>>>>>>>> is the same.
>>>>>>>>>>>
>>>>>>>>>>>> e) have you assured yourself that any reverse mapping that
>>>>>>>>>>>> exists is correct?
>>>>>>>>>>>
>>>>>>>>>>> Yes, and tested.
>>>>>>>>>>>
>>>>>>>>>>> One more bit of information: the system boots on a 1Gb network;
>>>>>>>>>>> all other network traffic, i.e. MPI and NFS to data volumes, is
>>>>>>>>>>> via IB. The IB network also has proper forward/reverse DNS
>>>>>>>>>>> entries. IB IP addresses are set up at boot time via a script
>>>>>>>>>>> that takes the host IP and a fixed offset to calculate the
>>>>>>>>>>> address for the IB interface. I have also confirmed that the IB
>>>>>>>>>>> IP addresses match our DNS.
>>>>>>>>>>>
>>>>>>>>>>> -Nick
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 28, 2009 at 9:45 AM, Nick Rathke wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am hoping that someone can help with this issue. I have a
>>>>>>>>>>>>> 64-node cluster that we would like to run Hadoop on; most of
>>>>>>>>>>>>> the nodes are netbooted via NFS.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hadoop runs fine on a node IF the node uses a local OS
>>>>>>>>>>>>> install, but doesn't work when nodes are netbooted. Under
>>>>>>>>>>>>> netboot I can see that the slaves have the correct Java
>>>>>>>>>>>>> processes running, but the Hadoop web pages never show the
>>>>>>>>>>>>> nodes as available. The Hadoop logs on the nodes also show
>>>>>>>>>>>>> that everything is running and started up correctly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On the few nodes that have a local OS installed everything
>>>>>>>>>>>>> works just fine and I can run the test jobs without issue
>>>>>>>>>>>>> (so far).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am using the identical Hadoop install and configuration
>>>>>>>>>>>>> between netbooted and non-netbooted nodes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone encountered this type of issue?
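Since so much of this thread turns on name/IP consistency, a small loop for
double-checking Ted's points from each node may help; this is a sketch that
assumes the stock conf/slaves file (run from the Hadoop install directory)
and the host(1) utility, whose output format can vary:

    # For each slave, resolve the name, map the address back, and make
    # sure the round trip lands on the same hostname.
    while read h; do
      ip=$(host "$h" | awk '/has address/ {print $4; exit}')
      back=$(host "$ip" | awk '/domain name pointer/ {print $5; exit}')
      echo "$h -> $ip -> ${back%.}"
    done < conf/slaves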
--
Nick Rathke
Scientific Computing and Imaging Institute
Sr. Systems Administrator
nick@sci.utah.edu
www.sci.utah.edu
801-587-9933
801-557-3832

"I came I saw I made it possible" Royal Bliss - Here They Come