hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: Running Hadoop on cluster with NFS booted systems
Date Tue, 29 Sep 2009 19:01:16 GMT
Hey Nick,

Try this:
cat /proc/sys/kernel/random/entropy_avail

Is it a small number (<300)?

Basically, one way Linux generates entropy is via input from the keyboard. So, as soon as you log into the NFS-booted server, you've given it enough entropy for HDFS to start up.
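
If you want to watch the pool drain and refill while a datanode starts, a quick loop like this works (my off-the-cuff sketch, nothing Hadoop-specific):

while true; do cat /proc/sys/kernel/random/entropy_avail; sleep 1; done

Typing at the console should make the number climb; the datanode's startup should make it drop.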

Here's a relevant-looking link:

http://rackerhacker.com/2007/07/01/check-available-entropy-in-linux/

Brian

On Sep 29, 2009, at 1:27 PM, Nick Rathke wrote:

> Great. I'll look at this fix. Here is what I got based on Brian's info:
>
> lsof -p gave me:
>
> java    12739 root   50r   CHR                1,8              3335 /dev/random
> java    12739 root   51r   CHR                1,9              3325 /dev/urandom
>
> ...
>
> java    12739 root   66r   CHR                1,8              3335 /dev/random
>
> Both do exist in /dev
>
> and securerandom.source was already set to:
>
> securerandom.source=file:/dev/urandom
>
> I have also checked that the permissions on said file are the same between NFS nodes and local OS nodes.
> -Nick
>
>
>
> Todd Lipcon wrote:
>> Yep, this is a common problem. The fix that Brian outlined helps a lot, but if you are *really* strapped for random bits, you'll still block. This is because even if you've set the random source, it still uses the real /dev/random to grab a seed for the PRNG, at least on my system.
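>>
>> One related gotcha, if memory serves: some JDKs special-case the literal path file:/dev/urandom and fall back to the blocking seed source anyway. The usual workaround is the /dev/./urandom spelling, either in java.security:
>>
>> securerandom.source=file:/dev/./urandom
>>
>> or per-JVM on the command line:
>>
>> java -Djava.security.egd=file:/dev/./urandom ...
>>
>> I haven't verified that on this exact JDK, so treat it as a pointer rather than a guarantee.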
>>
>> On systems where I know I don't care about true randomness, I also use this trick:
>>
>> http://www.chrissearle.org/blog/technical/increase_entropy_26_kernel_linux_box
>>
>> It's very handy for boxes running Hudson that start and stop multi-node pseudo-distributed Hadoop clusters regularly.
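>>
>> The gist of that link, as I recall, is to run rngd from rng-tools so it feeds /dev/urandom back into the kernel's entropy pool. Roughly:
>>
>> rngd -r /dev/urandom -o /dev/random -b
>>
>> Fine for throwaway test boxes; a bad idea anywhere you actually care about the quality of your randomness.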
>>
>> -Todd
>>
>> On Tue, Sep 29, 2009 at 10:16 AM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
>>
>>
>>> Hey Nick,
>>>
>>> Strange.  It appears that the Jetty server has stalled while trying to read from /dev/random.  Is it possible that some part of /dev isn't initialized before the datanode is launched?
>>>
>>> Can you confirm this using "lsof -p <process ID>" ?
>>>
>>> I've copy/pasted a solution I found in a forum via Google below.
>>>
>>> Brian
>>>
>>> Edit $JAVA_HOME/jre/lib/security/java.security and change the property:
>>>
>>> securerandom.source=file:/dev/random
>>>
>>> to:
>>>
>>> securerandom.source=file:/dev/urandom
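>>>
>>> If you'd rather script that change across nodes, something like this should do it (an untested sketch; double-check the java.security path for your particular JDK first):
>>>
>>> sed -i 's|securerandom.source=file:/dev/random|securerandom.source=file:/dev/urandom|' $JAVA_HOME/jre/lib/security/java.security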
>>>
>>>
>>> On Sep 29, 2009, at 11:26 AM, Nick Rathke wrote:
>>>
>>>> Thanks.  Here it is in all of its glory...
>>>>
>>>> -Nick
>>>>
>>>>
>>>> 2009-09-29 09:15:53
>>>> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):
>>>>
>>>> "263851830@qtp0-1" prio=10 tid=0x00002aaaf846a000 nid=0x226b in Object.wait() [0x0000000041d24000]
>>>> java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>> at java.lang.Object.wait(Native Method)
>>>> - waiting on <0x00002aaade3587f8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>> at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
>>>> - locked <0x00002aaade3587f8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>>
>>>> "1837007962@qtp0-0" prio=10 tid=0x00002aaaf84d4000 nid=0x226a in Object.wait() [0x0000000041b22000]
>>>> java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>> at java.lang.Object.wait(Native Method)
>>>> - waiting on <0x00002aaade3592b8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>> at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
>>>> - locked <0x00002aaade3592b8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>>
>>>> "refreshUsed-/tmp/hadoop-root/dfs/data" daemon prio=10 tid=0x00002aaaf8456000 nid=0x2269 waiting on condition [0x0000000041c23000]
>>>> java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>> at java.lang.Thread.sleep(Native Method)
>>>> at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:80)
>>>> at java.lang.Thread.run(Thread.java:619)
>>>>
>>>> "RMI TCP Accept-0" daemon prio=10 tid=0x00002aaaf834d800 nid=0x225a runnable [0x000000004171e000]
>>>> java.lang.Thread.State: RUNNABLE
>>>> at java.net.PlainSocketImpl.socketAccept(Native Method)
>>>> at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
>>>> - locked <0x00002aaade358040> (a java.net.SocksSocketImpl)
>>>> at java.net.ServerSocket.implAccept(ServerSocket.java:453)
>>>> at java.net.ServerSocket.accept(ServerSocket.java:421)
>>>> at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
>>>> at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
>>>> at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
>>>> at java.lang.Thread.run(Thread.java:619)
>>>>
>>>> "Low Memory Detector" daemon prio=10 tid=0x00000000535f5000 nid=0x2259 runnable [0x0000000000000000]
>>>> java.lang.Thread.State: RUNNABLE
>>>>
>>>> "CompilerThread1" daemon prio=10 tid=0x00000000535f1800 nid=0x2258 waiting on condition [0x0000000000000000]
>>>> java.lang.Thread.State: RUNNABLE
>>>>
>>>> "CompilerThread0" daemon prio=10 tid=0x00000000535ef000 nid=0x2257 waiting on condition [0x0000000000000000]
>>>> java.lang.Thread.State: RUNNABLE
>>>>
>>>> "Signal Dispatcher" daemon prio=10 tid=0x00000000535ec800 nid=0x2256 waiting on condition [0x0000000000000000]
>>>> java.lang.Thread.State: RUNNABLE
>>>>
>>>> "Finalizer" daemon prio=10 tid=0x00000000535cf800 nid=0x2255 in Object.wait() [0x0000000041219000]
>>>> java.lang.Thread.State: WAITING (on object monitor)
>>>> at java.lang.Object.wait(Native Method)
>>>> - waiting on <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
>>>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>>>> - locked <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
>>>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>>>> at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>>>>
>>>> "Reference Handler" daemon prio=10 tid=0x00000000535c8000 nid=0x2254 in Object.wait() [0x0000000041118000]
>>>> java.lang.Thread.State: WAITING (on object monitor)
>>>> at java.lang.Object.wait(Native Method)
>>>> - waiting on <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
>>>> at java.lang.Object.wait(Object.java:485)
>>>> at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>>>> - locked <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
>>>>
>>>> "main" prio=10 tid=0x0000000053554000 nid=0x2245 runnable [0x0000000040208000]
>>>> java.lang.Thread.State: RUNNABLE
>>>> at java.io.FileInputStream.readBytes(Native Method)
>>>> at java.io.FileInputStream.read(FileInputStream.java:199)
>>>> at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>> - locked <0x00002aaade1e5870> (a java.io.BufferedInputStream)
>>>> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>>> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>> - locked <0x00002aaade1e29f8> (a java.io.BufferedInputStream)
>>>> at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:453)
>>>> at sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:123)
>>>> at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:118)
>>>> at sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:114)
>>>> at sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:171)
>>>> - locked <0x00002aaade1e2500> (a sun.security.provider.SecureRandom)
>>>> at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
>>>> - locked <0x00002aaade1e2830> (a java.security.SecureRandom)
>>>> at java.security.SecureRandom.next(SecureRandom.java:455)
>>>> at java.util.Random.nextLong(Random.java:284)
>>>> at org.mortbay.jetty.servlet.HashSessionIdManager.doStart(HashSessionIdManager.java:139)
>>>> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>> - locked <0x00002aaade1e21c0> (a java.lang.Object)
>>>> at org.mortbay.jetty.servlet.AbstractSessionManager.doStart(AbstractSessionManager.java:168)
>>>> at org.mortbay.jetty.servlet.HashSessionManager.doStart(HashSessionManager.java:67)
>>>> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>> - locked <0x00002aaade334c00> (a java.lang.Object)
>>>> at org.mortbay.jetty.servlet.SessionHandler.doStart(SessionHandler.java:115)
>>>> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>> - locked <0x00002aaade334b18> (a java.lang.Object)
>>>> at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>>>> at org.mortbay.jetty.handler.ContextHandler.startContext(ContextHandler.java:537)
>>>> at org.mortbay.jetty.servlet.Context.startContext(Context.java:136)
>>>> at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1234)
>>>> at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517)
>>>> at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:460)
>>>> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>> - locked <0x00002aaade334ab0> (a java.lang.Object)
>>>> at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>>>> at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
>>>> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>> - locked <0x00002aaade332c30> (a java.lang.Object)
>>>> at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>>>> at org.mortbay.jetty.Server.doStart(Server.java:222)
>>>> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>> - locked <0x00002aaab44191a0> (a java.lang.Object)
>>>> at org.apache.hadoop.http.HttpServer.start(HttpServer.java:460)
>>>> at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:375)
>>>> at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
>>>> at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
>>>> at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
>>>> at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
>>>> at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
>>>>
>>>> "VM Thread" prio=10 tid=0x00000000535c1800 nid=0x2253 runnable
>>>>
>>>> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000005355e000 nid=0x2246 runnable
>>>>
>>>> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000053560000 nid=0x2247 runnable
>>>>
>>>> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000053561800 nid=0x2248 runnable
>>>>
>>>> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000053563800 nid=0x2249 runnable
>>>>
>>>> "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000053565800 nid=0x224a runnable
>>>>
>>>> "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000053567000 nid=0x224b runnable
>>>>
>>>> "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000053569000 nid=0x224c runnable
>>>>
>>>> "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000005356b000 nid=0x224d runnable
>>>>
>>>> "GC task thread#8 (ParallelGC)" prio=10 tid=0x000000005356c800 nid=0x224e runnable
>>>>
>>>> "GC task thread#9 (ParallelGC)" prio=10 tid=0x000000005356e800 nid=0x224f runnable
>>>>
>>>> "GC task thread#10 (ParallelGC)" prio=10 tid=0x0000000053570800 nid=0x2250 runnable
>>>>
>>>> "GC task thread#11 (ParallelGC)" prio=10 tid=0x0000000053572000 nid=0x2251 runnable
>>>>
>>>> "GC task thread#12 (ParallelGC)" prio=10 tid=0x0000000053574000 nid=0x2252 runnable
>>>>
>>>> "VM Periodic Task Thread" prio=10 tid=0x00002aaaf835f800 nid=0x225b waiting on condition
>>>>
>>>> JNI global references: 715
>>>>
>>>> Heap
>>>> PSYoungGen      total 5312K, used 5185K [0x00002aaaddde0000, 0x00002aaade5a0000, 0x00002aaaf2b30000)
>>>> eden space 4416K, 97% used [0x00002aaaddde0000,0x00002aaade210688,0x00002aaade230000)
>>>> from space 896K, 100% used [0x00002aaade320000,0x00002aaade400000,0x00002aaade400000)
>>>> to   space 960K, 0% used [0x00002aaade230000,0x00002aaade230000,0x00002aaade320000)
>>>> PSOldGen        total 5312K, used 1172K [0x00002aaab4330000, 0x00002aaab4860000, 0x00002aaaddde0000)
>>>> object space 5312K, 22% used [0x00002aaab4330000,0x00002aaab44550b8,0x00002aaab4860000)
>>>> PSPermGen       total 21248K, used 13354K [0x00002aaaaef30000, 0x00002aaab03f0000, 0x00002aaab4330000)
>>>> object space 21248K, 62% used [0x00002aaaaef30000,0x00002aaaafc3a818,0x00002aaab03f0000)
>>>>
>>>>
>>>>
>>>>
>>>> Brian Bockelman wrote:
>>>>
>>>>
>>>>> Hey Nick,
>>>>>
>>>>> I believe the mailing list stripped out your attachment.
>>>>>
>>>>> Brian
>>>>>
>>>>> On Sep 29, 2009, at 10:22 AM, Nick Rathke wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Here is the dump. I looked it over and unfortunately it is pretty meaningless to me at this point. Any help deciphering it would be greatly appreciated.
>>>>>>
>>>>>> I have also now disabled the IB interface on my 2 test systems; unfortunately that had no impact.
>>>>>>
>>>>>> -Nick
>>>>>>
>>>>>>
>>>>>> Todd Lipcon wrote:
>>>>>>
>>>>>>
>>>>>>> Hi Nick,
>>>>>>>
>>>>>>> Figure out the pid of the DataNode process using either "jps" or straight "ps auxw | grep DataNode", and then kill -QUIT <pid>. That should cause the node to dump its stack to its stdout. That'll either end up in the .out file in your logs directory, or on your console, depending how you started the daemon.
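>>>>>>>
>>>>>>> Something along these lines, in other words (a rough sketch; adjust the grep pattern and log path to your setup):
>>>>>>>
>>>>>>> PID=$(jps | awk '/DataNode/ {print $1}')
>>>>>>> kill -QUIT $PID
>>>>>>> tail -100 $HADOOP_HOME/logs/*datanode*.out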
>>>>>>>
>>>>>>> -Todd
>>>>>>>
>>>>>>> On Mon, Sep 28, 2009 at 9:11 PM, Nick Rathke <nick@sci.utah.edu> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi Todd,
>>>>>>>>
>>>>>>>> Unfortunately it never returns. Gives good info on a running node.
>>>>>>>>
>>>>>>>> -bash-3.2# curl http://127.0.0.1:50075/stacks
>>>>>>>>
>>>>>>>> If I do a stop-all on the master I get
>>>>>>>>
>>>>>>>> curl: (52) Empty reply from server
>>>>>>>>
>>>>>>>> on the stuck node.
>>>>>>>>
>>>>>>>> If I do this in a browser I can see that it is *trying* to connect; if I kill the java process I get "Server not found", but as long as the java processes are running I just get a blank page.
>>>>>>>>
>>>>>>>> Should I try a tcpdump and see if I can see packets flowing? Would that be of any help?
>>>>>>>>
>>>>>>>> -Nick
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Todd Lipcon wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi Nick,
>>>>>>>>>
>>>>>>>>> Can you curl http://127.0.0.1:50075/stacks on one of the stuck nodes and paste the result?
>>>>>>>>>
>>>>>>>>> Sometimes that can give an indication as to where things are getting stuck.
>>>>>>>>>
>>>>>>>>> -Todd
>>>>>>>>>
>>>>>>>>> On Mon, Sep 28, 2009 at 7:21 PM, Nick Rathke <nick@sci.utah.edu> wrote:
>>>>>>>>>
>>>>>>>>>> FYI I get the same hanging behavior if I follow the Hadoop quick start for a single-node baseline configuration (no modified conf files).
>>>>>>>>>>
>>>>>>>>>> -Nick
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Brian Bockelman wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>
>>>>>>>>>>> Do you have any error messages appearing in the log files?
>>>>>>>>>>>
>>>>>>>>>>> Brian
>>>>>>>>>>>
>>>>>>>>>>> On Sep 28, 2009, at 2:06 PM, Nick Rathke wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ted Dunning wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think that the last time you asked this question, the suggestion was to look at DNS and make sure that everything is exactly correct in the net-boot configuration. Hadoop is very sensitive to network routing and naming details.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So,
>>>>>>>>>>>>>
>>>>>>>>>>>>> a) in your net-boot, how are IP addresses assigned?
>>>>>>>>>>>>
>>>>>>>>>>>> We assign static IPs based on a node's MAC address via DHCP, so that when a node is netbooted or booted with a local OS it gets the same IP and hostname.
>>>>>>>>>>>>
>>>>>>>>>>>>> b) how are DNS names propagated?
>>>>>>>>>>>>
>>>>>>>>>>>> Cluster DNS names are mixed in with our facility DNS servers. All nodes have proper forward and reverse DNS lookups.
>>>>>>>>>>>>
>>>>>>>>>>>>> c) how have you guaranteed that (a) and (b) are exactly consistent?
>>>>>>>>>>>>
>>>>>>>>>>>> Host MAC address. I have also manually confirmed this.
>>>>>>>>>>>>
>>>>>>>>>>>>> d) how have you guaranteed that every node can talk to every other node both by name and IP address?
>>>>>>>>>>>>
>>>>>>>>>>>> Local cluster DNS / DHCP, plus all nodes have all the other nodes' host names and IPs in /etc/hosts. I have compared all the config files for DNS / DHCP / and /etc/hosts to make sure all information is the same.
>>>>>>>>>>>>
>>>>>>>>>>>>> e) have you assured yourself that any reverse mapping that exists is correct?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, and tested.
>>>>>>>>>>>>
>>>>>>>>>>>> One more bit of information: the system boots on a 1Gb network; all other network traffic, i.e. MPI and NFS to data volumes, goes via IB. The IB network also has proper forward and reverse DNS entries. IB IP addresses are set up at boot time via a script that takes the host IP and a fixed offset to calculate the address for the IB interface. I have also confirmed that the IB IP addresses match our DNS.
>>>>>>>>>>>>
>>>>>>>>>>>> -Nick
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 28, 2009 at 9:45 AM, Nick Rathke <nick@sci.utah.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am hoping that someone can help with this issue. I have a 64 node cluster that we would like to run Hadoop on; most of the nodes are netbooted via NFS.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hadoop runs fine on nodes IF the node uses a local OS install, but doesn't work when nodes are netbooted. Under netboot I can see that the slaves have the correct Java processes running, but the Hadoop web pages never show the nodes as available. The Hadoop logs on the nodes also show that everything is running and started up correctly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On the few nodes that have a local OS installed everything works just fine and I can run the test jobs without issue (so far).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am using the identical Hadoop install and configuration between netbooted nodes and non-netbooted nodes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Has anyone encountered this type of issue?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nick Rathke
>>>>>>>>>>>> Scientific Computing and Imaging Institute
>>>>>>>>>>>> Sr. Systems Administrator
>>>>>>>>>>>> nick@sci.utah.edu
>>>>>>>>>>>> www.sci.utah.edu
>>>>>>>>>>>> 801-587-9933
>>>>>>>>>>>> 801-557-3832
>>>>>>>>>>>>
>>>>>>>>>>>> "I came I saw I made it possible" Royal Bliss - Here They Come
>>>>>>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> Nick Rathke
>>>>>> Scientific Computing and Imaging Institute
>>>>>> Sr. Systems Administrator
>>>>>> nick@sci.utah.edu
>>>>>> www.sci.utah.edu
>>>>>> 801-587-9933
>>>>>> 801-557-3832
>>>>>>
>>>>>> "I came I saw I made it possible" Royal Bliss - Here They Come
>>>>>>
>>>>>>
>>>>>
>>>> --
>>>> Nick Rathke
>>>> Scientific Computing and Imaging Institute
>>>> Sr. Systems Administrator
>>>> nick@sci.utah.edu
>>>> www.sci.utah.edu
>>>> 801-587-9933
>>>> 801-557-3832
>>>>
>>>> "I came I saw I made it possible" Royal Bliss - Here They Come
>>>>
>>>>
>>>
>>
>>
>
>
> -- 
> Nick Rathke
> Scientific Computing and Imaging Institute
> Sr. Systems Administrator
> nick@sci.utah.edu
> www.sci.utah.edu
> 801-587-9933
> 801-557-3832
>
> "I came I saw I made it possible" Royal Bliss - Here They Come

