hadoop-common-user mailing list archives

From: Nick Rathke <n...@sci.utah.edu>
Subject: Re: Running Hadoop on cluster with NFS booted systems
Date: Tue, 29 Sep 2009 21:45:16 GMT
Hi Brian / Todd,

-bash-3.2# cat /proc/sys/kernel/random/entropy_avail
128

So I did

 rngd -r /dev/urandom -o /dev/random -f -t 1 &

and it **seems** to be working. The web page now shows the nodes, and the
logs indicate that the clients have started correctly, but I have not yet
tried to run any jobs.
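
A quick before/after check along these lines (just a sketch; it assumes
rng-tools is installed and reuses the same rngd flags as above):

 cat /proc/sys/kernel/random/entropy_avail   # low (~128): seeding will block
 rngd -r /dev/urandom -o /dev/random -f -t 1 &
 sleep 2
 cat /proc/sys/kernel/random/entropy_avail   # should now be much higher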

Thanks for all your help!!

-Nick

Brian Bockelman wrote:
> Hey Nick,
>
> Try this:
> cat /proc/sys/kernel/random/entropy_avail
>
> Is it a small number (<300)?
>
> Basically, one way Linux generates entropy is via input from the 
> keyboard.  So, as soon as you log into the NFS booted server, you've 
> given it enough entropy for HDFS to start up.
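
(One way to watch the pool in real time while a datanode starts -- a sketch;
watch comes from procps on most distributions:)

 watch -n 1 cat /proc/sys/kernel/random/entropy_avail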
>
> Here's a relevant-looking link:
>
> http://rackerhacker.com/2007/07/01/check-available-entropy-in-linux/
>
> Brian
>
> On Sep 29, 2009, at 1:27 PM, Nick Rathke wrote:
>
>> Great. I'll look at this fix. Here is what I got based on Brian's info.
>>
>> lsof -p gave me:
>>
>> java    12739 root   50r   CHR        1,8      3335   /dev/random
>> java    12739 root   51r   CHR        1,9      3325   /dev/urandom
>> ...
>> java    12739 root   66r   CHR        1,8      3335   /dev/random
>> Both do exist in /dev
>>
>> and securerandom.source was already set to
>>
>> securerandom.source=file:/dev/urandom
>>
>> I have also checked that the permissions on said file are the same
>> between the NFS nodes and the local-OS nodes.
>> -Nick
>>
>>
>>
>> Todd Lipcon wrote:
>>> Yep, this is a common problem. The fix that Brian outlined helps a 
>>> lot, but
>>> if you are *really* strapped for random bits, you'll still block. 
>>> This is
>>> because even if you've set the random source, it still uses the real
>>> /dev/random to grab a seed for the prng, at least on my system.
>>>
>>> On systems where I know I don't care about true randomness, I also 
>>> use this
>>> trick:
>>>
>>> http://www.chrissearle.org/blog/technical/increase_entropy_26_kernel_linux_box

>>>
>>>
>>> It's very handy for boxes running Hudson that start and stop multi-node
>>> pseudo-distributed Hadoop clusters regularly.
>>>
>>> -Todd
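
A commonly suggested companion to the fix above, for exactly the seeding
problem Todd describes, is to point the JVM's seed source at urandom via the
java.security.egd system property (a sketch only -- the hadoop-env.sh line
below is hypothetical, and the "/dev/./urandom" spelling is used because
some older Sun JDKs special-case the literal string file:/dev/urandom):

 # hypothetical addition to conf/hadoop-env.sh
 export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/./urandom"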
>>>
>>> On Tue, Sep 29, 2009 at 10:16 AM, Brian Bockelman <bbockelm@cse.unl.edu>
>>> wrote:
>>>
>>>
>>>> Hey Nick,
>>>>
>>>> Strange.  It appears that the Jetty server has stalled while trying 
>>>> to read
>>>> from /dev/random.  Is it possible that some part of /dev isn't 
>>>> initialized
>>>> before the datanode is launched?
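
The stall can also be reproduced outside Java (a sketch): a direct read of
/dev/random blocks whenever the entropy pool is empty, while /dev/urandom
never does.

 head -c 16 /dev/random | od -An -tx1    # blocks until the kernel has entropy
 head -c 16 /dev/urandom | od -An -tx1   # returns immediately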
>>>>
>>>> Can you confirm this using "lsof -p <process ID>" ?
>>>>
>>>> I've copied below a solution I found in a forum via Google.
>>>>
>>>> Brian
>>>>
>>>> Edit $JAVA_HOME/jre/lib/security/java.security and change the 
>>>> property:
>>>>
>>>> securerandom.source=file:/dev/random
>>>>
>>>> to:
>>>>
>>>> securerandom.source=file:/dev/urandom
>>>>
>>>>
>>>> On Sep 29, 2009, at 11:26 AM, Nick Rathke wrote:
>>>>
>>>>> Thanks.  Here it is in all of its glory...
>>>>>
>>>>> -Nick
>>>>>
>>>>>
>>>>> 2009-09-29 09:15:53
>>>>> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):
>>>>>
>>>>> "263851830@qtp0-1" prio=10 tid=0x00002aaaf846a000 nid=0x226b in Object.wait() [0x0000000041d24000]
>>>>>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>>>     at java.lang.Object.wait(Native Method)
>>>>>     - waiting on <0x00002aaade3587f8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>>>     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
>>>>>     - locked <0x00002aaade3587f8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>>>
>>>>> "1837007962@qtp0-0" prio=10 tid=0x00002aaaf84d4000 nid=0x226a in Object.wait() [0x0000000041b22000]
>>>>>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>>>     at java.lang.Object.wait(Native Method)
>>>>>     - waiting on <0x00002aaade3592b8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>>>     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
>>>>>     - locked <0x00002aaade3592b8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>>>
>>>>> "refreshUsed-/tmp/hadoop-root/dfs/data" daemon prio=10 tid=0x00002aaaf8456000 nid=0x2269 waiting on condition [0x0000000041c23000]
>>>>>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>>>     at java.lang.Thread.sleep(Native Method)
>>>>>     at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:80)
>>>>>     at java.lang.Thread.run(Thread.java:619)
>>>>>
>>>>> "RMI TCP Accept-0" daemon prio=10 tid=0x00002aaaf834d800 nid=0x225a runnable [0x000000004171e000]
>>>>>   java.lang.Thread.State: RUNNABLE
>>>>>     at java.net.PlainSocketImpl.socketAccept(Native Method)
>>>>>     at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
>>>>>     - locked <0x00002aaade358040> (a java.net.SocksSocketImpl)
>>>>>     at java.net.ServerSocket.implAccept(ServerSocket.java:453)
>>>>>     at java.net.ServerSocket.accept(ServerSocket.java:421)
>>>>>     at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
>>>>>     at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
>>>>>     at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
>>>>>     at java.lang.Thread.run(Thread.java:619)
>>>>>
>>>>> "Low Memory Detector" daemon prio=10 tid=0x00000000535f5000 nid=0x2259 runnable [0x0000000000000000]
>>>>>   java.lang.Thread.State: RUNNABLE
>>>>>
>>>>> "CompilerThread1" daemon prio=10 tid=0x00000000535f1800 nid=0x2258 waiting on condition [0x0000000000000000]
>>>>>   java.lang.Thread.State: RUNNABLE
>>>>>
>>>>> "CompilerThread0" daemon prio=10 tid=0x00000000535ef000 nid=0x2257 waiting on condition [0x0000000000000000]
>>>>>   java.lang.Thread.State: RUNNABLE
>>>>>
>>>>> "Signal Dispatcher" daemon prio=10 tid=0x00000000535ec800 nid=0x2256 waiting on condition [0x0000000000000000]
>>>>>   java.lang.Thread.State: RUNNABLE
>>>>>
>>>>> "Finalizer" daemon prio=10 tid=0x00000000535cf800 nid=0x2255 in Object.wait() [0x0000000041219000]
>>>>>   java.lang.Thread.State: WAITING (on object monitor)
>>>>>     at java.lang.Object.wait(Native Method)
>>>>>     - waiting on <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
>>>>>     at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>>>>>     - locked <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
>>>>>     at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>>>>>     at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>>>>>
>>>>> "Reference Handler" daemon prio=10 tid=0x00000000535c8000 nid=0x2254 in Object.wait() [0x0000000041118000]
>>>>>   java.lang.Thread.State: WAITING (on object monitor)
>>>>>     at java.lang.Object.wait(Native Method)
>>>>>     - waiting on <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
>>>>>     at java.lang.Object.wait(Object.java:485)
>>>>>     at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>>>>>     - locked <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
>>>>>
>>>>> "main" prio=10 tid=0x0000000053554000 nid=0x2245 runnable [0x0000000040208000]
>>>>>   java.lang.Thread.State: RUNNABLE
>>>>>     at java.io.FileInputStream.readBytes(Native Method)
>>>>>     at java.io.FileInputStream.read(FileInputStream.java:199)
>>>>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>>     - locked <0x00002aaade1e5870> (a java.io.BufferedInputStream)
>>>>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>>     - locked <0x00002aaade1e29f8> (a java.io.BufferedInputStream)
>>>>>     at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:453)
>>>>>     at sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:123)
>>>>>     at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:118)
>>>>>     at sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:114)
>>>>>     at sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:171)
>>>>>     - locked <0x00002aaade1e2500> (a sun.security.provider.SecureRandom)
>>>>>     at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
>>>>>     - locked <0x00002aaade1e2830> (a java.security.SecureRandom)
>>>>>     at java.security.SecureRandom.next(SecureRandom.java:455)
>>>>>     at java.util.Random.nextLong(Random.java:284)
>>>>>     at org.mortbay.jetty.servlet.HashSessionIdManager.doStart(HashSessionIdManager.java:139)
>>>>>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>>>     - locked <0x00002aaade1e21c0> (a java.lang.Object)
>>>>>     at org.mortbay.jetty.servlet.AbstractSessionManager.doStart(AbstractSessionManager.java:168)
>>>>>     at org.mortbay.jetty.servlet.HashSessionManager.doStart(HashSessionManager.java:67)
>>>>>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>>>     - locked <0x00002aaade334c00> (a java.lang.Object)
>>>>>     at org.mortbay.jetty.servlet.SessionHandler.doStart(SessionHandler.java:115)
>>>>>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>>>     - locked <0x00002aaade334b18> (a java.lang.Object)
>>>>>     at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>>>>>     at org.mortbay.jetty.handler.ContextHandler.startContext(ContextHandler.java:537)
>>>>>     at org.mortbay.jetty.servlet.Context.startContext(Context.java:136)
>>>>>     at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1234)
>>>>>     at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517)
>>>>>     at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:460)
>>>>>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>>>     - locked <0x00002aaade334ab0> (a java.lang.Object)
>>>>>     at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>>>>>     at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
>>>>>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>>>     - locked <0x00002aaade332c30> (a java.lang.Object)
>>>>>     at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>>>>>     at org.mortbay.jetty.Server.doStart(Server.java:222)
>>>>>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>>>     - locked <0x00002aaab44191a0> (a java.lang.Object)
>>>>>     at org.apache.hadoop.http.HttpServer.start(HttpServer.java:460)
>>>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:375)
>>>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
>>>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
>>>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
>>>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
>>>>>     at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
>>>>>
>>>>> "VM Thread" prio=10 tid=0x00000000535c1800 nid=0x2253 runnable
>>>>>
>>>>> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000005355e000 nid=0x2246 runnable
>>>>>
>>>>> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000053560000 nid=0x2247 runnable
>>>>>
>>>>> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000053561800 nid=0x2248 runnable
>>>>>
>>>>> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000053563800 nid=0x2249 runnable
>>>>>
>>>>> "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000053565800 nid=0x224a runnable
>>>>>
>>>>> "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000053567000 nid=0x224b runnable
>>>>>
>>>>> "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000053569000 nid=0x224c runnable
>>>>>
>>>>> "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000005356b000 nid=0x224d runnable
>>>>>
>>>>> "GC task thread#8 (ParallelGC)" prio=10 tid=0x000000005356c800 nid=0x224e runnable
>>>>>
>>>>> "GC task thread#9 (ParallelGC)" prio=10 tid=0x000000005356e800 nid=0x224f runnable
>>>>>
>>>>> "GC task thread#10 (ParallelGC)" prio=10 tid=0x0000000053570800 nid=0x2250 runnable
>>>>>
>>>>> "GC task thread#11 (ParallelGC)" prio=10 tid=0x0000000053572000 nid=0x2251 runnable
>>>>>
>>>>> "GC task thread#12 (ParallelGC)" prio=10 tid=0x0000000053574000 nid=0x2252 runnable
>>>>>
>>>>> "VM Periodic Task Thread" prio=10 tid=0x00002aaaf835f800 nid=0x225b waiting on condition
>>>>>
>>>>> JNI global references: 715
>>>>>
>>>>> Heap
>>>>>  PSYoungGen      total 5312K, used 5185K [0x00002aaaddde0000, 0x00002aaade5a0000, 0x00002aaaf2b30000)
>>>>>   eden space 4416K, 97% used [0x00002aaaddde0000,0x00002aaade210688,0x00002aaade230000)
>>>>>   from space 896K, 100% used [0x00002aaade320000,0x00002aaade400000,0x00002aaade400000)
>>>>>   to   space 960K, 0% used [0x00002aaade230000,0x00002aaade230000,0x00002aaade320000)
>>>>>  PSOldGen        total 5312K, used 1172K [0x00002aaab4330000, 0x00002aaab4860000, 0x00002aaaddde0000)
>>>>>   object space 5312K, 22% used [0x00002aaab4330000,0x00002aaab44550b8,0x00002aaab4860000)
>>>>>  PSPermGen       total 21248K, used 13354K [0x00002aaaaef30000, 0x00002aaab03f0000, 0x00002aaab4330000)
>>>>>   object space 21248K, 62% used [0x00002aaaaef30000,0x00002aaaafc3a818,0x00002aaab03f0000)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Brian Bockelman wrote:
>>>>>
>>>>>
>>>>>> Hey Nick,
>>>>>>
>>>>>> I believe the mailing list stripped out your attachment.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> On Sep 29, 2009, at 10:22 AM, Nick Rathke wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Here is the dump. I looked it over and unfortunately it is pretty
>>>>>>> meaningless to me at this point. Any help deciphering it would be
>>>>>>> greatly appreciated.
>>>>>>>
>>>>>>> I have also now disabled the IB interface on my 2 test systems;
>>>>>>> unfortunately, that had no impact.
>>>>>>>
>>>>>>> -Nick
>>>>>>>
>>>>>>>
>>>>>>> Todd Lipcon wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi Nick,
>>>>>>>>
>>>>>>>> Figure out the pid of the DataNode process using either "jps" or
>>>>>>>> straight "ps auxw | grep DataNode", and then kill -QUIT <pid>. That
>>>>>>>> should cause the node to dump its stack to its stdout. That'll
>>>>>>>> either end up in the .out file in your logs directory, or on your
>>>>>>>> console, depending on how you started the daemon.
>>>>>>>>
>>>>>>>> -Todd
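
If the JDK's jstack tool is on the PATH, the same dump can also be captured
straight to a file (a sketch, not part of Todd's instructions; <pid> is the
DataNode's process ID):

 jstack <pid> > /tmp/datanode-stack.txt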
>>>>>>>>
>>>>>>>> On Mon, Sep 28, 2009 at 9:11 PM, Nick Rathke <nick@sci.utah.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi Todd,
>>>>>>>>>
>>>>>>>>> Unfortunately it never returns. It gives good info on a running
>>>>>>>>> node.
>>>>>>>>>
>>>>>>>>> -bash-3.2# curl http://127.0.0.1:50075/stacks
>>>>>>>>>
>>>>>>>>> If I do a stop-all on the master I get
>>>>>>>>>
>>>>>>>>> curl: (52) Empty reply from server
>>>>>>>>>
>>>>>>>>> on the stuck node.
>>>>>>>>>
>>>>>>>>> If I do this in a browser I can see that it is **trying** to
>>>>>>>>> connect; if I kill the java process I get "Server not found", but
>>>>>>>>> as long as the java processes are running I just get a black page.
>>>>>>>>>
>>>>>>>>> Should I try a TCP dump and see if I can see packets flowing?
>>>>>>>>> Would that be of any help?
>>>>>>>>>
>>>>>>>>> -Nick
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Todd Lipcon wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hi Nick,
>>>>>>>>>>
>>>>>>>>>> Can you curl http://127.0.0.1:50075/stacks on one of the stuck
>>>>>>>>>> nodes and paste the result?
>>>>>>>>>>
>>>>>>>>>> Sometimes that can give an indication as to where things are
>>>>>>>>>> getting stuck.
>>>>>>>>>>
>>>>>>>>>> -Todd
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 28, 2009 at 7:21 PM, Nick Rathke <nick@sci.utah.edu>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> FYI, I get the same hanging behavior if I follow the Hadoop
>>>>>>>>>>> quick start for a single-node baseline configuration (no
>>>>>>>>>>> modified conf files).
>>>>>>>>>>>
>>>>>>>>>>> -Nick
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Brian Bockelman wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have any error messages appearing in the log files?
>>>>>>>>>>>>
>>>>>>>>>>>> Brian
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 28, 2009, at 2:06 PM, Nick Rathke wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ted Dunning wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think that the last time you asked this question, the
>>>>>>>>>>>>>> suggestion was to look at DNS and make sure that everything is
>>>>>>>>>>>>>> exactly correct in the net-boot configuration.  Hadoop is very
>>>>>>>>>>>>>> sensitive to network routing and naming details.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> a) in your net-boot, how are IP addresses assigned?
>>>>>>>>>>>>>
>>>>>>>>>>>>> We assign static IPs based on a node's MAC address via DHCP, so
>>>>>>>>>>>>> that when a node is netbooted or booted with a local OS it gets
>>>>>>>>>>>>> the same IP and hostname.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> b) how are DNS names propagated?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cluster DNS names are mixed in with our facility DNS servers.
>>>>>>>>>>>>> All nodes have proper forward and reverse DNS lookups.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> c) how have you guaranteed that (a) and (b) are exactly
>>>>>>>>>>>>>> consistent?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Host MAC address. I also have manually confirmed this.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> d) how have you guaranteed that every node can talk to every
>>>>>>>>>>>>>> other node both by name and IP address?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Local cluster DNS / DHCP, plus all nodes have all other nodes'
>>>>>>>>>>>>> host names and IPs in /etc/hosts. I have compared all the
>>>>>>>>>>>>> config files for DNS / DHCP / /etc/hosts to make sure all the
>>>>>>>>>>>>> information is the same.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> e) have you assured yourself that any reverse mapping that
>>>>>>>>>>>>>> exists is correct?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, and tested.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One more bit of information: the system boots on a 1Gb network;
>>>>>>>>>>>>> all other network traffic, i.e. MPI and NFS to the data
>>>>>>>>>>>>> volumes, goes via IB. The IB network also has proper
>>>>>>>>>>>>> forward/backward DNS entries. IB IP addresses are set up at
>>>>>>>>>>>>> boot time via a script that takes the host IP and a fixed
>>>>>>>>>>>>> offset to calculate the address for the IB interface. I have
>>>>>>>>>>>>> also confirmed that the IB IP addresses match our DNS.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Nick
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 28, 2009 at 9:45 AM, Nick Rathke
>>>>>>>>>>>>>> <nick@sci.utah.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am hoping that someone can help with this issue. I have a
>>>>>>>>>>>>>>> 64 node cluster that we would like to run Hadoop on; most of
>>>>>>>>>>>>>>> the nodes are netbooted via NFS.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hadoop runs fine on nodes IF the node uses a local OS
>>>>>>>>>>>>>>> install, but doesn't work when nodes are netbooted. Under
>>>>>>>>>>>>>>> netboot I can see that the slaves have the correct Java
>>>>>>>>>>>>>>> processes running, but the Hadoop web pages never show the
>>>>>>>>>>>>>>> nodes as available. The Hadoop logs on the nodes also show
>>>>>>>>>>>>>>> that everything is running and started up correctly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On the few nodes that have a local OS installed everything
>>>>>>>>>>>>>>> works just fine and I can run the test jobs without issue
>>>>>>>>>>>>>>> (so far).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am using the identical Hadoop install and configuration
>>>>>>>>>>>>>>> between netbooted nodes and non-netbooted nodes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Has anyone encountered this type of issue?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nick Rathke
>>>>>>>>>>>>> Scientific Computing and Imaging Institute
>>>>>>>>>>>>> Sr. Systems Administrator
>>>>>>>>>>>>> nick@sci.utah.edu
>>>>>>>>>>>>> www.sci.utah.edu
>>>>>>>>>>>>> 801-587-9933
>>>>>>>>>>>>> 801-557-3832
>>>>>>>>>>>>>
>>>>>>>>>>>>> "I came I saw I made it possible" Royal Bliss - Here They Come
>>>>>>>>>>>>>
>>>>>>>>
>>>>>>> -- 
>>>>>>> Nick Rathke
>>>>>>> Scientific Computing and Imaging Institute
>>>>>>> Sr. Systems Administrator
>>>>>>> nick@sci.utah.edu
>>>>>>> www.sci.utah.edu
>>>>>>> 801-587-9933
>>>>>>> 801-557-3832
>>>>>>>
>>>>>>> "I came I saw I made it possible" Royal Bliss - Here They Come
>>>>>>>
>>>>>>>
>>>>>>
>>>>> -- 
>>>>> Nick Rathke
>>>>> Scientific Computing and Imaging Institute
>>>>> Sr. Systems Administrator
>>>>> nick@sci.utah.edu
>>>>> www.sci.utah.edu
>>>>> 801-587-9933
>>>>> 801-557-3832
>>>>>
>>>>> "I came I saw I made it possible" Royal Bliss - Here They Come
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>> -- 
>> Nick Rathke
>> Scientific Computing and Imaging Institute
>> Sr. Systems Administrator
>> nick@sci.utah.edu
>> www.sci.utah.edu
>> 801-587-9933
>> 801-557-3832
>>
>> "I came I saw I made it possible" Royal Bliss - Here They Come
>


-- 
Nick Rathke
Scientific Computing and Imaging Institute
Sr. Systems Administrator
nick@sci.utah.edu
www.sci.utah.edu
801-587-9933
801-557-3832

"I came I saw I made it possible" Royal Bliss - Here They Come 

