From: Nick Rathke <nick@sci.utah.edu>
Date: Tue, 29 Sep 2009 12:27:04 -0600
To: common-user@hadoop.apache.org
Subject: Re: Running Hadoop on cluster with NFS booted systems

Great. I'll look at this fix.

Here is what I got based on Brian's info. "lsof -p <pid>" gave me:

java  12739  root  50r  CHR  1,8  3335  /dev/random
java  12739  root  51r  CHR  1,9  3325  /dev/urandom
.
.
java  12739  root  66r  CHR  1,8  3335  /dev/random

Both devices exist in /dev, and securerandom.source was already set to
securerandom.source=file:/dev/urandom. I have also checked that the
permissions on those files are the same between NFS nodes and local-OS
nodes.

-Nick

Todd Lipcon wrote:
> Yep, this is a common problem. The fix that Brian outlined helps a lot,
> but if you are *really* strapped for random bits, you'll still block.
> This is because even if you've set the random source, it still uses the
> real /dev/random to grab a seed for the PRNG, at least on my system.
>
> On systems where I know I don't care about true randomness, I also use
> this trick:
>
> http://www.chrissearle.org/blog/technical/increase_entropy_26_kernel_linux_box
>
> It's very handy for boxes running Hudson that start and stop multinode
> pseudo-distributed Hadoop clusters regularly.
>
> -Todd
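For reference, the usual form of that trick (an assumption about the linked
post's content; its exact steps may differ) is to feed the kernel entropy
pool from a pseudo-random source with rngd from the rng-tools package:

    # Sketch, assuming rng-tools is installed (flag spellings vary a bit
    # between rng-tools versions). Feeds /dev/urandom back into the kernel
    # entropy pool so reads from /dev/random stop blocking. This forfeits
    # any true-randomness guarantee, so only do it on dev/test boxes, as
    # Todd says.
    rngd -r /dev/urandom -o /dev/random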
> On Tue, Sep 29, 2009 at 10:16 AM, Brian Bockelman wrote:
>
>> Hey Nick,
>>
>> Strange. It appears that the Jetty server has stalled while trying to
>> read from /dev/random. Is it possible that some part of /dev isn't
>> initialized before the datanode is launched?
>>
>> Can you confirm this using "lsof -p <pid>"?
>>
>> I copy/paste a solution I found in a forum via Google below.
>>
>> Brian
>>
>> Edit $JAVA_HOME/jre/lib/security/java.security and change the property:
>>
>>   securerandom.source=file:/dev/random
>>
>> to:
>>
>>   securerandom.source=file:/dev/urandom
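One caveat that fits both Nick's report above (the property was already set
to file:/dev/urandom, yet the process still blocked) and Todd's observation:
on Sun JDKs of this era the seed generator reportedly special-cases the
exact string "file:/dev/urandom" and still seeds from the blocking device,
while "file:/dev/./urandom" is taken at face value. A sketch of applying
that spelling through the JVM property instead of editing java.security,
using the standard conf/hadoop-env.sh hook:

    # In conf/hadoop-env.sh. The "/dev/./urandom" spelling is deliberate:
    # the plain "file:/dev/urandom" form is reportedly special-cased by
    # the JDK's SeedGenerator and falls back to the blocking /dev/random.
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/./urandom"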
>>
>> On Sep 29, 2009, at 11:26 AM, Nick Rathke wrote:
>>
>>> Thanks. Here it is in all of its glory...
>>>
>>> -Nick
>>>
>>> 2009-09-29 09:15:53
>>> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):
>>>
>>> "263851830@qtp0-1" prio=10 tid=0x00002aaaf846a000 nid=0x226b in Object.wait() [0x0000000041d24000]
>>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>      at java.lang.Object.wait(Native Method)
>>>      - waiting on <0x00002aaade3587f8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>      at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
>>>      - locked <0x00002aaade3587f8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>
>>> "1837007962@qtp0-0" prio=10 tid=0x00002aaaf84d4000 nid=0x226a in Object.wait() [0x0000000041b22000]
>>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>      at java.lang.Object.wait(Native Method)
>>>      - waiting on <0x00002aaade3592b8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>      at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
>>>      - locked <0x00002aaade3592b8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>>>
>>> "refreshUsed-/tmp/hadoop-root/dfs/data" daemon prio=10 tid=0x00002aaaf8456000 nid=0x2269 waiting on condition [0x0000000041c23000]
>>>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>>>      at java.lang.Thread.sleep(Native Method)
>>>      at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:80)
>>>      at java.lang.Thread.run(Thread.java:619)
>>>
>>> "RMI TCP Accept-0" daemon prio=10 tid=0x00002aaaf834d800 nid=0x225a runnable [0x000000004171e000]
>>>    java.lang.Thread.State: RUNNABLE
>>>      at java.net.PlainSocketImpl.socketAccept(Native Method)
>>>      at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
>>>      - locked <0x00002aaade358040> (a java.net.SocksSocketImpl)
>>>      at java.net.ServerSocket.implAccept(ServerSocket.java:453)
>>>      at java.net.ServerSocket.accept(ServerSocket.java:421)
>>>      at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
>>>      at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
>>>      at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
>>>      at java.lang.Thread.run(Thread.java:619)
>>>
>>> "Low Memory Detector" daemon prio=10 tid=0x00000000535f5000 nid=0x2259 runnable [0x0000000000000000]
>>>    java.lang.Thread.State: RUNNABLE
>>>
>>> "CompilerThread1" daemon prio=10 tid=0x00000000535f1800 nid=0x2258 waiting on condition [0x0000000000000000]
>>>    java.lang.Thread.State: RUNNABLE
>>>
>>> "CompilerThread0" daemon prio=10 tid=0x00000000535ef000 nid=0x2257 waiting on condition [0x0000000000000000]
>>>    java.lang.Thread.State: RUNNABLE
>>>
>>> "Signal Dispatcher" daemon prio=10 tid=0x00000000535ec800 nid=0x2256 waiting on condition [0x0000000000000000]
>>>    java.lang.Thread.State: RUNNABLE
>>>
>>> "Finalizer" daemon prio=10 tid=0x00000000535cf800 nid=0x2255 in Object.wait() [0x0000000041219000]
>>>    java.lang.Thread.State: WAITING (on object monitor)
>>>      at java.lang.Object.wait(Native Method)
>>>      - waiting on <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
>>>      at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>>>      - locked <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
>>>      at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>>>      at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>>>
>>> "Reference Handler" daemon prio=10 tid=0x00000000535c8000 nid=0x2254 in Object.wait() [0x0000000041118000]
>>>    java.lang.Thread.State: WAITING (on object monitor)
>>>      at java.lang.Object.wait(Native Method)
>>>      - waiting on <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
>>>      at java.lang.Object.wait(Object.java:485)
>>>      at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>>>      - locked <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
>>>
>>> "main" prio=10 tid=0x0000000053554000 nid=0x2245 runnable [0x0000000040208000]
>>>    java.lang.Thread.State: RUNNABLE
>>>      at java.io.FileInputStream.readBytes(Native Method)
>>>      at java.io.FileInputStream.read(FileInputStream.java:199)
>>>      at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>      at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>      - locked <0x00002aaade1e5870> (a java.io.BufferedInputStream)
>>>      at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>      at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>>      at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>      - locked <0x00002aaade1e29f8> (a java.io.BufferedInputStream)
>>>      at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:453)
>>>      at sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:123)
>>>      at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:118)
>>>      at sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:114)
>>>      at sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:171)
>>>      - locked <0x00002aaade1e2500> (a sun.security.provider.SecureRandom)
>>>      at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
>>>      - locked <0x00002aaade1e2830> (a java.security.SecureRandom)
>>>      at java.security.SecureRandom.next(SecureRandom.java:455)
>>>      at java.util.Random.nextLong(Random.java:284)
>>>      at org.mortbay.jetty.servlet.HashSessionIdManager.doStart(HashSessionIdManager.java:139)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaade1e21c0> (a java.lang.Object)
>>>      at org.mortbay.jetty.servlet.AbstractSessionManager.doStart(AbstractSessionManager.java:168)
>>>      at org.mortbay.jetty.servlet.HashSessionManager.doStart(HashSessionManager.java:67)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaade334c00> (a java.lang.Object)
>>>      at org.mortbay.jetty.servlet.SessionHandler.doStart(SessionHandler.java:115)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaade334b18> (a java.lang.Object)
>>>      at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>>>      at org.mortbay.jetty.handler.ContextHandler.startContext(ContextHandler.java:537)
>>>      at org.mortbay.jetty.servlet.Context.startContext(Context.java:136)
>>>      at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1234)
>>>      at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517)
>>>      at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:460)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaade334ab0> (a java.lang.Object)
>>>      at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>>>      at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaade332c30> (a java.lang.Object)
>>>      at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>>>      at org.mortbay.jetty.Server.doStart(Server.java:222)
>>>      at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>>      - locked <0x00002aaab44191a0> (a java.lang.Object)
>>>      at org.apache.hadoop.http.HttpServer.start(HttpServer.java:460)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:375)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
>>>
>>> "VM Thread" prio=10 tid=0x00000000535c1800 nid=0x2253 runnable
>>>
>>> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000005355e000 nid=0x2246 runnable
>>> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000053560000 nid=0x2247 runnable
>>> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000053561800 nid=0x2248 runnable
>>> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000053563800 nid=0x2249 runnable
>>> "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000053565800 nid=0x224a runnable
>>> "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000053567000 nid=0x224b runnable
>>> "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000053569000 nid=0x224c runnable
>>> "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000005356b000 nid=0x224d runnable
>>> "GC task thread#8 (ParallelGC)" prio=10 tid=0x000000005356c800 nid=0x224e runnable
>>> "GC task thread#9 (ParallelGC)" prio=10 tid=0x000000005356e800 nid=0x224f runnable
>>> "GC task thread#10 (ParallelGC)" prio=10 tid=0x0000000053570800 nid=0x2250 runnable
>>> "GC task thread#11 (ParallelGC)" prio=10 tid=0x0000000053572000 nid=0x2251 runnable
>>> "GC task thread#12 (ParallelGC)" prio=10 tid=0x0000000053574000 nid=0x2252 runnable
>>>
>>> "VM Periodic Task Thread" prio=10 tid=0x00002aaaf835f800 nid=0x225b waiting on condition
>>>
>>> JNI global references: 715
>>>
>>> Heap
>>>  PSYoungGen  total 5312K, used 5185K [0x00002aaaddde0000, 0x00002aaade5a0000, 0x00002aaaf2b30000)
>>>   eden space 4416K, 97% used [0x00002aaaddde0000,0x00002aaade210688,0x00002aaade230000)
>>>   from space 896K, 100% used [0x00002aaade320000,0x00002aaade400000,0x00002aaade400000)
>>>   to   space 960K, 0% used [0x00002aaade230000,0x00002aaade230000,0x00002aaade320000)
>>>  PSOldGen    total 5312K, used 1172K [0x00002aaab4330000, 0x00002aaab4860000, 0x00002aaaddde0000)
>>>   object space 5312K, 22% used [0x00002aaab4330000,0x00002aaab44550b8,0x00002aaab4860000)
>>>  PSPermGen   total 21248K, used 13354K [0x00002aaaaef30000, 0x00002aaab03f0000, 0x00002aaab4330000)
>>>   object space 21248K, 62% used [0x00002aaaaef30000,0x00002aaaafc3a818,0x00002aaab03f0000)
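In that dump the "main" thread is the telling one: it is blocked in
SeedGenerator$URLSeedGenerator.getSeedByte while Jetty's HashSessionIdManager
seeds its SecureRandom, which is exactly the /dev/random starvation Brian
suspected. A quick way to confirm a node's entropy pool is the culprit
(standard Linux paths, nothing Hadoop-specific assumed):

    # Bits currently in the kernel entropy pool; values near zero mean
    # reads from /dev/random will block.
    cat /proc/sys/kernel/random/entropy_avail

    # This read blocks until the pool refills -- a direct reproduction of
    # what the DataNode's "main" thread is stuck doing.
    head -c 16 /dev/random | od -An -tx1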
>>>
>>> Brian Bockelman wrote:
>>>
>>>> Hey Nick,
>>>>
>>>> I believe the mailing list stripped out your attachment.
>>>>
>>>> Brian
>>>>
>>>> On Sep 29, 2009, at 10:22 AM, Nick Rathke wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Here is the dump. I looked it over and unfortunately it is pretty
>>>>> meaningless to me at this point. Any help deciphering it would be
>>>>> greatly appreciated.
>>>>>
>>>>> I have also now disabled the IB interface on my 2 test systems;
>>>>> unfortunately that had no impact.
>>>>>
>>>>> -Nick
>>>>>
>>>>> Todd Lipcon wrote:
>>>>>
>>>>>> Hi Nick,
>>>>>>
>>>>>> Figure out the pid of the DataNode process using either "jps" or
>>>>>> straight "ps auxw | grep DataNode", and then kill -QUIT <pid>. That
>>>>>> should cause the node to dump its stack to its stdout. That'll
>>>>>> either end up in the .out file in your logs directory, or on your
>>>>>> console, depending how you started the daemon.
>>>>>>
>>>>>> -Todd
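Todd's procedure, spelled out as commands (a sketch; jps and jstack ship
with the Sun JDK, and the .out location assumes the stock start scripts):

    # Find the DataNode's pid and ask the JVM for a thread dump.
    pid=$(jps | awk '/DataNode/ {print $1}')
    kill -QUIT "$pid"   # dump goes to the daemon's stdout (logs/*.out)

    # Alternatively, jstack prints the same dump to your terminal:
    jstack "$pid"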
>>>>>> On Mon, Sep 28, 2009 at 9:11 PM, Nick Rathke wrote:
>>>>>>
>>>>>>> Hi Todd,
>>>>>>>
>>>>>>> Unfortunately it never returns. It gives good info on a running node:
>>>>>>>
>>>>>>> -bash-3.2# curl http://127.0.0.1:50075/stacks
>>>>>>>
>>>>>>> If I do a stop-all on the master I get
>>>>>>>
>>>>>>>   curl: (52) Empty reply from server
>>>>>>>
>>>>>>> on the stuck node.
>>>>>>>
>>>>>>> If I do this in a browser I can see that it is *trying* to connect.
>>>>>>> If I kill the java processes I get "Server not found", but as long
>>>>>>> as the java processes are running I just get a blank page.
>>>>>>>
>>>>>>> Should I try a TCP dump and see if I can see packets flowing? Would
>>>>>>> that be of any help?
>>>>>>>
>>>>>>> -Nick
>>>>>>>
>>>>>>> Todd Lipcon wrote:
>>>>>>>
>>>>>>>> Hi Nick,
>>>>>>>>
>>>>>>>> Can you curl http://127.0.0.1:50075/stacks on one of the stuck
>>>>>>>> nodes and paste the result?
>>>>>>>>
>>>>>>>> Sometimes that can give an indication as to where things are
>>>>>>>> getting stuck.
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Mon, Sep 28, 2009 at 7:21 PM, Nick Rathke wrote:
>>>>>>>>
>>>>>>>>> FYI, I get the same hanging behavior if I follow the Hadoop quick
>>>>>>>>> start for a single-node baseline configuration (no modified conf
>>>>>>>>> files).
>>>>>>>>>
>>>>>>>>> -Nick
>>>>>>>>>
>>>>>>>>> Brian Bockelman wrote:
>>>>>>>>>
>>>>>>>>>> Hey Nick,
>>>>>>>>>>
>>>>>>>>>> Do you have any error messages appearing in the log files?
>>>>>>>>>>
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> On Sep 28, 2009, at 2:06 PM, Nick Rathke wrote:
>>>>>>>>>>
>>>>>>>>>>> Ted Dunning wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think that the last time you asked this question, the
>>>>>>>>>>>> suggestion was to look at DNS and make sure that everything is
>>>>>>>>>>>> exactly correct in the net-boot configuration. Hadoop is very
>>>>>>>>>>>> sensitive to network routing and naming details. So,
>>>>>>>>>>>>
>>>>>>>>>>>> a) in your net-boot, how are IP addresses assigned?
>>>>>>>>>>>
>>>>>>>>>>> We assign static IPs based on a node's MAC address via DHCP, so
>>>>>>>>>>> that when a node is netbooted or booted with a local OS it gets
>>>>>>>>>>> the same IP and hostname.
>>>>>>>>>>>
>>>>>>>>>>>> b) how are DNS names propagated?
>>>>>>>>>>>
>>>>>>>>>>> Cluster DNS names are mixed in with our facility DNS servers.
>>>>>>>>>>> All nodes have proper forward and reverse DNS lookups.
>>>>>>>>>>>
>>>>>>>>>>>> c) how have you guaranteed that (a) and (b) are exactly
>>>>>>>>>>>> consistent?
>>>>>>>>>>>
>>>>>>>>>>> Host MAC address. I have also manually confirmed this.
>>>>>>>>>>>
>>>>>>>>>>>> d) how have you guaranteed that every node can talk to every
>>>>>>>>>>>> other node, both by name and IP address?
>>>>>>>>>>>
>>>>>>>>>>> Local cluster DNS/DHCP, plus all nodes have all other nodes'
>>>>>>>>>>> hostnames and IPs in /etc/hosts. I have compared all the config
>>>>>>>>>>> files for DNS, DHCP, and /etc/hosts to make sure all information
>>>>>>>>>>> is the same.
>>>>>>>>>>>
>>>>>>>>>>>> e) have you assured yourself that any reverse mapping that
>>>>>>>>>>>> exists is correct?
>>>>>>>>>>>
>>>>>>>>>>> Yes, and tested.
>>>>>>>>>>>
>>>>>>>>>>> One more bit of information: the system boots on a 1Gb network;
>>>>>>>>>>> all other network traffic, i.e. MPI and NFS to data volumes, is
>>>>>>>>>>> via IB. The IB network also has proper forward/reverse DNS
>>>>>>>>>>> entries. IB IP addresses are set up at boot time via a script
>>>>>>>>>>> that takes the host IP and a fixed offset to calculate the
>>>>>>>>>>> address for the IB interface. I have also confirmed that the IB
>>>>>>>>>>> IP addresses match our DNS.
>>>>>>>>>>>
>>>>>>>>>>> -Nick
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 28, 2009 at 9:45 AM, Nick Rathke wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am hoping that someone can help with this issue. I have a
>>>>>>>>>>>>> 64-node cluster that we would like to run Hadoop on; most of
>>>>>>>>>>>>> the nodes are netbooted via NFS.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hadoop runs fine on a node IF the node uses a local OS
>>>>>>>>>>>>> install, but doesn't work when nodes are netbooted. Under
>>>>>>>>>>>>> netboot I can see that the slaves have the correct Java
>>>>>>>>>>>>> processes running, but the Hadoop web pages never show the
>>>>>>>>>>>>> nodes as available. The Hadoop logs on the nodes also show
>>>>>>>>>>>>> that everything is running and started up correctly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On the few nodes that have a local OS installed everything
>>>>>>>>>>>>> works just fine and I can run the test jobs without issue
>>>>>>>>>>>>> (so far).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am using the identical Hadoop install and configuration
>>>>>>>>>>>>> between netbooted and non-netbooted nodes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone encountered this type of issue?
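Since so much of this thread turns on name/IP consistency, a small loop for
double-checking Ted's points from each node may help; this is a sketch that
assumes the stock conf/slaves file (run from the Hadoop install directory)
and the host(1) utility, whose output format can vary:

    # For each slave, resolve the name, map the address back, and make
    # sure the round trip lands on the same hostname.
    while read h; do
      ip=$(host "$h" | awk '/has address/ {print $4; exit}')
      back=$(host "$ip" | awk '/domain name pointer/ {print $5; exit}')
      echo "$h -> $ip -> ${back%.}"
    done < conf/slaves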
--
Nick Rathke
Scientific Computing and Imaging Institute
Sr. Systems Administrator
nick@sci.utah.edu
www.sci.utah.edu
801-587-9933
801-557-3832

"I came I saw I made it possible" Royal Bliss - Here They Come