giraph-user mailing list archives

From Pavan Kumar A <pava...@outlook.com>
Subject RE: [Solved] Giraph job hangs indefinitely and is eventually killed by JobTracker
Date Tue, 08 Apr 2014 02:56:04 GMT
Hi Vikesh,
It seems that you are trying to run benchmarks on Giraph. We made a lot of improvements in 1.1.0-SNAPSHOT (though it is not released publicly in Maven; at Facebook we run all our applications on the snapshot version). So you can pull the latest trunk of Giraph:

git clone https://git-wip-us.apache.org/repos/asf/giraph.git

and then try running some applications.

[You are correct, we store hostname-taskid mappings at the beginning of the run, so you can see such failures.]
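A minimal sketch of pulling and building the snapshot. The Maven profile name below is an assumption, not something stated in this thread; list the profiles that match your Hadoop version with `mvn help:all-profiles` before building:

```shell
# Clone the Giraph trunk (1.1.0-SNAPSHOT at the time of writing)
git clone https://git-wip-us.apache.org/repos/asf/giraph.git
cd giraph
# Build against your Hadoop version; "hadoop_1" is an assumed profile name --
# check pom.xml or `mvn help:all-profiles` for the ones that actually exist.
mvn -Phadoop_1 -DskipTests clean package
```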
Date: Mon, 7 Apr 2014 16:27:09 -0700
From: vikesh@stanford.edu
To: user@giraph.apache.org
Subject: [Solved] Giraph job hangs indefinitely and is eventually killed by JobTracker

Hi, 

Thanks for the help! It turns out this was happening because /etc/hosts had an outdated (dynamic) IP address for the host that was being used as the master. Giraph was probably failing to communicate with the master throughout and getting stuck indefinitely.
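For anyone hitting the same symptom: a quick way to spot a stale /etc/hosts entry is to compare it against what the machine actually resolves and holds right now. A sketch (command availability varies by distro):

```shell
# What /etc/hosts claims for this machine's hostname (may be a stale static entry)
grep -w "$(hostname)" /etc/hosts || echo "no /etc/hosts entry for $(hostname)"
# What the resolver returns for the same name
getent hosts "$(hostname)" || true
# The addresses the interfaces actually hold right now (Linux)
hostname -I 2>/dev/null || true
```

If the grep line and the actual interface address disagree, every worker that looked up the master by hostname will be talking to a dead address, which matches the indefinite hang described above.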
Thanks,
Vikesh Khanna,
Masters, Computer Science (Class of 2015)
Stanford University


From: "Vikesh Khanna" <vikesh@stanford.edu>
To: user@giraph.apache.org
Sent: Monday, April 7, 2014 2:58:13 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

@Pankaj, I am running the ShortestPath example on a tiny graph now (5 nodes). It is also hanging indefinitely in exactly the same way. This machine has 1 TB of memory and I have used -Xmx25g (25 GB) as the Java options, so hopefully this is not a memory limitation. [(free/total/max) = 1706.68M / 1979.75M / 25242.25M]

@Lukas, I am trying to run the example packaged with the Giraph installation, SimpleShortestPathsVertex. I haven't written any code myself yet; I'm just trying to get this to work first. I am not getting any memory exception, and no dump file is being generated at the dump path.

$HADOOP_HOME/bin/hadoop jar ~/.local/bin/giraph-examples.jar org.apache.giraph.GiraphRunner -D giraph.logLevel="all" -libjars ~/.local/bin/giraph-core.jar org.apache.giraph.examples.SimpleShortestPathsVertex -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/vikesh/input/tiny_graph.txt -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/vikesh/shortestPaths8 -ca SimpleShortestPathsVertex.source=2 -w 1
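In case the hang is input-related: JsonLongDoubleFloatDoubleVertexInputFormat expects one JSON array per line of the form [source_id, vertex_value, [[dest_id, edge_weight], ...]]. A sketch of a 5-vertex file in that shape (the graph contents follow the Giraph quick-start example; /tmp is illustrative, the real file must sit in HDFS at the -vip path):

```shell
# Write a 5-vertex sample graph in the JsonLongDoubleFloatDouble line format:
# [source_id, vertex_value, [[dest_id, edge_weight], ...]]
cat > /tmp/tiny_graph.txt <<'EOF'
[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
EOF
# One vertex per line
wc -l < /tmp/tiny_graph.txt
```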
I am printing debug-level logs now, and I am seeing these calls repeat indefinitely in both the zookeeper and worker tasks:

2014-04-07 14:45:32,325 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 8
2014-04-07 14:45:35,326 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 sending #34
2014-04-07 14:45:35,327 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 got value #34
2014-04-07 14:45:35,327 DEBUG org.apache.hadoop.ipc.RPC: Call: ping 2
2014-04-07 14:45:38,328 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 sending #35
2014-04-07 14:45:38,329 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 got value #35
2014-04-07 14:45:38,329 DEBUG org.apache.hadoop.ipc.RPC: Call: ping 1
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Got timed signaled of false
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Wait for 0
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Got timed signaled of false
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Wait for 0

These calls go on for 10 minutes and then the job is killed by Hadoop.
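For reference, the 10-minute kill comes from Hadoop's mapred.task.timeout. While debugging it can be raised in mapred-site.xml (value in milliseconds); note that raising it only buys time to inspect the hang, it does not fix the underlying problem:

```xml
<property>
  <name>mapred.task.timeout</name>
  <!-- 30 minutes instead of the default 600000 (10 minutes) -->
  <value>1800000</value>
</property>
```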
Thanks,
Vikesh Khanna,
Masters, Computer Science (Class of 2015)
Stanford University


From: "Lukas Nalezenec" <lukas.nalezenec@firma.seznam.cz>
To: user@giraph.apache.org
Sent: Monday, April 7, 2014 4:13:23 AM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker


  
    
  
  
Hi,

Try making and analyzing a memory dump after the exception (JVM param -XX:+HeapDumpOnOutOfMemoryError).

What configuration (mainly which Partition class) do you use?

Lukas

On 7.4.2014 11:45, Vikesh Khanna wrote:

    
    
      
      
Hi,

Any ideas why Giraph waits indefinitely? I've been stuck on this for a long time now.

Thanks,
Vikesh Khanna,
Masters, Computer Science (Class of 2015)
Stanford University

From: "Vikesh Khanna" <vikesh@stanford.edu>
To: user@giraph.apache.org
Sent: Friday, April 4, 2014 6:06:51 AM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

          

          
          
Hi Avery,

I tried both the options. It does appear to be a GC problem; the problem continues with the second option as well :(. I have attached the logs after enabling the first set of options and using 1 worker. It would be very helpful if you could take a look.

This machine has 1 TB of memory. We ran benchmarks of various other graph libraries on this machine and they worked fine (even with graphs 10x larger than the Giraph PageRank benchmark: 40 million nodes). I am sure Giraph would work fine as well; this should not be a resource constraint.

Thanks,
Vikesh Khanna,
Masters, Computer Science (Class of 2015)
Stanford University

From: "Avery Ching" <aching@apache.org>
To: user@giraph.apache.org
Sent: Thursday, April 3, 2014 7:26:56 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

              

              
This is for a single worker, it appears. Most likely your worker went into GC and never returned. You can try with GC logging turned on; try adding something like:

-XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc

You could also try the concurrent mark/sweep collector:

-XX:+UseConcMarkSweepGC

Any chance you can use more workers and/or get more memory?
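In a pseudo-distributed setup, these JVM flags would go into the child task options in mapred-site.xml. A sketch (the -Xmx25g figure is taken from earlier in this thread; merge with whatever options you already set):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx25g -XX:+UseConcMarkSweepGC -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc</value>
</property>
```

The GC log lines then appear in each task's stdout log in the JobTracker UI, which is enough to confirm or rule out a collector that never returns.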

                

Avery

On 4/3/14, 5:46 PM, Vikesh Khanna wrote:

              
              
                
@Avery,

Thanks for the help. I checked out the task logs, and it turns out there was a "GC overhead limit exceeded" exception, due to which the benchmarks wouldn't even load the vertices. I got around it by increasing the heap size (mapred.child.java.opts) in mapred-site.xml. The benchmark is loading vertices now. However, the job is still getting stuck indefinitely (and eventually killed). I have attached the small log for the map task on 1 worker. I would really appreciate it if you could help me understand the cause.
                  

                  
Thanks,
Vikesh Khanna,
Masters, Computer Science (Class of 2015)
Stanford University

From: "Praveen kumar s.k" <skpraveenkumar9@gmail.com>
To: user@giraph.apache.org
Sent: Thursday, April 3, 2014 4:40:07 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

                    

                    
You have given -w 30; make sure that that many map task slots are configured in your cluster.
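For a pseudo-distributed cluster, the number of simultaneous map slots is set in mapred-site.xml. With -w 30 plus a master task, something over 31 slots would be needed (a sketch using the Hadoop 0.20/1.x property name matching the release used in this thread; the value 34 leaves a little headroom and is only illustrative):

```xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>34</value>
</property>
```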

                    

                    
On Thu, Apr 3, 2014 at 6:24 PM, Avery Ching <aching@apache.org> wrote:

> My guess is that you don't get your resources. It would be very helpful to
> print the master log. You can find it when the job is running to look at
> the Hadoop counters on the job UI page.
>
> Avery
>
> On 4/3/14, 12:49 PM, Vikesh Khanna wrote:
>
> Hi,
>
> I am running the PageRank benchmark under giraph-examples from the giraph-1.0.0
> release. I am using the following command to run the job (as mentioned here):
>
> vikesh@madmax /lfs/madmax/0/vikesh/usr/local/giraph/giraph-examples/src/main/java/org/apache/giraph/examples
> $ $HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 50000000 -w 30
>
> However, the job gets stuck at map 9% and is eventually killed by the
> JobTracker on reaching the mapred.task.timeout (default 10 minutes). I tried
> increasing the timeout to a very large value, and the job went on for over 8
> hours without completion. I also tried the ShortestPathsBenchmark, which
> also fails the same way.
>
> Any help is appreciated.
>
> ****** ---------------- ***********
>
> Machine details:
>
> Linux version 2.6.32-279.14.1.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org)
> (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)) #1 SMP Tue Nov 6 23:43:09 UTC 2012
>
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                64
> On-line CPU(s) list:   0-63
> Thread(s) per core:    1
> Core(s) per socket:    8
> CPU socket(s):         8
> NUMA node(s):          8
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 47
> Stepping:              2
> CPU MHz:               1064.000
> BogoMIPS:              5333.20
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              24576K
> NUMA node0 CPU(s):     1-8
> NUMA node1 CPU(s):     9-16
> NUMA node2 CPU(s):     17-24
> NUMA node3 CPU(s):     25-32
> NUMA node4 CPU(s):     0,33-39
> NUMA node5 CPU(s):     40-47
> NUMA node6 CPU(s):     48-55
> NUMA node7 CPU(s):     56-63
>
> I am using a pseudo-distributed Hadoop cluster on a single machine with
> 64 cores.
>
> *****-------------*******
>
> Thanks,
> Vikesh Khanna,
> Masters, Computer Science (Class of 2015)
> Stanford University

                  
                  

                  
                
              
              

            
            

            
          
        
        

        
      
    
    

  



 		 	   		  