incubator-giraph-user mailing list archives

From Inci Cetindil <icetin...@gmail.com>
Subject Re: PageRankBenchmark fails due to IllegalStateException
Date Sat, 03 Dec 2011 07:47:46 GMT
Hi Avery,

I finally succeeded in running the benchmark. The problem was not the port, but the IP resolution.

After removing the mapping from 127.0.0.1 to the node names in the /etc/hosts files, it worked
like a charm!  I guess Hadoop has a different code path for determining which IP it should listen on,
so normal Hadoop jobs worked with the previous network configuration.
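
In case it helps others, our /etc/hosts entries were roughly of the form below (host names from my cluster; yours will differ), so the Giraph workers resolved each other's hostnames to the loopback address:

    127.0.0.1    localhost rainbow-01

Keeping only the real address for each node, e.g.

    192.168.100.1    rainbow-01

fixed the problem.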

Thanks for your help!
Inci

On Dec 2, 2011, at 11:06 AM, Avery Ching wrote:

> You can actually set the starting RPC port to change it from 30000 by adding the appropriate configuration (i.e. hadoop jar giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -Dgiraph.rpcInitialPort=<your starting port> -e 1 -s 3 -v -V 500 -w 5).
> 
> I would make sure that those ports are open for communication from one node in your cluster to another.  I don't think that anyone else has run into this problem yet...
> 
> Since the job does take some time to fail, you might want to start it up and then try to telnet to its RPC port from another machine in the cluster and see if that succeeds.
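>
> For example, once the job is up, from another node try something like
>
>     telnet rainbow-01 30004
>
> (host and port taken from the logs below); if that connection is also refused, the server probably never bound to the port, or a firewall is rejecting it.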
> 
> Hope that helps,
> 
> Avery
> 
> On 12/1/11 11:04 PM, Inci Cetindil wrote:
>> I have tried it with various numbers of workers and it only worked with 1 worker.
>> 
>> I am not running multiple Giraph jobs at the same time; does it always use ports 30000 and up? I checked the ports in use with the "netstat" command and didn't see any of the ports 30000-30005.
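>>
>> For reference, I checked with something along the lines of
>>
>>     netstat -an | grep 3000
>>
>> on each node, and nothing was listening in that range.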
>> 
>> Inci
>> 
>> On Dec 1, 2011, at 7:03 PM, Avery Ching wrote:
>> 
>>> Hmmm...this is unusual.  I wonder if it is tied to the weird number of tasks you are getting.  Can you try it with various numbers of workers (e.g. 1, 2) and see if it works?
>>> 
>>> To me, the connection refused error indicates that perhaps the server failed to bind to its port (are you running multiple Giraph jobs at the same time?) or that the server died.
>>> 
>>> Avery
>>> 
>>> On 12/1/11 5:33 PM, Inci Cetindil wrote:
>>>> I am sure the machines can communicate with each other and the ports are not blocked. I can run a word-count Hadoop job without any problem on these machines. My Hadoop version is 0.20.203.0.
>>>> 
>>>> Thanks,
>>>> Inci
>>>> 
>>>> On Dec 1, 2011, at 3:57 PM, Avery Ching wrote:
>>>> 
>>>>> Thanks for the logs.  I see a lot of issues like the following:
>>>>> 
>>>>> 2011-12-01 00:04:46,241 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 0 time(s).
>>>>> 2011-12-01 00:04:47,243 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 1 time(s).
>>>>> 2011-12-01 00:04:48,245 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 2 time(s).
>>>>> 2011-12-01 00:04:49,247 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 3 time(s).
>>>>> 2011-12-01 00:04:50,249 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 4 time(s).
>>>>> 2011-12-01 00:04:51,251 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 5 time(s).
>>>>> 2011-12-01 00:04:52,253 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 6 time(s).
>>>>> 2011-12-01 00:04:53,255 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 7 time(s).
>>>>> 2011-12-01 00:04:54,256 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 8 time(s).
>>>>> 2011-12-01 00:04:55,258 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rainbow-01/192.168.100.1:30004. Already tried 9 time(s).
>>>>> 2011-12-01 00:04:55,261 WARN org.apache.giraph.comm.BasicRPCCommunications: connectAllRPCProxys: Failed on attempt 0 of 5 to connect to (id=0,cur=Worker(hostname=rainbow-01, MRpartition=4, port=30004),prev=null,ckpt_file=null)
>>>>> java.net.ConnectException: Call to rainbow-01/192.168.100.1:30004 failed on connection exception: java.net.ConnectException: Connection refused
>>>>> 
>>>>> Are you sure that your machines can communicate with each other?  Are the ports 30000 and up blocked?  And you're right, you should have had only 6 tasks.  What version of Hadoop is this on?
>>>>> 
>>>>> Avery
>>>>> 
>>>>> On 12/1/11 2:43 PM, Inci Cetindil wrote:
>>>>>> Hi Avery,
>>>>>> 
>>>>>> I attached the logs for the first attempts. The weird thing is that even though I specified the number of workers as 5, I had 8 mapper tasks. You can see in the logs that tasks 6 and 7 failed immediately. Do you have any explanation for this behavior? Normally I should have 6 tasks, right?
>>>>>> 
>>>>>> Thanks,
>>>>>> Inci
>>>>>> 
>>>>>> On Dec 1, 2011, at 11:00 AM, Avery Ching wrote:
>>>>>> 
>>>>>>> Hi Inci,
>>>>>>> 
>>>>>>> I am not sure what's wrong.  I ran the exact same command on a freshly checked-out version of Giraph without any trouble.  Here's my output:
>>>>>>> 
>>>>>>> hadoop jar target/giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 500 -w 5
>>>>>>> Using org.apache.giraph.benchmark.PageRankBenchmark$PageRankVertex
>>>>>>> 11/12/01 10:58:05 WARN bsp.BspOutputFormat: checkOutputSpecs: ImmutableOutputCommiter will not check anything
>>>>>>> 11/12/01 10:58:05 INFO mapred.JobClient: Running job: job_201112011054_0003
>>>>>>> 11/12/01 10:58:06 INFO mapred.JobClient:  map 0% reduce 0%
>>>>>>> 11/12/01 10:58:23 INFO mapred.JobClient:  map 16% reduce 0%
>>>>>>> 11/12/01 10:58:35 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Job complete: job_201112011054_0003
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient: Counters: 31
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:   Job Counters
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=77566
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Launched map tasks=6
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:   Giraph Timers
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Total (milliseconds)=13468
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Superstep 3 (milliseconds)=41
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Setup (milliseconds)=11691
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Shutdown (milliseconds)=73
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Vertex input superstep (milliseconds)=369
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Superstep 0 (milliseconds)=674
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Superstep 2 (milliseconds)=519
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Superstep 1 (milliseconds)=96
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:   Giraph Stats
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Aggregate edges=500
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Superstep=4
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Last checkpointed superstep=2
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Current workers=5
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Current master task partition=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Sent messages=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Aggregate finished vertices=500
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Aggregate vertices=500
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:   File Output Format Counters
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Bytes Written=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:   FileSystemCounters
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     FILE_BYTES_READ=590
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     HDFS_BYTES_READ=264
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=129240
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=55080
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:   File Input Format Counters
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Bytes Read=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:   Map-Reduce Framework
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Map input records=6
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Spilled Records=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     Map output records=0
>>>>>>> 11/12/01 10:58:40 INFO mapred.JobClient:     SPLIT_RAW_BYTES=264
>>>>>>> 
>>>>>>> 
>>>>>>> Would it be possible to send me the logs from the first attempts for every map task?
>>>>>>> 
>>>>>>> i.e. from
>>>>>>> Task attempt_201111302343_0002_m_000000_0
>>>>>>> Task attempt_201111302343_0002_m_000001_0
>>>>>>> Task attempt_201111302343_0002_m_000002_0
>>>>>>> Task attempt_201111302343_0002_m_000003_0
>>>>>>> Task attempt_201111302343_0002_m_000004_0
>>>>>>> Task attempt_201111302343_0002_m_000005_0
>>>>>>> 
>>>>>>> I think that could help us find the issue.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Avery
>>>>>>> 
>>>>>>> On 12/1/11 1:17 AM, Inci Cetindil wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I'm running the PageRank benchmark example on a cluster with 1 master + 5 slave nodes. I tried it with a large number of vertices; when that failed, I decided to first make it run with 500 vertices and 5 workers.  However, it doesn't work even for 500 vertices.
>>>>>>>> I am using the latest version of Giraph from the trunk and running the following command:
>>>>>>>> 
>>>>>>>> hadoop jar giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 500 -w 5
>>>>>>>> 
>>>>>>>> I attached the error message that I am receiving. Please let me know if I am missing something.
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Inci
>>>>>>>> 
> 

