hama-user mailing list archives

From Behroz Sikander <behro...@gmail.com>
Subject Re: Groomserver BSPPeerChild limit
Date Sun, 28 Jun 2015 22:34:29 GMT
To figure out the issue, I was trying something else and found another
weird issue. It might be a bug in Hama, but I am not sure. Both of the
following lines throw an exception.

System.out.println(peer.getPeerName(0)); // throws the exception below

System.out.println(peer.getNumPeers()); // throws the exception below


[time] ERROR bsp.BSPTask: Error running bsp setup and bsp function.

[time] java.lang.RuntimeException: All peer names could not be retrieved!

        at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.getAllPeerNames(ZooKeeperSyncClientImpl.java:305)
        at org.apache.hama.bsp.BSPPeerImpl.initPeerNames(BSPPeerImpl.java:544)
        at org.apache.hama.bsp.BSPPeerImpl.getNumPeers(BSPPeerImpl.java:538)
        at testHDFS.EVADMMBsp.setup(EVADMMBsp.java:58)
        at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:170)
        at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
        at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1243)
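
For reference, a minimal sketch of a setup() that reproduces these two calls
(the class name and the NullWritable type parameters are placeholders, not the
actual EVADMMBsp code). Both calls end up in ZooKeeperSyncClientImpl.getAllPeerNames(),
which is where the RuntimeException above is thrown when the peer list cannot be
read from ZooKeeper:

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hama.bsp.BSP;
    import org.apache.hama.bsp.BSPPeer;
    import org.apache.hama.bsp.sync.SyncException;

    public class PeerInfoBsp extends
        BSP<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> {

      @Override
      public void setup(
          BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> peer)
          throws IOException, SyncException, InterruptedException {
        // Resolving peer 0 requires the full peer list, fetched from ZooKeeper.
        System.out.println(peer.getPeerName(0));
        // getNumPeers() initializes the same peer name list, so it fails the same way.
        System.out.println(peer.getNumPeers());
      }

      @Override
      public void bsp(
          BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> peer)
          throws IOException, SyncException, InterruptedException {
        // Empty on purpose: in the failing tasks the exception is raised before bsp() runs.
      }
    }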

On Sun, Jun 28, 2015 at 6:45 PM, Behroz Sikander <behroz89@gmail.com> wrote:

> I think I have more information on the issue. I did some debugging and
> found something quite strange.
>
> If I submit my job with 6 tasks (3 tasks run on MACHINE1 and 3 tasks on
> MACHINE2):
>
>  - The 3 tasks on MACHINE1 are frozen, and the strange thing is that the
> processes do not even enter the setup() function of the BSP class. I have
> print statements in setup(), but nothing is printed; the task log files
> stay empty (zero bytes).
>
> drwxrwxr-x  2 behroz behroz 4096 Jun 28 16:29 .
> drwxrwxr-x 99 behroz behroz 4096 Jun 28 16:28 ..
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000000_0.err
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000000_0.log
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000001_0.err
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000001_0.log
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000002_0.err
> -rw-rw-r--  1 behroz behroz    0 Jun 28 16:24 attempt_201506281624_0001_000002_0.log
>
> - On MACHINE2, the code enters the setup() function of the BSP class and
> prints output; note the sizes of the generated log files. How is it
> possible that 3 tasks can enter the BSP code while the other 3 cannot?
>
> drwxrwxr-x  2 behroz behroz 4096 Jun 28 16:39 .
> drwxrwxr-x 82 behroz behroz 4096 Jun 28 16:39 ..
> -rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000003_0.err
> -rw-rw-r--  1 behroz behroz 1441 Jun 28 16:39 attempt_201506281639_0001_000003_0.log
> -rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000004_0.err
> -rw-rw-r--  1 behroz behroz 1368 Jun 28 16:39 attempt_201506281639_0001_000004_0.log
> -rw-rw-r--  1 behroz behroz  659 Jun 28 16:39 attempt_201506281639_0001_000005_0.err
> -rw-rw-r--  1 behroz behroz 1441 Jun 28 16:39 attempt_201506281639_0001_000005_0.log
>
> - Hama Groom log file on MACHINE1 (which is frozen) shows:
> [time] INFO org.apache.hama.bsp.GroomServer: Task
> 'attempt_201506281639_0001_000001_0' has started.
> [time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
> [time] INFO org.apache.hama.bsp.GroomServer: Task
> 'attempt_201506281639_0001_000002_0' has started.
> [time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
> [time] INFO org.apache.hama.bsp.GroomServer: Task
> 'attempt_201506281639_0001_000000_0' has started.
>
> - Hama Groom log file on MACHINE2 shows
> [time] INFO org.apache.hama.bsp.GroomServer: Task
> 'attempt_201506281639_0001_000003_0' has started.
> [time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
> [time] INFO org.apache.hama.bsp.GroomServer: Task
> 'attempt_201506281639_0001_000004_0' has started.
> [time] INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
> [time] INFO org.apache.hama.bsp.GroomServer: Task
> 'attempt_201506281639_0001_000005_0' has started.
> [time] INFO org.apache.hama.bsp.GroomServer: Task
> attempt_201506281639_0001_000004_0 is done.
> [time] INFO org.apache.hama.bsp.GroomServer: Task
> attempt_201506281639_0001_000003_0 is done.
> [time] INFO org.apache.hama.bsp.GroomServer: Task
> attempt_201506281639_0001_000005_0 is done.
>
> Any clue as to what might be going wrong?
>
> Regards,
> Behroz
>
>
>
> On Sat, Jun 27, 2015 at 1:13 PM, Behroz Sikander <behroz89@gmail.com>
> wrote:
>
>> Here is the log file from that folder
>>
>> 15/06/27 11:10:34 INFO ipc.Server: Starting Socket Reader #1 for port
>> 61001
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server Responder: starting
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server listener on 61001: starting
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 0 on 61001: starting
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 1 on 61001: starting
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 2 on 61001: starting
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 3 on 61001: starting
>> 15/06/27 11:10:34 INFO message.HamaMessageManagerImpl: BSPPeer
>> address:b178b33b16cc port:61001
>> 15/06/27 11:10:34 INFO ipc.Server: IPC Server handler 4 on 61001: starting
>> 15/06/27 11:10:34 INFO sync.ZKSyncClient: Initializing ZK Sync Client
>> 15/06/27 11:10:34 INFO sync.ZooKeeperSyncClientImpl: Start connecting to
>> Zookeeper! At b178b33b16cc/172.17.0.7:61001
>> 15/06/27 11:10:37 INFO ipc.Server: Stopping server on 61001
>> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 0 on 61001: exiting
>> 15/06/27 11:10:37 INFO ipc.Server: Stopping IPC Server listener on 61001
>> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 1 on 61001: exiting
>> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 2 on 61001: exiting
>> 15/06/27 11:10:37 INFO ipc.Server: Stopping IPC Server Responder
>> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 3 on 61001: exiting
>> 15/06/27 11:10:37 INFO ipc.Server: IPC Server handler 4 on 61001: exiting
>>
>>
>> And my console shows the following output. Hama is frozen right now.
>> 15/06/27 11:10:32 INFO bsp.BSPJobClient: Running job:
>> job_201506262331_0003
>> 15/06/27 11:10:35 INFO bsp.BSPJobClient: Current supersteps number: 0
>> 15/06/27 11:10:38 INFO bsp.BSPJobClient: Current supersteps number: 2
>>
>> On Sat, Jun 27, 2015 at 1:07 PM, Edward J. Yoon <edwardyoon@apache.org>
>> wrote:
>>
>>> Please check the task logs in $HAMA_HOME/logs/tasklogs folder.
>>>
>>> On Sat, Jun 27, 2015 at 8:03 PM, Behroz Sikander <behroz89@gmail.com>
>>> wrote:
>>> > Yeah, I also thought that. I ran the program through Eclipse with 20
>>> > tasks and it works fine.
>>> >
>>> > On Sat, Jun 27, 2015 at 1:00 PM, Edward J. Yoon <edwardyoon@apache.org
>>> >
>>> > wrote:
>>> >
>>> >> > When I run the PI example, it uses 9 tasks and runs fine. When I
>>> >> > run my program with 3 tasks, everything runs fine. But when I
>>> >> > increase the tasks (to 4) by using "setNumBspTask", Hama freezes.
>>> >> > I do not understand what can go wrong.
>>> >>
>>> >> It looks like a program bug. Have you run your program in local mode?
>>> >>
>>> >> On Sat, Jun 27, 2015 at 8:03 AM, Behroz Sikander <behroz89@gmail.com>
>>> >> wrote:
>>> >> > Hi,
>>> >> > In the current thread, I mentioned 3 issues. Issues 1 and 3 are
>>> >> > resolved, but issue 2 is still giving me headaches.
>>> >> >
>>> >> > My problem:
>>> >> > My cluster now consists of 3 machines, each of them (apparently)
>>> >> > properly configured. From my master machine, when I start Hadoop and
>>> >> > Hama, I can see the processes start on the other 2 machines. If I
>>> >> > check the maximum number of tasks that my cluster can support, I
>>> >> > get 9 (3 tasks on each machine).
>>> >> >
>>> >> > When I run the PI example, it uses 9 tasks and runs fine. When I
>>> >> > run my program with 3 tasks, everything runs fine. But when I
>>> >> > increase the tasks (to 4) by using "setNumBspTask", Hama freezes.
>>> >> > I do not understand what can go wrong.
>>> >> >
>>> >> > I checked the log files and things look fine. I just sometimes get
>>> >> > an exception that Hama was not able to delete the system directory
>>> >> > (bsp.system.dir) defined in hama-site.xml.
>>> >> >
>>> >> > Any help or clue would be great.
>>> >> >
>>> >> > Regards,
>>> >> > Behroz Sikander
>>> >> >
>>> >> > On Thu, Jun 25, 2015 at 1:13 PM, Behroz Sikander <
>>> behroz89@gmail.com>
>>> >> wrote:
>>> >> >
>>> >> >> Thank you :)
>>> >> >>
>>> >> >> On Thu, Jun 25, 2015 at 12:14 AM, Edward J. Yoon <
>>> edwardyoon@apache.org
>>> >> >
>>> >> >> wrote:
>>> >> >>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> You can get the maximum number of available tasks with the
>>> >> >>> following code:
>>> >> >>>
>>> >> >>>     BSPJobClient jobClient = new BSPJobClient(conf);
>>> >> >>>     ClusterStatus cluster = jobClient.getClusterStatus(true);
>>> >> >>>
>>> >> >>>     // Set to maximum
>>> >> >>>     bsp.setNumBspTask(cluster.getMaxTasks());
>>> >> >>>
>>> >> >>>
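
For completeness, a hedged sketch of where that snippet sits in a job driver,
modeled on the Pi example (the class names here are placeholders, e.g. the
PeerInfoBsp class from the sketch earlier in this message, not the actual program):

    import org.apache.hama.HamaConfiguration;
    import org.apache.hama.bsp.BSPJob;
    import org.apache.hama.bsp.BSPJobClient;
    import org.apache.hama.bsp.ClusterStatus;

    public class PeerInfoDriver {
      public static void main(String[] args) throws Exception {
        HamaConfiguration conf = new HamaConfiguration();
        BSPJob job = new BSPJob(conf, PeerInfoDriver.class);
        job.setJobName("peer info test");
        job.setBspClass(PeerInfoBsp.class);

        // Ask the BSPMaster for the total number of task slots currently available
        // (bsp.tasks.maximum per groom, summed over all grooms) and cap the job there.
        BSPJobClient jobClient = new BSPJobClient(conf);
        ClusterStatus cluster = jobClient.getClusterStatus(true);
        job.setNumBspTask(cluster.getMaxTasks());

        job.waitForCompletion(true);
      }
    }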
>>> >> >>> On Wed, Jun 24, 2015 at 11:20 PM, Behroz Sikander <
>>> behroz89@gmail.com>
>>> >> >>> wrote:
>>> >> >>> > Hi,
>>> >> >>> > 1) Thank you for this.
>>> >> >>> > 2) Here are the images. I will look into the log files of the
>>> >> >>> > PI example.
>>> >> >>> >
>>> >> >>> > Result of JPS command on slave
>>> >> >>> >
>>> >> >>>
>>> >>
>>> http://s17.postimg.org/gpwe2bbfj/Screen_Shot_2015_06_22_at_7_23_31_PM.png
>>> >> >>> >
>>> >> >>> > Result of JPS command on Master
>>> >> >>> >
>>> >> >>>
>>> >>
>>> http://s14.postimg.org/s9922em5p/Screen_Shot_2015_06_22_at_7_23_42_PM.png
>>> >> >>> >
>>> >> >>> > 3) In my current case, I do not have any input submitted to the
>>> >> >>> > job. At run time, I fetch data directly from HDFS. So I am
>>> >> >>> > looking for something like a BSPJob.setMaxNumBspTask().
>>> >> >>> >
>>> >> >>> > Regards,
>>> >> >>> > Behroz
>>> >> >>> >
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > On Tue, Jun 23, 2015 at 12:57 AM, Edward J. Yoon <
>>> >> edwardyoon@apache.org
>>> >> >>> >
>>> >> >>> > wrote:
>>> >> >>> >
>>> >> >>> >> Hello,
>>> >> >>> >>
>>> >> >>> >> 1) You can get the filesystem URI from a configuration using
>>> >> >>> >> "FileSystem fs = FileSystem.get(conf);". Of course, the
>>> >> >>> >> fs.defaultFS property should be in hama-site.xml:
>>> >> >>> >>
>>> >> >>> >>   <property>
>>> >> >>> >>     <name>fs.defaultFS</name>
>>> >> >>> >>     <value>hdfs://host1.mydomain.com:9000/</value>
>>> >> >>> >>     <description>
>>> >> >>> >>       The name of the default file system. Either
the literal
>>> string
>>> >> >>> >>       "local" or a host:port for HDFS.
>>> >> >>> >>     </description>
>>> >> >>> >>   </property>
>>> >> >>> >>
>>> >> >>> >> 2) The 'bsp.tasks.maximum' is the number of tasks per node. It
>>> >> >>> >> looks like a cluster configuration issue. Please run the Pi
>>> >> >>> >> example and look at the logs for more details. NOTE: you
>>> >> >>> >> cannot attach images to the mailing list, so I can't see them.
>>> >> >>> >>
>>> >> >>> >> 3) You can use the BSPJob.setNumBspTask(int) method. If input
>>> >> >>> >> is provided, the number of BSP tasks is basically driven by
>>> >> >>> >> the number of DFS blocks. I'll fix it to be more flexible on
>>> >> >>> >> HAMA-956.
>>> >> >>> >>
>>> >> >>> >> Thanks!
>>> >> >>> >>
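
As a small illustration of point 1) above, a minimal sketch (the class name and
the path are made up for illustration) of letting the configuration resolve HDFS
instead of passing an explicit URI. HamaConfiguration loads hama-default.xml and
hama-site.xml, so FileSystem.get(conf) resolves whatever fs.defaultFS points at:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hama.HamaConfiguration;

    public class DefaultFsCheck {
      public static void main(String[] args) throws IOException {
        Configuration conf = new HamaConfiguration(); // picks up hama-site.xml
        FileSystem fs = FileSystem.get(conf);         // resolved via fs.defaultFS
        System.out.println("Default filesystem: " + fs.getUri());
        // Hypothetical path, only to show that no hard-coded namenode URI is needed.
        System.out.println("Exists: " + fs.exists(new Path("/user/behroz/input")));
      }
    }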
>>> >> >>> >>
>>> >> >>> >> On Tue, Jun 23, 2015 at 2:33 AM, Behroz Sikander
<
>>> >> behroz89@gmail.com>
>>> >> >>> >> wrote:
>>> >> >>> >> > Hi,
>>> >> >>> >> > Recently, I moved from a single-machine setup to a 2-machine
>>> >> >>> >> > setup. I was successfully able to run my job, which uses
>>> >> >>> >> > HDFS to get data. I have 3 trivial questions:
>>> >> >>> >> >
>>> >> >>> >> > 1- To access HDFS, I have to manually give the IP address of
>>> >> >>> >> > the server running HDFS. I thought that Hama would pick it up
>>> >> >>> >> > automatically from the configuration, but it does not. I am
>>> >> >>> >> > probably doing something wrong. Right now my code works by
>>> >> >>> >> > using the following:
>>> >> >>> >> >
>>> >> >>> >> > FileSystem fs = FileSystem.get(new URI("hdfs://server_ip:port/"), conf);
>>> >> >>> >> >
>>> >> >>> >> > 2- On my master server, when I start Hama it automatically
>>> >> >>> >> > starts Hama on the slave machine (all good). Both master and
>>> >> >>> >> > slave are set as groomservers. This means that I have 2
>>> >> >>> >> > servers to run my job, which means that I can open more
>>> >> >>> >> > BSPPeerChild processes. If I submit my jar with 3 BSP tasks,
>>> >> >>> >> > everything works fine. But when I move to 4 tasks, Hama
>>> >> >>> >> > freezes. Here is the result of the jps command on the slave.
>>> >> >>> >> >
>>> >> >>> >> >
>>> >> >>> >> > Result of JPS command on Master
>>> >> >>> >> >
>>> >> >>> >> >
>>> >> >>> >> >
>>> >> >>> >> > You can see that it is only opening tasks on the slave but
>>> >> >>> >> > not on the master.
>>> >> >>> >> >
>>> >> >>> >> > Note: I tried to change the bsp.tasks.maximum property in
>>> >> >>> >> > hama-default.xml to 4, but got the same result.
>>> >> >>> >> >
>>> >> >>> >> > 3- I want my cluster to open as many BSPPeerChild processes
>>> >> >>> >> > as possible. Is there any setting I can use to achieve that?
>>> >> >>> >> > Or does Hama pick up the values from hama-default.xml to
>>> >> >>> >> > open tasks?
>>> >> >>> >> >
>>> >> >>> >> >
>>> >> >>> >> > Regards,
>>> >> >>> >> >
>>> >> >>> >> > Behroz Sikander
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> --
>>> >> >>> >> Best Regards, Edward J. Yoon
>>> >> >>> >>
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> --
>>> >> >>> Best Regards, Edward J. Yoon
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Best Regards, Edward J. Yoon
>>> >>
>>>
>>>
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>>
>>
>>
>
