hama-user mailing list archives

From Mahesh Babu <jmb...@gmail.com>
Subject Re: Hama graph BSPJobClient Job failing - not able to identify the reason.
Date Tue, 10 Sep 2013 05:17:09 GMT
Hi Anastasis,

I bumped up the RAM for the VM from 1GB to 2GB when I increased the graph
size from 160k nodes to 240k nodes, and it worked then. It worked even for
380k nodes, still at 2GB.
But for 1500k nodes, increasing the memory did not help. I increased it from
2GB to 4GB and then to 8GB (my host machine has sufficient memory), but I
still get the same error. Neither the debug logs nor the error logs give any
indication of what I should be looking at.

However, I get the debug errors below in the standard output. Earlier, I
thought I could ignore them based on another email thread. Are these debug
errors something we need to take seriously? Please note that small graphs
run without any problem, and I do not get these errors when I run them. For
the large graph, i.e. 1500k nodes, it is always the same error/failure.
I am not able to understand what the error means.

13/09/10 09:39:48 DEBUG bsp.BSPJobClient: part-00000
13/09/10 09:39:48 INFO bsp.BSPJobClient: Running job: job_201309100911_0003
13/09/10 09:39:51 INFO bsp.BSPJobClient: Current supersteps number: 0
*attempt_201309100911_0003_000000_0: 13/09/10 09:39:48 DEBUG
bsp.GroomServer: BSPPeerChild starting
attempt_201309100911_0003_000000_0: 13/09/10 09:39:48 DEBUG
conf.Configuration: java.io.IOException: config()
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.conf.Configuration.<init>(Configuration.java:227)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.conf.Configuration.<init>(Configuration.java:214)
attempt_201309100911_0003_000000_0:     at
org.apache.hama.HamaConfiguration.<init>(HamaConfiguration.java:33)
attempt_201309100911_0003_000000_0:     at
org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1223)
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG
conf.Configuration: java.io.IOException: config()
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.conf.Configuration.<init>(Configuration.java:227)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.conf.Configuration.<init>(Configuration.java:214)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:466)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.ipc.RPC.getProxy(RPC.java:369)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
attempt_201309100911_0003_000000_0:     at
org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1231)
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG
security.Groups:  Creating new Groups object
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG
security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=300000
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG
conf.Configuration: java.io.IOException: config()
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.conf.Configuration.<init>(Configuration.java:227)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.conf.Configuration.<init>(Configuration.java:214)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
*
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:466)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.ipc.RPC.getProxy(RPC.java:369)
attempt_201309100911_0003_000000_0:     at
org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
attempt_201309100911_0003_000000_0:     at
org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1231)
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG
security.UserGroupInformation: hadoop login
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG
security.UserGroupInformation: hadoop login commit
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG
security.UserGroupInformation: using local user:UnixPrincipal: ubuntu
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG
security.UserGroupInformation: UGI loginUser:ubuntu
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG ipc.Client: The
ping interval is60000ms.
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG ipc.Client: Use
SIMPLE authentication for protocol BSPPeerProtocol
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG ipc.Client:
Connecting to localhost/127.0.0.1:37851
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG ipc.Client: IPC
Client (47) connection to localhost/127.0.0.1:37851 from ubuntu sending #0
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG ipc.Client: IPC
Client (47) connection to localhost/127.0.0.1:37851 from ubuntu: starting,
having connections 1
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG ipc.Client: IPC
Client (47) connection to localhost/127.0.0.1:37851 from ubuntu got value #0
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG ipc.RPC: Call:
getProtocolVersion 60
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG ipc.Client: IPC
Client (47) connection to localhost/127.0.0.1:37851 from ubuntu sending #1
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG ipc.Client: IPC
Client (47) connection to localhost/127.0.0.1:37851 from ubuntu got value #1
attempt_201309100911_0003_000000_0: 13/09/10 09:39:49 DEBUG ipc.RPC: Call:
getTask 14
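
A note on the `java.io.IOException: config()` entries above: they are logged
at DEBUG level from the Hadoop `Configuration` constructor and come with no
ERROR or WARN line, so they look like diagnostic traces recording where a
configuration object was created rather than exceptions that were thrown. To
make the remaining output easier to read, they can be filtered out. The
following is only a minimal sketch; it assumes the log has been saved with
each entry on one unwrapped line, and the function name `drop_config_traces`
is made up for illustration.

```python
import re

# Assumed line shape: an optional "attempt_<id>: " prefix, then the log text.
PREFIX = re.compile(r"^(?:attempt_\S+:\s)?")
# First line of a DEBUG "config()" trace, as seen in the output above.
TRACE_START = re.compile(
    r"DEBUG conf\.Configuration: java\.io\.IOException: config\(\)"
)
# A stack frame line ("    at org.apache...").
FRAME = re.compile(r"^\s*at\s")


def drop_config_traces(lines):
    """Yield log lines, skipping DEBUG config() traces and their frames."""
    skipping = False
    for line in lines:
        body = PREFIX.sub("", line, count=1)
        if TRACE_START.search(body):
            skipping = True      # start of a config() trace: drop it
            continue
        if skipping and FRAME.match(body):
            continue             # stack frame belonging to that trace
        skipping = False
        yield line
```

Usage would be something like
`print("".join(drop_config_traces(open("attempt.log"))))`, where
`attempt.log` is a hypothetical file name.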

As for the HDFS logs: there are no errors or warnings reported in the
Hadoop/HDFS logs.

Regards,
Mahesh Babu



On Fri, Sep 6, 2013 at 10:49 PM, Anastasis Andronidis <
andronat_asf@hotmail.com> wrote:

> Hi,
>
> it might be a memory issue as you suggest. Could you also check the logs
> from HDFS?
>
> Also, can you run the program with a smaller input?
>
> Anastasis
>
> On 6 Sep 2013, at 8:03 PM, Mahesh Babu <jmbabu@gmail.com> wrote:
>
> > Hi Anastasis,
> >
> > Yes, I am able to run in standalone (local) mode. I had to increase the
> > RAM size from 1GB to 2GB for the VM that runs the standalone program.
> > The topo size is approx. 1500k vertices.
> >
> > When in pseudo mode, the job did not run as it is. I had to increase max
> > bsp tasks from 4 to 5 (previous error). After increasing it to 5, it
> > proceeds further and then fails as in the log above.
> >
> > Thanks,
> > Mahesh Babu
> >
> >
> > On Fri, Sep 6, 2013 at 3:25 PM, Anastasis Andronidis <
> > andronat_asf@hotmail.com> wrote:
> >
> >> Hello,
> >>
> >> can you run your code in standalone mode so you can be sure that the
> >> problem is not in your code?
> >>
> >> Kindly,
> >> Anastasis
> >>
> >> On 6 Sep 2013, at 12:35 PM, Mahesh Babu <jmbabu@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> When I run a Hama job in pseudo-distributed mode (single node), I get
> >>> the following error (in stdout):
> >>>>>>>>>>>>>>>
> >>> attempt_201309061315_0005_000000_0: 13/09/06 14:01:39 DEBUG
> >>> fs.FSInputChecker: DFSClient readChunk got seqno 593 offsetInBlock
> >> 38862848
> >>> lastPacketInBlock false packetLen 66052
> >>> attempt_201309061315_0005_000000_0: 13/09/06 14:01:39 DEBUG
> >>> fs.FSInputChecker: DFSClient readChunk got seqno 594 offsetInBlock
> >> 38928384
> >>> lastPacketInBlock false packetLen 66052
> >>> attempt_201309061315_0005_000000_0: 13/09/06 14:01:40 DEBUG
> >>> fs.FSInputChecker: DFSClient readChunk got seqno 595 offsetInBlock
> >> 38993920
> >>> lastPacketInBlock false packetLen 66052
> >>> attempt_201309061315_0005_000000_0: 13/09/06 14:01:40 DEBUG
> >>> fs.FSInputChecker: DFSClient readC
> >>> *13/09/06 14:03:29 INFO bsp.BSPJobClient: Job failed.*
> >>> <<<<<<<<<<<<
> >>>
> >>>
> >>> *hama-ubuntu-bspmaster-ubuntu.log*
> >>>>>>>>>>>>>>>
> >>> 2013-09-06 14:03:21,422 DEBUG org.apache.hama.bsp.Counters: Adding
> >>> SUPERSTEP_SUM
> >>> 2013-09-06 14:03:23,423 DEBUG org.apache.hama.bsp.Counters: Adding
> >>> SUPERSTEP_SUM
> >>> 2013-09-06 14:03:25,424 DEBUG org.apache.hama.bsp.Counters: Adding
> >>> SUPERSTEP_SUM
> >>> *2013-09-06 14:03:25,425 INFO org.apache.hama.bsp.JobInProgress: Taskid
> >>> 'attempt_201309061315_0005_000000_0' has failed.
> >>> 2013-09-06 14:03:25,425 INFO org.apache.hama.bsp.TaskInProgress: Task
> >>> 'task_201309061315_0005_000000' has failed.
> >>> *2013-09-06 14:03:25,425 DEBUG org.apache.hama.bsp.JobInProgress:
> >> Removing
> >>> /tmp/hadoop-ubuntu/bsp/local/bspMaster/job_201309061315_0005.xml and
> >>> /tmp/hadoop-ubuntu/bsp/local/bspMaster/job_201309061315_0005.jar
> >> getJobFile
> >>> =
> >>
> hdfs://localhost:9000/tmp/hadoop-ubuntu*/bsp/system/submit_714o6m/job.xml
> >>> 2013-09-06 14:03:25,434 INFO org.apache.hama.bsp.JobInProgress: Job
> >> failed.
> >>> 2013-09-06 14:03:25,434 DEBUG org.apache.hama.bsp.JobInProgress:
> Removing
> >>> null and null getJobFile =
> >>>
> hdfs://localhost:9000/tmp/hadoop-ubuntu/bsp/system/submit_714o6m/job.xml
> >>> *<<<<<<<<<<<<<
> >>>
> >>> *hama-ubuntu-groom-ubuntu.log*
> >>>>>>>>>>>>>>>
> >>> 2013-09-06 14:03:14,660 DEBUG org.apache.hama.bsp.GroomServer: checking
> >>> task: attempt_201309061315_0005_000000_0 starttime =1378456254247
> >> lastping
> >>> = 1378456334727 run state = RUNNING monitorPeriod = 10000 check = false
> >>> 2013-09-06 14:03:24,660 DEBUG org.apache.hama.bsp.GroomServer: checking
> >>> task: attempt_201309061315_0005_000000_0 starttime =1378456254247
> >> lastping
> >>> = 1378456334727 run state = RUNNING monitorPeriod = 10000 check = true
> >>> 2013-09-06 14:03:24,660 INFO org.apache.hama.bsp.GroomServer: adding
> >> purge
> >>> task: attempt_201309061315_0005_000000_0
> >>> 2013-09-06 14:03:24,660 DEBUG org.apache.hama.bsp.GroomServer: Got 1
> >>> oblivious tasks
> >>> 2013-09-06 14:03:24,661 DEBUG org.apache.hama.bsp.GroomServer: Purging
> >> task
> >>> org.apache.hama.bsp.GroomServer$TaskInProgress@2e0cd499
> >>> *2013-09-06 14:03:24,661 INFO org.apache.hama.bsp.GroomServer: About to
> >>> purge task: attempt_201309061315_0005_000000_0
> >>> 2013-09-06 14:03:24,661 DEBUG org.apache.hama.bsp.GroomServer: Killing
> >>> process for attempt_201309061315_0005_000000_0
> >>> 2013-09-06 14:03:25,436 DEBUG org.apache.hama.bsp.GroomServer: Got
> >> Response
> >>> from BSPMaster with 1 actions
> >>> 2013-09-06 14:03:25,437 INFO org.apache.hama.bsp.GroomServer: Kill 1
> >> tasks.
> >>> *<<<<<<<<<<<<<
> >>>
> >>> *attempt_201309061315_0005_000000_0.log*
> >>>>>>>>>>>>>>>
> >>> 13/09/06 14:02:06 DEBUG ipc.RPC: Call: ping 2
> >>> 13/09/06 14:02:07 DEBUG fs.FSInputChecker: DFSClient readChunk got
> seqno
> >>> 633 offsetInBlock 41484288 lastPacketInBlock false packetLen 66052
> >>> 13/09/06 14:02:14 DEBUG bsp.BSPTask: Pinging at time 1378456334726
> >>> 13/09/06 14:02:14 DEBUG ipc.Client: IPC Client (47) connection to
> >> localhost/
> >>> 127.0.0.1:49551 from ubuntu sending #24
> >>> 13/09/06 14:02:14 DEBUG ipc.Client: IPC Client (47) connection to
> >> localhost/
> >>> 127.0.0.1:49551 from ubuntu got value #24
> >>> 13/09/06 14:02:14 DEBUG ipc.RPC: Call: ping 2
> >>> 13/09/06 14:02:37 DEBUG bsp.BSPTask: Pinging at time 1378456357688
> >>> 13/09/06 14:02:37 DEBUG ipc.Client: The ping interval is60000ms.
> >>> 13/09/06 14:02:38 DEBUG ipc.Client: Use SIMPLE authentication for
> >> protocol
> >>> BSPPeerProtocol
> >>> 13/09/06 14:02:39 DEBUG ipc.Client: Connecting to localhost/
> >> 127.0.0.1:49551
> >>> 13/09/06 14:02:56 DEBUG ipc.Client: The ping interval is60000ms.
> >>> 13/09/06 14:02:56 DEBUG ipc.Client: Use SIMPLE authentication for
> >> protocol
> >>> ClientProtocol
> >>> 13/09/06 14:02:57 DEBUG ipc.Client: Connecting to localhost/
> >> 127.0.0.1:9000
> >>> 13/09/06 14:02:58 DEBUG ipc.Client: IPC Client (47) connection to
> >> localhost/
> >>> 127.0.0.1:49551 from ubuntu: closed
> >>> 13/09/06 14:02:59 DEBUG ipc.Client: IPC Client (47) connection to
> >> localhost/
> >>> 127.0.0.1:49551 from ubuntu: stopped, remaining connections 1
> >>>>>>>>>>>>>>>
> >>>
> >>> Any idea why the job is failing? There are no exceptions or failures in
> >>> any logs, even when I put the logs in DEBUG mode.
> >>>
> >>> Thanks,
> >>> Mahesh Babu
> >>
> >>
>
>
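
One thing that can be read out of the groom log quoted above: `lastping`
stays at 1378456334727 across both "checking task" entries, and the task is
added to the purge list right after the second one. A rough sketch for
computing how stale the last ping was at each check follows. It assumes the
log clock is UTC+5:30, which is inferred from the task log printing
"Pinging at time 1378456334726" at 14:02:14, and it assumes each entry has
been unwrapped onto one line with the quote markers removed. The names
`CHECK` and `ping_gap_ms` are made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: log timestamps are local time at UTC+5:30 (inferred, see above).
LOG_TZ = timezone(timedelta(hours=5, minutes=30))

# Matches the "checking task" entries from hama-ubuntu-groom-ubuntu.log.
CHECK = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) DEBUG "
    r"org\.apache\.hama\.bsp\.GroomServer: checking task: (?P<task>\S+) "
    r"starttime =\s*(?P<start>\d+) lastping =\s*(?P<ping>\d+)"
)


def ping_gap_ms(line):
    """Return ms between the task's last ping and the log entry, or None."""
    m = CHECK.match(line)
    if m is None:
        return None
    when = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
    when = when.replace(tzinfo=LOG_TZ)
    logged_ms = int(when.timestamp()) * 1000 + int(m["ms"])
    return logged_ms - int(m["ping"])
```

Applied to the two entries at 14:03:14,660 and 14:03:24,660, this gives
59933 ms and 69933 ms, so the last recorded ping was about 70 seconds old
when the task was purged. That would be consistent with the task having
stopped reporting to the groom rather than failing with an exception, but
the threshold the groom actually uses is not visible in these logs.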
