hama-user mailing list archives

From Zhuang Kechen <zhuangkec...@gmail.com>
Subject Re: out of memory problem...
Date Mon, 17 Sep 2012 13:28:52 GMT
Hi, Thomas:
Sorry to bother you. When I run some small graph tests on my cluster, a 25 MB
graph job succeeds and I get the correct output file on HDFS, but a 50 MB job
does not. When the job fails, the ZooKeeper logs end like this:
2012-09-17 21:04:27,866 WARN org.apache.zookeeper.server.NIOServerCnxn:
EndOfStreamException: Unable to read additional data from client sessionid
0x239d433755a0014, likely client has closed socket
2012-09-17 21:04:32,666 INFO org.apache.zookeeper.server.NIOServerCnxn:
Closed socket connection for client /192.168.0.2:57977 which had sessionid
0x239d433755a0014
2012-09-17 21:04:36,551 WARN org.apache.zookeeper.server.NIOServerCnxn:
EndOfStreamException: Unable to read additional data from client sessionid
0x239d433755a0013, likely client has closed socket
2012-09-17 21:04:36,989 INFO org.apache.zookeeper.server.NIOServerCnxn:
Closed socket connection for client /192.168.0.3:44924 which had sessionid
0x239d433755a0013

The GroomServer logs look like this:
2012-09-17 21:03:37,679 INFO org.apache.hama.bsp.GroomServer: Launch 3
tasks.
2012-09-17 21:03:37,982 INFO org.apache.hama.bsp.GroomServer: Task
'attempt_201008172027_0007_000002_0' has started.
2012-09-17 21:03:37,983 INFO org.apache.hama.bsp.GroomServer: Launch 3
tasks.
2012-09-17 21:03:38,073 INFO org.apache.hama.bsp.GroomServer: Task
'attempt_201008172027_0007_000000_0' has started.
2012-09-17 21:03:38,074 INFO org.apache.hama.bsp.GroomServer: Launch 3
tasks.
2012-09-17 21:03:38,325 INFO org.apache.hama.bsp.GroomServer: Task
'attempt_201008172027_0007_000001_0' has started.
2012-09-17 21:04:23,161 INFO org.apache.hama.bsp.GroomServer: adding purge
task: attempt_201008172027_0007_000000_0
2012-09-17 21:04:23,513 INFO org.apache.hama.bsp.GroomServer: adding purge
task: attempt_201008172027_0007_000002_0
2012-09-17 21:04:23,513 INFO org.apache.hama.bsp.GroomServer: About to
purge task: attempt_201008172027_0007_000000_0
2012-09-17 21:04:25,918 INFO org.apache.hama.bsp.GroomServer: About to
purge task: attempt_201008172027_0007_000002_0
2012-09-17 21:04:30,707 INFO org.apache.hama.bsp.GroomServer: Kill 1 tasks.
2012-09-17 21:04:30,929 INFO org.apache.hama.bsp.GroomServer: Kill 1 tasks.
2012-09-17 21:04:30,929 INFO org.apache.hama.bsp.GroomServer: Kill 1 tasks.
2012-09-17 21:04:33,965 INFO org.apache.hama.bsp.GroomServer: Kill 1 tasks.

The task logs end like this:
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x00a3ef26, /
192.168.0.3:34203 => 627-PC/192.168.0.5:61001] INTEREST_CHANGED
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x00a3ef26, /
192.168.0.3:34203 => 627-PC/192.168.0.5:61001] INTEREST_CHANGED
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x0057bd52, /
192.168.0.3:53962 => 624-PC/192.168.1.2:61002] INTEREST_CHANGED
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x0057bd52, /
192.168.0.3:53962 => 624-PC/192.168.1.2:61002] INTEREST_CHANGED
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x0104ae5e, /
192.168.0.3:47749 => 625-PC/192.168.0.3:61003] INTEREST_CHANGED
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x0104ae5e, /
192.168.0.3:47749 => 625-PC/192.168.0.3:61003] INTEREST_CHANGED
12/09/17 21:04:12 INFO ipc.NettyTransceiver: [id: 0x00c0499d, /
192.168.0.3:36006 => 627-PC/192.168.0.5:61003] INTEREST_CHANGED
12/09/17 21:04:12 INFO ipc.NettyTransceiver: [id: 0x00c0499d, /
192.168.0.3:36006 => 627-PC/192.168.0.5:61003] INTEREST_CHANGED
..........
Do you have any idea what might cause this kind of failure? Thanks a lot!


2012/9/15 Thomas Jungblut <thomas.jungblut@gmail.com>

> Okay, I have observed this problem as well with a 10 GB adjacency text file.
> I was running on a 75 GB instance on EC2 with 70 GB of heap, which should be
> no problem, but it fails after several steps.
> I'm profiling it now in more detail.
>
> It can't be that 10 GB of text uses more than 20 GB of heap as a graph with messages.
>
> 2012/9/14 Thomas Jungblut <thomas.jungblut@gmail.com>
>
> > I would trim the spaces in the key and value.
> > If it still crashes afterwards, I have no further ideas and would recommend
> > taking a heap dump with hprof to see what is consuming all that memory.
> >
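For reference, one way to act on that heap-dump suggestion (a sketch, not from the thread; the HotSpot flags and the jmap/jhat tools are standard JDK facilities, but the `HAMA_OPTS` variable in hama-env.sh is an assumption modeled on Hadoop's env script):

```
# Have HotSpot write a heap dump automatically when a task hits an OOM
# (standard HotSpot flags; the hama-env.sh variable name is an assumption):
export HAMA_OPTS="$HAMA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"

# Or dump a running task JVM manually with the JDK tools:
jmap -dump:live,format=b,file=/tmp/task-heap.hprof <task_pid>
jhat /tmp/task-heap.hprof    # browse the object histogram at http://localhost:7000
```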
> > 2012/9/14 庄克琛 <zhuangkechen@gmail.com>
> >
> >> Hi, I set the property in hama-site.xml:
> >>   <property>
> >>     <name> hama.messenger.queue.class </name>
> >>     <value> org.apache.hama.bsp.message.DiskQueue </value>
> >>   </property>
> >> Did I set it correctly?
> >> I restarted Hama (stop-bspd.sh and start-bspd.sh) and tried the test job
> >> again, watching the memory slowly climb to 70%, 80%, 90%, then crash... >_<
> >>
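For comparison, here is the same property without the stray whitespace inside the tags (Hadoop-style XML configuration takes the element text literally, so the trimmed form below is presumably what the parser expects; this fragment is an editorial sketch, not from the thread):

```xml
<property>
  <name>hama.messenger.queue.class</name>
  <value>org.apache.hama.bsp.message.DiskQueue</value>
</property>
```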
> >>
> >> 2012/9/14 Thomas Jungblut <thomas.jungblut@gmail.com>
> >>
> >> > Yes, I wanted to have direct memory in Hama months ago, but hadn't
> >> managed
> >> > to find enough time.
> >> > That is a very good idea.
> >> >
> >> > 2012/9/14 Tommaso Teofili <tommaso.teofili@gmail.com>
> >> >
> >> > > I think we may also create an Apache DirectMemory based DiskQueue
> >> which
> >> > > cache things on disk but hides most of the complexity.
> >> > > My 2 cents,
> >> > > Tommaso
> >> > >
> >> > > 2012/9/14 Thomas Jungblut <thomas.jungblut@gmail.com>
> >> > >
> >> > > > I have created an issue for that:
> >> > > > HAMA-642<https://issues.apache.org/jira/browse/HAMA-642>
> >> > > >
> >> > > > 2012/9/14 Thomas Jungblut <thomas.jungblut@gmail.com>
> >> > > >
> >> > > > > Basically I think that the graph should fit into the memory of
> >> > > > > your task, so the messages could be causing the overflow.
> >> > > > >
> >> > > > > You can try out the DiskQueue; it can be configured by setting
> >> > > > > the property "hama.messenger.queue.class" to
> >> > > > > "org.apache.hama.bsp.message.DiskQueue".
> >> > > > >
> >> > > > > This will immediately flush the messages to disk. However, this is
> >> > > > > currently experimental, so if you try it out please tell us if it
> >> > > > > helped.
> >> > > > >
> >> > > > > Thanks.
> >> > > > >
> >> > > > > To scale this further, we should write vertices that don't fit in
> >> > > > > memory to disk. I will add another JIRA for that soon.
> >> > > > >
> >> > > > > 2012/9/14 庄克琛 <zhuangkechen@gmail.com>
> >> > > > >
> >> > > > >> Oh, the HDFS block size is 128 MB, not 64 MB, so the 73 MB graph
> >> > > > >> will not be split on HDFS.
> >> > > > >>
> >> > > > >> 2012/9/14 庄克琛 <zhuangkechen@gmail.com>
> >> > > > >>
> >> > > > >> > Em... I have tried your configuration advice and restarted
> >> > > > >> > Hama. I used the Google web graph (
> >> > > > >> > http://wiki.apache.org/hama/WriteHamaGraphFile ),
> >> > > > >> > Nodes: 875713, Edges: 5105039, which is about 73 MB, uploaded
> >> > > > >> > it to a small HDFS cluster (block size is 64 MB), and tested
> >> > > > >> > the PageRank example as in (
> >> > > > >> > http://wiki.apache.org/hama/WriteHamaGraphFile ). I got this
> >> > > > >> > result:
> >> > > > >> > ################
> >> > > > >> > function@624-PC:~/hadoop-1.0.3/hama-0.6.0$ hama jar hama-6-P*
> >> > > > >> > input-google ouput-google
> >> > > > >> > 12/09/14 14:27:50 INFO bsp.FileInputFormat: Total input paths
> >> > > > >> > to process : 1
> >> > > > >> > 12/09/14 14:27:50 INFO bsp.FileInputFormat: Total # of splits: 3
> >> > > > >> > 12/09/14 14:27:50 INFO bsp.BSPJobClient: Running job:
> >> > > > >> > job_201008141420_0004
> >> > > > >> > 12/09/14 14:27:53 INFO bsp.BSPJobClient: Current supersteps
> >> > > > >> > number: 0
> >> > > > >> > Java HotSpot(TM) Server VM warning: Attempt to allocate stack
> >> > > > >> > guard pages failed.
> >> > > > >> > ###################
> >> > > > >> >
> >> > > > >> > Last time the superstep count could reach 1 or 2, then the
> >> > > > >> > same result. The task attempt****.err files are empty.
> >> > > > >> > Is the graph too large?
> >> > > > >> > When I test on a small graph, I get the right PageRank results.
> >> > > > >> >
> >> > > > >> >
> >> > > > >> > 2012/9/14 Edward J. Yoon <edwardyoon@apache.org>
> >> > > > >> >
> >> > > > >> >> I've added a multi-step partitioning method to save memory [1].
> >> > > > >> >>
> >> > > > >> >> Please try configuring the property below in hama-site.xml.
> >> > > > >> >>
> >> > > > >> >>   <property>
> >> > > > >> >>     <name>hama.graph.multi.step.partitioning.interval</name>
> >> > > > >> >>     <value>10000000</value>
> >> > > > >> >>   </property>
> >> > > > >> >>
> >> > > > >> >> 1. https://issues.apache.org/jira/browse/HAMA-599
> >> > > > >> >>
> >> > > > >> >> On Fri, Sep 14, 2012 at 3:13 PM, 庄克琛 <zhuangkechen@gmail.com> wrote:
> >> > > > >> >> > Hi, actually I used this (
> >> > > > >> >> > https://builds.apache.org/job/Hama-Nightly/672/artifact/.repository/org/apache/hama/hama-dist/0.6.0-SNAPSHOT/
> >> > > > >> >> > ) to test again. I mean I used this 0.6.0-SNAPSHOT version
> >> > > > >> >> > to replace everything, and got the same out-of-memory
> >> > > > >> >> > results. I just don't know what causes the out-of-memory
> >> > > > >> >> > failures; only some small graph computations can finish.
> >> > > > >> >> > Does this version include "HAMA-596
> >> > > > >> >> > <https://issues.apache.org/jira/browse/HAMA-596>: Optimize
> >> > > > >> >> > memory usage of graph job"?
> >> > > > >> >> > Thanks
> >> > > > >> >> >
> >> > > > >> >> > 2012/9/14 Thomas Jungblut <thomas.jungblut@gmail.com>
> >> > > > >> >> >
> >> > > > >> >> >> Hey, which jar exactly did you replace?
> >> > > > >> >> >> On 14.09.2012 07:49, "庄克琛" <zhuangkechen@gmail.com> wrote:
> >> > > > >> >> >>
> >> > > > >> >> >> > Hi, everyone:
> >> > > > >> >> >> > I am using Hama 0.5.0 with Hadoop 1.0.3 to do some
> >> > > > >> >> >> > large-graph analysis.
> >> > > > >> >> >> > When I test the PageRank example, as
> >> > > > >> >> >> > (http://wiki.apache.org/hama/WriteHamaGraphFile) shows,
> >> > > > >> >> >> > I download the graph data and run the PageRank job on a
> >> > > > >> >> >> > small distributed cluster, but I only get an out-of-memory
> >> > > > >> >> >> > failure: supersteps 0, 1, 2 work well, then the job fails
> >> > > > >> >> >> > with out of memory. (Each computer has 2 GB of memory.)
> >> > > > >> >> >> > But when I test some small graphs, everything goes well.
> >> > > > >> >> >> > I also tried the trunk version (
> >> > > > >> >> >> > https://builds.apache.org/job/Hama-Nightly/672/changes#detail3
> >> > > > >> >> >> > ), replacing my hama-0.5.0 with hama-0.6.0-snapshot, and
> >> > > > >> >> >> > only got the same results.
> >> > > > >> >> >> > Does anyone have better ideas?
> >> > > > >> >> >> >
> >> > > > >> >> >> > Thanks!
> >> > > > >> >> >> >
> >> > > > >> >> >> > --
> >> > > > >> >> >> >
> >> > > > >> >> >> > *Zhuang Kechen*
> >> > > > >> >> >> >
> >> > > > >> >> >>
> >> > > > >> >> >
> >> > > > >> >> >
> >> > > > >> >> >
> >> > > > >> >> > --
> >> > > > >> >> >
> >> > > > >> >> > *Zhuang Kechen*
> >> > > > >> >> >
> >> > > > >> >> > School of Computer Science & Technology
> >> > > > >> >> >
> >> > > > >> >> > Nanjing University of Science & Technology
> >> > > > >> >> >
> >> > > > >> >> > Lab.623, School of Computer Sci. & Tech.
> >> > > > >> >> >
> >> > > > >> >> > No.200, Xiaolingwei Street
> >> > > > >> >> >
> >> > > > >> >> > Nanjing, Jiangsu, 210094
> >> > > > >> >> >
> >> > > > >> >> > P.R. China
> >> > > > >> >> >
> >> > > > >> >> > Tel: 025-84315982
> >> > > > >> >> >
> >> > > > >> >> > Email: zhuangkechen@gmail.com
> >> > > > >> >>
> >> > > > >> >>
> >> > > > >> >>
> >> > > > >> >> --
> >> > > > >> >> Best Regards, Edward J. Yoon
> >> > > > >> >> @eddieyoon
> >> > > > >> >>
> >> > > > >> >
> >> > > > >> >
> >> > > > >> >
> >> > > > >> > --
> >> > > > >> >
> >> > > > >> > *Zhuang Kechen*
> >> > > > >> >
> >> > > > >> >
> >> > > > >> >
> >> > > > >>
> >> > > > >>
> >> > > > >> --
> >> > > > >>
> >> > > > >> *Zhuang Kechen*
> >> > > > >>
> >> > > > >> School of Computer Science & Technology
> >> > > > >>
> >> > > > >> Nanjing University of Science & Technology
> >> > > > >>
> >> > > > >> Lab.623, School of Computer Sci. & Tech.
> >> > > > >>
> >> > > > >> No.200, Xiaolingwei Street
> >> > > > >>
> >> > > > >> Nanjing, Jiangsu, 210094
> >> > > > >>
> >> > > > >> P.R. China
> >> > > > >>
> >> > > > >> Tel: 025-84315982
> >> > > > >>
> >> > > > >> Email: zhuangkechen@gmail.com
> >> > > > >>
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >>
> >> *Zhuang Kechen*
> >>
> >
> >
>



-- 

*Zhuang Kechen*
