hama-dev mailing list archives

From "Thomas Jungblut (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-420) Generate random data for Pagerank example
Date Tue, 16 Aug 2011 06:54:27 GMT

    [ https://issues.apache.org/jira/browse/HAMA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085570#comment-13085570 ]

Thomas Jungblut commented on HAMA-420:

Oh yes. That is the dataset Nutch is using, right? :D
Thanks ChiaHung!

> Generate random data for Pagerank example
> -----------------------------------------
>                 Key: HAMA-420
>                 URL: https://issues.apache.org/jira/browse/HAMA-420
>             Project: Hama
>          Issue Type: New Feature
>          Components: examples
>            Reporter: Thomas Jungblut
> As stated in a comment on Whirr's JIRA:
> https://issues.apache.org/jira/browse/WHIRR-355
> We should generate a big file (1-5 GB?) for the PageRank example. We wanted to add this as
> part of contrib, but we skipped/lost it somehow.
> I started crawling several pages, starting from Google News. But then my free Amazon
> EC2 quota expired and I had to stop the crawl.
> > We need some cloud to crawl
> > We need a place to make the data available
> The stuff we need is already coded here: 
> http://code.google.com/p/hama-shortest-paths/source/browse/#svn%2Ftrunk%2Fhama-gsoc%2Fsrc%2Fde%2Fjungblut%2Fcrawl
> Afterwards, a MapReduce processing job in the subpackage "processing" has to be run on the output
> of the crawler. This job ensures that the adjacency matrix is valid.
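
As a rough illustration of the idea in the quoted issue (random input instead of a crawl, followed by a validity pass), here is a minimal Python sketch. It is not the code from the linked repository. The function names, the tab-separated "vertex, then out-links" line format, and the definition of "valid" used here (no self-loops, no duplicate edges, and every referenced vertex has its own row) are all assumptions made for illustration only.

```python
import random


def generate_random_graph(num_vertices, max_out_degree, seed=0):
    """Build a random directed graph as {vertex: sorted list of out-links}.

    Hypothetical helper: each vertex gets between 0 and max_out_degree
    distinct out-links, never pointing at itself.
    """
    rng = random.Random(seed)
    limit = min(max_out_degree, num_vertices - 1)
    graph = {}
    for v in range(num_vertices):
        degree = rng.randint(0, limit) if limit > 0 else 0
        targets = set()
        while len(targets) < degree:
            u = rng.randrange(num_vertices)
            if u != v:
                targets.add(u)
        graph[v] = sorted(targets)
    return graph


def make_valid(adjacency):
    """Clean an adjacency list (assumed meaning of "valid", see lead-in).

    Drops self-loops and duplicate edges, then adds an empty row for every
    vertex that is linked to but has no row of its own.
    """
    fixed = {
        v: sorted({t for t in targets if t != v})
        for v, targets in adjacency.items()
    }
    for targets in list(fixed.values()):
        for t in targets:
            fixed.setdefault(t, [])
    return fixed


def to_lines(adjacency):
    """Render one tab-separated line per vertex: the vertex, then its out-links."""
    return [
        "\t".join(str(x) for x in [v, *adjacency[v]])
        for v in sorted(adjacency)
    ]


if __name__ == "__main__":
    graph = make_valid(generate_random_graph(num_vertices=10, max_out_degree=3))
    print("\n".join(to_lines(graph)))
```

For a file in the gigabyte range the same approach would need to stream lines to disk instead of holding the whole graph in memory; the sketch keeps everything in a dict only to stay short.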

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

