incubator-hama-dev mailing list archives

From "Thomas Jungblut (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-420) Generate random data for Pagerank example
Date Thu, 18 Aug 2011 20:33:28 GMT

    [ https://issues.apache.org/jira/browse/HAMA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087265#comment-13087265 ]

Thomas Jungblut commented on HAMA-420:
--------------------------------------

This file drove me nuts.

The XML was completely malformed and contained characters that are not allowed; even
Nutch's DMOZ parser failed on it.

However, I extracted the URLs via regex.
That resulted in 7,133,283 vertices, which is quite cool. But there were lots of duplicate
hosts, so I decided to deduplicate them, resulting in roughly 2,440,000 vertices, which is
enough.
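
For reference, the extraction could look roughly like this (a minimal sketch; the regex
and the dedup-by-host details are assumptions, not the exact code I used):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.net.URI;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical sketch: pull URLs out of the broken XML with a regex
    // and keep only the first URL seen per host.
    public class UrlExtractor {
      // naive pattern; the real expression would need more care
      private static final Pattern URL = Pattern.compile("https?://[^\\s\"<>]+");

      public static void main(String[] args) throws Exception {
        Set<String> seenHosts = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
          Matcher m = URL.matcher(line);
          while (m.find()) {
            String url = m.group();
            try {
              String host = URI.create(url).getHost();
              if (host != null && seenHosts.add(host)) {
                System.out.println(url); // first URL for this host wins
              }
            } catch (IllegalArgumentException e) {
              // skip URLs the URI parser rejects
            }
          }
        }
        in.close();
      }
    }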

Final statistics:

NumOfVertices: 2,442,507
EdgeCounter: 32,282,149
Size: ~680 MB (682,624,440 bytes)

The partitioning takes quite a bit of time (way too long), but I'm working on an MR job
that should parallelize this task.
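
Roughly, the idea is to hash each vertex line to a partition in the map phase. A sketch
only; the partition count and the tab-separated line format are assumptions:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical sketch: spread adjacency-list lines over n partitions by
    // hashing the vertex id (assumed to be the first tab-separated token).
    public class PartitionMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {
      private static final int NUM_PARTITIONS = 16; // assumption

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String vertex = line.toString().split("\t", 2)[0];
        int partition = (vertex.hashCode() & Integer.MAX_VALUE) % NUM_PARTITIONS;
        ctx.write(new IntWritable(partition), line);
      }
    }

One reducer per partition key would then simply write out its group of lines.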

Funny failure of the evening:
I forgot to write newlines while creating the file, so the partitioner filled up memory
because BufferedReader.readLine() never returned :D Time to go to bed. :)
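
For anyone hitting the same thing: readLine() only returns once it sees a line terminator
(or EOF), so the writer has to emit one vertex per line. A minimal sketch of the fix (file
name and line format are assumptions, not the actual generator code):

    import java.io.BufferedWriter;
    import java.io.FileWriter;

    public class GraphWriter {
      public static void main(String[] args) throws Exception {
        BufferedWriter writer = new BufferedWriter(new FileWriter("graph.txt"));
        // one vertex plus its adjacency list per line
        writer.write("vertex1\tvertex2:vertex3");
        writer.newLine(); // the '\n' I forgot; without it readLine() never returns
        writer.close();
      }
    }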

Does anyone want to host this file?
Otherwise I'm going to put it up on the Google Code repository along with the list for the
SSSP example.

I have not tested it yet, so the file may still change later.

> Generate random data for Pagerank example
> -----------------------------------------
>
>                 Key: HAMA-420
>                 URL: https://issues.apache.org/jira/browse/HAMA-420
>             Project: Hama
>          Issue Type: New Feature
>          Components: examples
>            Reporter: Thomas Jungblut
>
> As stated in a comment on Whirr's JIRA:
> https://issues.apache.org/jira/browse/WHIRR-355
> We should generate a big file (1-5 GB?) for the PageRank example. We wanted to add this
> as part of the contrib, but we skipped/lost it somehow.
> I started crawling several pages, starting from Google News. But then my free Amazon
> EC2 quota expired and I had to stop the crawl.
> > We need some cloud to crawl
> > We need a place to make the data available
> The stuff we need is already coded here: 
> http://code.google.com/p/hama-shortest-paths/source/browse/#svn%2Ftrunk%2Fhama-gsoc%2Fsrc%2Fde%2Fjungblut%2Fcrawl
> Afterwards an MR processing job in the subpackage "processing" has to be run on the
> output of the crawler. This job ensures that the adjacency matrix is valid.
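
(For context, a minimal sketch of what such a random-data generator could look like; the
file name, out-degree, and line format are assumptions:)

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.util.Random;

    // Hypothetical sketch: write a random adjacency list for the PageRank
    // example, one "vertex<TAB>neighbour<TAB>neighbour..." line per vertex.
    public class RandomGraphGenerator {
      public static void main(String[] args) throws Exception {
        int numVertices = 1000000; // tune for the target file size
        int maxOutDegree = 20;     // assumption
        Random rand = new Random();
        BufferedWriter out = new BufferedWriter(new FileWriter("pagerank-input.txt"));
        for (int v = 0; v < numVertices; v++) {
          StringBuilder line = new StringBuilder().append(v);
          int degree = rand.nextInt(maxOutDegree) + 1;
          for (int e = 0; e < degree; e++) {
            line.append('\t').append(rand.nextInt(numVertices));
          }
          out.write(line.toString());
          out.newLine();
        }
        out.close();
      }
    }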

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

