mahout-dev mailing list archives

From Neal Clark <>
Subject Re: Distributed Graph Algorithms
Date Tue, 15 Jun 2010 11:25:49 GMT
I just wanted to provide an update. I have rewritten the connectivity
algorithm to use a sorting/streaming model. This approach eliminates the
memory bottlenecks I was running into with my previous approach.

I just finished running the new algorithm on the twitter dataset.

Nodes: 160
MapJobs: 640

Iterations: 6
Total time: 63m 32s
Average time per iteration: 10m 20s

It appears that the twitter graph is connected; however, I will need to test
my algorithm more thoroughly before making that claim. ;)

I have also uploaded copies of the hadoop job-tracker pages.

There is still a lot of testing and performance tuning to be done, but I will
try to keep you updated on any progress.

To prevent map input files from being split, is it sufficient to extend
FileInputFormat and override the isSplitable() method? For the algorithm to
run correctly, each mapper must receive a consecutive block from the previous
reduce phase containing all edges for a particular vertex. That is, I need to
ensure that vertices are not split across multiple mappers.
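For reference, a minimal sketch of the override in question, assuming the newer org.apache.hadoop.mapreduce API (the class name here is hypothetical):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input format that never splits its files, so each
// reduce-output file from the previous iteration is consumed by
// exactly one mapper.
public class UnsplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // one file == one map task
    }
}
```

Note this only keeps a file whole; it still relies on the previous reduce phase having partitioned all edges for a vertex into the same output file.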



On Fri, Jun 11, 2010 at 9:23 AM, Jeff Eastman <> wrote:

> On 6/10/10 6:22 PM, Ted Dunning wrote:
>> I think I understand the problem, but I haven't been reading super
>> carefully as you describe your algorithm.
>>
>> The basic idea, though, is pretty simple.  You are producing higher
>> and higher powers of the adjacency matrix while labeling each
>> connected component with its lowest node id.
>>
>> The algorithm as you described it sounds like you are keeping around
>> the entire matrix with appropriate labeling.  An alternative is to
>> reduce the matrix each time you discover that a connected component
>> has merged with another.  That would mean that the graph you are
>> processing would decrease in size on each iteration, terminating
>> when it has a single node for each connected sub-graph.  The problem
>> with that approach is remembering the labeling of each node in the
>> original graph.  One way to do that fairly efficiently, without an
>> explosion in the amount of data being carried around, would be to
>> create a set of side files at each step that contain the mapping
>> from nodes in one generation to their labels in the next generation.
>> If you have k steps of reduction (k \le log_2 N, where N is the size
>> of the graph, I think), then you would have k side mapping files.
>> After you converge in the merging process, you can run k joins to
>> reverse the process and propagate the labels from the last
>> generation back to the first.  The first set of k map-reduces should
>> run progressively faster as the graph collapses, and the second set
>> of k map-reduces, in which you do the joins, should run
>> progressively slower as the number of labels being processed
>> increases.
> I've been reading this thread with interest as it may offer some
> scalability improvements to the current MeanShiftCanopy clustering
> implementation. That code currently runs a sequence of
> Canopy-clustering-like iterations to build, bottom-up, a graph of clusters
> which contain the identities of the points they have subsumed as their
> centers shift towards their local centers of maximum density and they merge
> together. These clusters can become very large objects as the list of
> subsumed ids grows and this becomes an ultimate scalability limitation. I've
> been stewing about an approach similar to what is above to write out the
> mergers in each iteration to a side-mapping file. It would run pretty much
> like the last two sentences above.
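The min-label idea in the quoted message can be sketched in plain, in-memory Java (this is just the labeling rule itself, not the map-reduce code; the matrix contraction and side-file/join machinery are omitted, and all names here are illustrative):

```java
import java.util.*;

// Iterative min-label propagation: each round, every node adopts the
// smallest label among itself and its neighbors. At the fixed point,
// every connected component carries the id of its smallest node. Each
// round corresponds loosely to one map-reduce iteration over the
// edge list.
public class ComponentLabeler {

    // adj must list each undirected edge in both directions.
    public static Map<Integer, Integer> label(Map<Integer, List<Integer>> adj) {
        Map<Integer, Integer> label = new HashMap<Integer, Integer>();
        for (Integer v : adj.keySet()) {
            label.put(v, v);                       // every node starts as its own label
        }
        boolean changed = true;
        while (changed) {                          // iterate until no label moves
            changed = false;
            for (Map.Entry<Integer, List<Integer>> e : adj.entrySet()) {
                int best = label.get(e.getKey());
                for (Integer n : e.getValue()) {
                    best = Math.min(best, label.get(n));
                }
                if (best < label.get(e.getKey())) {
                    label.put(e.getKey(), best);   // adopt the smaller neighbor label
                    changed = true;
                }
            }
        }
        return label;
    }
}
```

Because the loop runs to a fixed point, the result does not depend on iteration order; the round count is bounded by the graph diameter.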
