The problem is not so much duplicate edges as the fact that, as the algorithm
progresses, the number of edges adjacent to the minimum vertices increases. In
the worst case the graph is completely connected and has only a single
component. So far I have not found an efficient way to partition the edges of
a single vertex while still passing the minimum vertex on to each partition.
The Twitter dataset has approximately 44 million vertices. Even with a graph
this large, the time needed to process 44 million edges isn't too long.
One advantage of the sorting approach is that duplicate edges become trivial
to detect.
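The idea is that once the edge list is sorted, any duplicates sit next to each
other, so a single linear scan removes them. A minimal sketch in plain Java
(outside MapReduce, with a hypothetical Edge record standing in for whatever
edge representation the job actually uses):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DedupEdges {
    // Hypothetical undirected edge as an ordered pair of vertex ids.
    record Edge(long u, long v) {}

    // Sort edges, then keep only the first of each run of equal edges.
    static List<Edge> dedup(List<Edge> edges) {
        edges.sort(Comparator.comparingLong(Edge::u)
                             .thenComparingLong(Edge::v));
        List<Edge> out = new ArrayList<>();
        Edge prev = null;
        for (Edge e : edges) {
            if (!e.equals(prev)) {  // duplicates are adjacent after sorting
                out.add(e);
            }
            prev = e;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Edge> edges = new ArrayList<>(List.of(
            new Edge(1, 2), new Edge(2, 3), new Edge(1, 2), new Edge(0, 1)));
        System.out.println(dedup(edges).size()); // 3 unique edges remain
    }
}
```

In the MapReduce version the sort comes for free from the shuffle, so the
reducer only needs the linear scan over each key's values.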
Once we are able to process the twitter dataset we can work on further
optimizations.
Thanks,
Neal.
On Thu, Jun 10, 2010 at 5:31 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Would a combiner help?
>
> On Thu, Jun 10, 2010 at 5:27 PM, Neal Clark <nclark@uvic.ca> wrote:
>
> > Using this approach it should be possible to do the bulk of the
> > algorithm in the Mapper and then use the SecondarySort Reducer to sort
> the
> > output.
> >
>
