hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Devajyoti Sarkar" <dsar...@q-kk.com>
Subject Re: Page Ranking, Hadoop And MPI.
Date Wed, 16 Apr 2008 04:46:35 GMT
Hi Ted,

I am really interested in learning more about the symmetry in power
law/fractal graphs in general and how matrix math (SVD, etc.) allows us to
detect/exploit it. Would you have any recommendations for books or papers
that I could read which explores this question. I remember doing a standard
Linear Algebra course way back in college but it never discussed its links
to graphs/symmetry...

Thanks and best regards,

On Wed, Apr 16, 2008 at 2:33 AM, Ted Dunning <tdunning@veoh.com> wrote:

> Power law algorithms are ideal for this kind of parallelized problem.
> The basic idea is that hub and authority style algorithms are intimately
> related to eigenvector or singular value decompositions (depending on
> whether the links are symmetrical).  This also means that there is a close
> relationship to asymptotic beahavior of random walks on the graph.
> If you represent the linkage in the web by a matrix that has columns
> representing source page and rows representing the target page and with a
> 1
> where-ever the source page has a link pointing to the target page, then if
> you start with a vector with a single non-zero element equal to 1 as a
> representation of a single page, then multiplying by the linkage matrix
> will
> give you a vector with 1 in the positions corresponding to the pages the
> original page linked to.  If you multiply again, you get all the pages
> that
> you can get to in two steps from the original page.
> Mathematically, if we call the original vector x and the linkage matrix A,
> the pages that x links to are just Ax.   The pages that are two steps from
> x
> are A(Ax) = A^2 x.
> The eigenvector decomposition of A is just a way of writing A as a product
> of three matrices:
>    A = U S U'
> U' is the transpose of U, and U has the special property that U'U = I (it
> is
> called ortho-normal because of this).
> S is a diagonal matrix.
> There is lots of deep mathematical machinery and beautiful symmetry
> available here, but for now we can just take this as given.
> The pages n steps from x are
>   x_n = A^n x = (U S U')^n x = (U S U')^n-2 (U S U') (U S U') x
>       = (U S U')^(n-2) (U S (U'U) S U') x = (U S U')^(n-2) (U S^2 U') x
>       = U S^n U' x
> This is really cool because S^n can be computed by just taking each
> diagonal
> element and raising it to a power.
> Eigenvector decompositions have other, really deep connections.  For
> instance,  if you take the elements of S (call the i-th one s_i) then
>    sum_i s_i^n
> Is the number of paths that are n steps long.
> Connected (or nearly connected) clusters of pages can also be derived from
> the eigenvector decomposition.  This is the basis of so-called spectral
> clustering.  For some very impressive examples of spectral clustering see
> http://citeseer.ist.psu.edu/ng01spectral.html
> So eigenvectors are cool.  But how can we compute them?
> First, note that if A^n = U S^n U' and if some of the s_i are bigger than
> others, the big ones will quickly dominate the others.  That is pretty
> quickly, A^n \approx u_1 s_1^n u_1'.  This means that we can compute an
> approximation of u_1 by just doing A^n x where x is some random vector.
> Moreover, we can compute u_2 by starting with a different random vector
> and
> iterating the same way, but with an additional step where we forbid the
> result from going towards u_1.  With just a few additional wrinkles, this
> gives us what is called the Lanczos algorithm.  Golub and van Loan's
> excellent book Matrix Computations gives a lot of information on these
> algorithms.
> The cool thing here is that our random vector can represent a single page
> and we can approximate the final result by following links.  Following
> links
> is just a (human-readable) way of saying sparse matrix multiplication.  If
> we do this multiplication against lots of different random starting
> points,
> we can quickly build parallel algorithms to compute things like page rank.
> I hope this helps.  I know it is a big byte to take at one time.
> On 4/15/08 9:31 AM, "Chaman Singh Verma" <csv610@yahoo.com> wrote:
> > Hello,
> >
> > After googling for many days, I couldn't get one answer from many of the
> > published reports on Ranking algorithm done by Google. Since Google uses
> > GFS for fault tolerance purposes, what communication libraries they
> might be
> > using to solve such a large matrix ? I presume that standard Message
> Passing
> > (MPI) may not be suitable for this purpose(Fault tolerance issue)
> >
> > Suppose we implement ranking algorithm on top of Hadoop, what could be
> > the best way/best distributed algorithm/library etc ?
> >
> > With regards.
> >
> > Chaman Singh Verma
> > Poona, India
> >
> >
> >
> >
> >  between 0000-00-00 and 9999-99-99

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message