On Mon, Dec 20, 2010 at 9:43 PM, Peng, Wei <Wei.Peng@xerox.com> wrote:
> ...
> Currently, most of the matrix data (graph matrix, document-word matrix)
> that we are dealing with is sparse.
>
Good.
> The matrix decomposition often needs many iterations to converge, then
> intermediate results have to be saved to serve as the input for the next
> iteration.
>
I think you are thinking of the wrong algorithms. The ones that I am
talking about converge in a fixed and small number of steps. See
https://issues.apache.org/jira/browse/MAHOUT-376 for the work
in progress on this.
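For intuition, the fixed-pass flavor of decomposition can be sketched in a few
lines of NumPy. This is only an illustration of the randomized-projection idea,
not Mahout's actual implementation; the oversampling parameter and shapes here
are my own assumptions:

```python
import numpy as np

def randomized_svd(A, k, oversample=10):
    """Approximate rank-k SVD with a fixed, small number of passes over A.

    Unlike power-iteration methods, there is no convergence loop:
    one pass samples the range of A, a second projects A into it.
    """
    m, n = A.shape
    # A random projection captures the dominant column space of A.
    omega = np.random.randn(n, k + oversample)
    Y = A @ omega                        # pass 1 over A
    Q, _ = np.linalg.qr(Y)               # orthonormal basis for range(Y)
    B = Q.T @ A                          # pass 2: project A into the basis
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k, :]

# A genuinely low-rank matrix should be recovered almost exactly.
A = np.random.randn(200, 50) @ np.random.randn(50, 100)
U, s, Vt = randomized_svd(A, k=50)
err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

The point is that the pass count is fixed up front, so the per-job overhead of
Hadoop is paid a bounded number of times regardless of the data.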
> This is super inefficient. As I mentioned, the BFS algorithm written in
> Python took 1 second to run; Hadoop, however, took almost 3 hours. I
> would expect Hadoop to be slower, but not this slow.
>
I think you have a combination of factors here. But even accounting for
too many iterations in your BFS algorithm, iterations with stock Hadoop
take tens of seconds even if they do nothing. If you arrange your
computation to need many iterations, it will be slow.
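To see how the iteration count blows up, note that frontier-at-a-time BFS
needs one synchronous round per BFS level, and on Hadoop each round is a full
MapReduce job. A small counting sketch (plain Python, just to count rounds):

```python
def bfs_rounds(adj, source):
    """Count the level-synchronous rounds a frontier-at-a-time BFS needs.

    On Hadoop each round is a separate MapReduce job, so total time is
    roughly rounds * per_job_overhead even when each round does trivial work.
    """
    dist = {source: 0}
    frontier = [source]
    rounds = 0
    while frontier:
        rounds += 1                    # one MapReduce job per BFS level
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return rounds, dist

# A path graph of 100 nodes has diameter 99, so ~100 rounds are needed.
path = {i: [i + 1] for i in range(99)}
rounds, dist = bfs_rounds(path, 0)
```

At, say, 20–30 seconds of job startup per round, 100 rounds is the better part
of an hour before any real work is done, which is consistent with the timings
you are seeing.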
> All the Hadoop applications that I saw are very simple calculations,
> I wonder how it can be applied to machine learning/data mining
> algorithms.
>
Check out Mahout. There is a lot of machine learning going on there both on
Hadoop and using other scalable methods.
> Is HAMA the only way to solve it? If it is not ready to use yet, then
> can I assume hadoop is not a good solution for multiple iteration
> algorithms now?
>
I don't see much evidence that HAMA will ever solve anything, so I wouldn't
recommend pinning your hopes on that.
For fast, iterative mapreduce, you really need to keep your mappers and
reducers live between iterations. Check out
Twister for that: http://www.iterativemapreduce.org/
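The core idea that long-lived workers buy you can be sketched in a few lines.
This is illustrative only — it is not Twister's actual API — and `load_static`
stands in for reading a large input off HDFS:

```python
load_count = 0

def load_static():
    """Stand-in for reading a large static input (e.g. off HDFS)."""
    global load_count
    load_count += 1
    return list(range(10))

class LongLivedMapper:
    """Sketch of iterative mapreduce: the worker survives across iterations,
    so static data is read once rather than once per job."""

    def __init__(self):
        self.static = load_static()    # paid once, amortized over all iterations

    def map(self, state):
        # Each iteration reuses self.static instead of re-reading the input.
        return [x + state for x in self.static]

mapper = LongLivedMapper()
state = 0
for _ in range(50):                    # 50 iterations, one input load total
    state = sum(mapper.map(state)) % 97
```

Stock Hadoop tears the JVMs down after every job, so the equivalent of
`load_static` runs once per iteration; keeping the workers alive turns that
per-iteration cost into a one-time cost.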
