mahout-user mailing list archives

From Sean Owen <>
Subject Re: Algorithm scalability
Date Wed, 05 May 2010 16:30:28 GMT
The canonical algorithms, certainly all that I know of, compute
recommendations as a function of, generally, all the input data.
They're not inherently distributable, no.

I think all of them can be reimagined as a distributed process, with
enough care. The output remains a function of all the data, however.
The distributed form is slower, so computing a single recommendation
as a function of all the data, when the data is huge, by distributing
the computation is infeasibly slow.

So, generally you are looking at computing lots of recommendations at
once, perhaps all of them, in order to amortize the overhead. And when
you are doing all the work at once, the distributed process can
actually be pretty efficient.

For example, the co-occurrence-based distributed recommender is really
just a simplistic item-based recommender. You can see how much the
form and characteristics change in the translation.
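
To make the co-occurrence idea concrete, here is a minimal single-machine
sketch in Python. It is not the Mahout job itself, only an illustration of
the underlying computation: count how often pairs of items appear together
in users' preferences, then score unseen items for every user in one pass.
The names (cooccurrence_counts, recommend_all, prefs) and the toy data are
assumptions made up for this example.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_counts(prefs):
    """For each ordered item pair (a, b), count the users who have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts


def recommend_all(prefs, top_n=2):
    """Build the co-occurrence counts once, then recommend for every user."""
    counts = cooccurrence_counts(prefs)
    recs = {}
    for user, items in prefs.items():
        scores = defaultdict(int)
        for seen in items:
            for other, count in counts[seen].items():
                if other not in items:  # only score items the user lacks
                    scores[other] += count
        # Highest score first; ties broken by item name for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        recs[user] = [item for item, _ in ranked[:top_n]]
    return recs


# Hypothetical toy data: user -> set of items the user has expressed interest in.
prefs = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C", "D"},
}

if __name__ == "__main__":
    print(recommend_all(prefs))
```

Note that the counts are built once from all the data and then reused for
every user, which is where the amortization comes from. In a distributed
version, both the counting and the scoring would presumably be split across
workers, but each user's output would still depend on the whole data set.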

On Wed, May 5, 2010 at 3:59 PM, First Qaxy <> wrote:
> Out of curiosity - sorry if this has been answered before - would it be possible to
> combine the two approaches, so that you break the data set into batches that fit in memory,
> use a non-distributed algorithm to produce results for each batch, and then use Hadoop
> to merge the results in a sensible way? This would improve performance while scaling (this
> is different from the pseudo approach, where you simply distribute the work on the same model).
> I didn't give it much thought, but I think this might work in some limited cases.
