mahout-user mailing list archives

From Isabel Drost
Subject Re: TU Berlin Winter of Code Project
Date Tue, 10 Nov 2009 11:17:41 GMT
On Fri, Ted Dunning wrote:

> The question that I don't see addressed is whether you choose to use
> a fully streaming approach as is done in Bixo or whether you will use
> a document repository approach as is more common in most search
> engines.

I guess that even with a streaming approach, a repository for
intermediate results is necessary to decouple those stages that are
expensive and hard to reproduce. For example, crawling into HBase and
reading the results from there for further processing means that a
failure in post-processing does not force a rerun of the crawl. Most
likely there are more such points further down the processing chain as
well.
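
To make the decoupling idea concrete, here is a rough sketch in Python.
It is not Bixo or HBase code: the Repository class is a hypothetical
in-memory stand-in for a persistent store such as an HBase table, and
crawl, post_process and fake_fetch are illustrative names only. The
point is just that the expensive stage writes to the store once, and
the cheap stage can fail and be retried without fetching anything again.

```python
class Repository:
    """Hypothetical stand-in for a persistent store (e.g. an HBase table)."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def contains(self, key):
        return key in self._rows

    def scan(self):
        return list(self._rows.items())


def crawl(urls, fetch, repo):
    """Expensive stage: fetch each URL once and store the result."""
    for url in urls:
        if not repo.contains(url):  # skip pages that are already stored
            repo.put(url, fetch(url))


def post_process(repo, transform):
    """Cheap stage: reads only from the repository, never from the web."""
    return {url: transform(body) for url, body in repo.scan()}


# Record every fetch so we can see how often the crawl really ran.
fetch_calls = []


def fake_fetch(url):
    # Assumption: a real crawler would do an HTTP request here.
    fetch_calls.append(url)
    return "<html>" + url + "</html>"


def broken_transform(body):
    raise RuntimeError("simulated post-processing failure")


repo = Repository()
urls = ["http://example.org/a", "http://example.org/b"]
crawl(urls, fake_fetch, repo)

# First post-processing attempt fails ...
try:
    post_process(repo, broken_transform)
    failed = False
except RuntimeError:
    failed = True

# ... and the retry reads from the repository instead of crawling again.
result = post_process(repo, len)

# Re-running the crawl stage is also harmless: nothing is fetched twice.
crawl(urls, fake_fetch, repo)
```

With a real store the same pattern would apply at any later stage whose
output is costly to recompute.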

