mahout-user mailing list archives

From Ted Dunning <>
Subject Re: TU Berlin Winter of Code Project
Date Tue, 10 Nov 2009 19:15:22 GMT

Also, it is very nice to be able to run the crawler again to get some new
content and then run the following stages only on the changed documents.  With
something like HBase, it is pretty easy to run only on documents that have a
"needs-work" flag set.  With a streaming approach, you have to segment all
of your files into incremental tranches and construct the job graph
on the fly according to which inputs have appeared.  It can work, but the
job graph becomes prodigiously large after a bit.  The complementary problem
with the repository approach is the proliferation and complexity of the
state indicators, especially since you want to avoid scanning the
entire repository by exploiting its column-oriented layout.  That means
you generally can't do a scan based on a last-update field; instead you
need to encode your dependencies by setting one of many flags in the code.
That, in turn, means that the work-flow is encoded in your programs rather
than outside them in the framework.
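
To make the flag idea concrete, here is a minimal sketch in Python.  It is
only an illustration under loud assumptions: the Repository class is an
in-memory stand-in rather than the HBase API, and the flag names
("needs-parse", "needs-index") and the two stages are hypothetical.  Each
flag is kept as its own set of document ids, roughly the way a sparse column
would be, so a stage can find its work without scanning every document.

```python
# Toy in-memory stand-in for a column-oriented repository such as HBase.
# Every name here (Repository, the flag names, the stages) is hypothetical.
from collections import defaultdict


class Repository:
    def __init__(self):
        self.docs = {}                 # doc_id -> content
        self.flags = defaultdict(set)  # flag name -> ids of docs with that flag

    def put(self, doc_id, content):
        """Store new or changed content and mark it for the first stage."""
        self.docs[doc_id] = content
        self.flags["needs-parse"].add(doc_id)

    def flagged(self, flag):
        """Return only the flagged ids; the full document set is never scanned."""
        return sorted(self.flags[flag])


def run_parse(repo):
    """Parse stage: normalize flagged documents, then hand them to indexing."""
    done = []
    for doc_id in repo.flagged("needs-parse"):
        repo.docs[doc_id] = repo.docs[doc_id].strip().lower()
        repo.flags["needs-parse"].discard(doc_id)
        # The dependency on the next stage is hard-coded in the program,
        # which is exactly the drawback described above.
        repo.flags["needs-index"].add(doc_id)
        done.append(doc_id)
    return done


def run_index(repo):
    """Index stage: process whatever the parse stage flagged, then clear it."""
    done = []
    for doc_id in repo.flagged("needs-index"):
        repo.flags["needs-index"].discard(doc_id)
        done.append(doc_id)
    return done
```

After a first crawl that stores documents "a" and "b", both stages process
both documents.  If a later crawl stores only a changed "b", run_parse and
run_index each touch "b" alone.  Note that the ordering of the stages lives
in the line that sets "needs-index", not in any external work-flow
description.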

On Tue, Nov 10, 2009 at 3:17 AM, Isabel Drost <> wrote:

> On Fri Ted Dunning <> wrote:
> > The question that I don't see addressed is whether you choose to use
> > a fully streaming approach as is done in Bixo or whether you will use
> > a document repository approach as is more common in most search
> > engines.
> I guess even when using a streaming approach a repository for temporary
> results is necessary to decouple those stages that are expensive and
> hard to reproduce. E.g. crawling to HBase and reading the results from
> there for further processing should prevent failures in post processing
> resulting in having to rerun the crawl. Most likely there are more of
> these points further down the processing chain as well.
> Isabel
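
Isabel's decoupling point can be sketched the same way.  This is a
hypothetical illustration, not anyone's actual pipeline: a plain dict stands
in for the HBase table, and the fetch and transform callables are
placeholders supplied by the caller.  The crawl persists each page as soon as
it is fetched, so a failure in post-processing only requires rerunning the
post-processing.

```python
# Hypothetical sketch: a dict stands in for the repository of crawl results.

def crawl(urls, fetch, store):
    """Fetch only URLs that are not yet stored; persist each result at once."""
    for url in urls:
        if url not in store:
            store[url] = fetch(url)


def post_process(store, transform):
    """Read crawl results back from the store and derive new values.

    If transform raises, the store is untouched and no page is re-fetched
    when the stage is run again.
    """
    return {url: transform(body) for url, body in store.items()}
```

Calling crawl a second time with the same URLs fetches nothing, because
every page is already in the store.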

Ted Dunning, CTO
