mahout-user mailing list archives

From Max Heimel <>
Subject Re: TU Berlin Winter of Code Project
Date Sat, 07 Nov 2009 20:18:55 GMT
Hi Ted,

we don't plan on using a streaming approach: each layer has to finish
its work completely before the next layer can start processing. The
data transfer between layers happens via HDFS or - as you mentioned -
HBase. We were planning on using HBase at least for storing the
initial crawl (using the HBaseWriter plugin for Heritrix), but we have
to see whether/where HBase fits in during the later stages. I must
admit that I hadn't heard of Bixo yet, so I will have to take a look
at their architecture.
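To make the layered idea concrete, here is a minimal sketch in plain Java of what we mean: each layer runs to completion over the full output of the previous layer before the next one starts, and layers only communicate through a shared store (an in-memory list here stands in for HDFS/HBase). All class and method names below are illustrative, not actual project code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the layered batch architecture: layers are applied strictly
// in sequence, and every layer sees the complete output of its predecessor.
public class LayeredPipeline {

    // Runs the layers one after another; the List<String> passed between
    // them plays the role that HDFS files (or HBase tables) would play
    // in the real pipeline.
    public static List<String> runLayers(
            List<String> initialCrawl,
            List<Function<List<String>, List<String>>> layers) {
        List<String> current = initialCrawl;   // e.g. the stored Heritrix crawl
        for (Function<List<String>, List<String>> layer : layers) {
            current = layer.apply(current);    // layer finishes completely...
        }                                      // ...before the next one starts
        return current;
    }

    public static void main(String[] args) {
        List<Function<List<String>, List<String>>> layers = new ArrayList<>();
        // Layer 1: normalize raw pages (placeholder for real parsing/cleanup).
        layers.add(docs -> docs.stream().map(String::toLowerCase).toList());
        // Layer 2: keep only documents mentioning "mahout" (placeholder filter).
        layers.add(docs -> docs.stream().filter(d -> d.contains("mahout")).toList());
        System.out.println(runLayers(List.of("Apache MAHOUT", "Other Page"), layers));
    }
}
```

In the real system each `Function` would be a full MapReduce job and `current` would be an HDFS path, but the control flow - strictly sequential, no streaming between layers - is the same.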

As for the named entity/feature extraction: yes, this is probably
going to be one of the most challenging problems of the project. We
are currently looking into several research papers on the topic and
have just started discussing which method looks the most promising to
us. Now, we obviously aren't experts in this area (after all, we're
doing this project to learn about parallel machine learning), so we
will probably try to include you guys in the discussion as soon as we
have an initial proposal figured out :)


On Fri, Nov 6, 2009 at 8:57 PM, Ted Dunning <> wrote:
> The question that I don't see addressed is whether you choose to use a fully
> streaming approach as is done in Bixo or whether you will use a document
> repository approach as is more common in most search engines.
> Hbase is reputedly ready enough to serve as a document repository.  Using
> such an approach would be very helpful for the incremental nature of web
> crawls.
> What is the plan in this regard?
> On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <>wrote:
>> This is obviously only a first draft of what we think would be a suited
>> overall
>> architecture
> --
> Ted Dunning, CTO
> DeepDyve
