From: Max Heimel <mheimel@googlemail.com>
To: mahout-user@lucene.apache.org
Date: Fri, 6 Nov 2009 14:06:50 +0100
Subject: TU Berlin Winter of Code Project

Hello everybody,

we are a group of six master's students at the Technical University of Berlin who are currently working on a winter term project using Mahout. Our so-called "Winter of Code" project is mentored by Isabel Drost and will run until February 2010. The goal of the project is to develop a cloud-based blog search engine - think "Google News for beginners" ;). The engine should be highly scalable and use Hadoop/Mahout to perform topical clustering and topic discovery on crawled blog entries. Based on suggestions by Isabel, we are currently thinking of the following layered architecture:

I. Layer: Web crawling
A web crawler (e.g. Heritrix) is seeded with a set of known blog URLs and performs the web crawls. Heritrix is configured with a simple text filter so that it only crawls URLs containing the word "blog" from a prespecified TLD (so we "know" which language the blog entries are written in). We plan to write the crawl data directly to HDFS (e.g. via hbase-writer).

II. Layer: Preprocessing
The crawled data is probably not structured enough to be directly processable by a machine, so it has to be preprocessed. This step could, for example, consist of extracting the blog full text from the crawl, stemming it, and finding and tagging named entities. We are currently thinking of using UIMA for this layer.

III. Layer: Feature extraction
In order to use clustering algorithms, we need to perform feature extraction. This could, for example, consist of generating feature vectors, a similarity matrix, a link graph, etc. The goal of this layer is a representation of the web crawl that Mahout can process. The feature extraction will likely be implemented as a custom-written Hadoop job.

IV. Layer: Clustering
This step uses an existing Mahout clustering algorithm (or a newly implemented one) to cluster the blogs based on the extracted features. For now we will probably start with a very simple k-means clustering of word frequencies and switch to a more sophisticated approach once the basic infrastructure is sound :)

V. Layer: Topic Discovery
Once the blog entries are clustered, each cluster needs to be assigned a topic. This topic should be determined automatically from the blog entries inside the cluster. Again, for now we will probably use a very simple approach, e.g. taking the most frequent words inside the cluster (or near its center) as topic tags.

VI. Layer: Search Engine
In order to search for blogs, the tagged cluster centers and topics provided by Mahout need to be recombined with the information from the blog crawl. This recombined data is then fed into a search engine so that users can search for a specific entry. We will probably use Solr for this step, tagging each blog entry with its respective cluster topic tag(s) and creating a search index on those tags.

VII. Layer: User Front-End
This will probably be a simple web page that sends requests to the Search Engine layer and presents the results to the user.

Very rough code sketches of what layers III to VI could look like follow below.
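For layer III, here is a minimal sketch of the kind of map-only Hadoop job we have in mind. It assumes each preprocessed blog entry arrives as one text line of the form "<blogId><TAB><preprocessed text>"; the class name BlogTermFrequency, the input layout, and the "term:count" output encoding are just placeholders for illustration.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical feature-extraction job: one input line per blog entry,
// "<blogId>\t<preprocessed text>"; one output record per entry holding its
// sparse term-frequency vector as "term:count" pairs.
public class BlogTermFrequency {

  public static class TfMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length < 2) {
        return; // skip malformed records
      }
      Map<String, Integer> counts = new HashMap<String, Integer>();
      for (String token : parts[1].toLowerCase().split("\\W+")) {
        if (token.length() > 2) { // crude short-token filter
          Integer old = counts.get(token);
          counts.put(token, old == null ? 1 : old + 1);
        }
      }
      StringBuilder vector = new StringBuilder();
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        vector.append(e.getKey()).append(':').append(e.getValue()).append(' ');
      }
      // key = blog id, value = sparse term-frequency vector
      context.write(new Text(parts[0]), new Text(vector.toString().trim()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "blog term frequencies");
    job.setJarByClass(BlogTermFrequency.class);
    job.setMapperClass(TfMapper.class);
    job.setNumReduceTasks(0); // map-only: one vector per blog entry
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}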
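For layer IV we would of course use Mahout's distributed k-means rather than write our own. Purely to illustrate the "k-means over word-frequency vectors" idea, here is a stripped-down in-memory version (this is not the Mahout API):

import java.util.Random;

// Purely illustrative in-memory k-means over dense word-frequency vectors.
// The actual system would use Mahout's distributed k-means implementation.
public class SimpleKMeans {

  // points: one word-frequency vector per blog entry; returns the cluster
  // index assigned to each entry after the given number of iterations.
  public static int[] cluster(double[][] points, int k, int iterations) {
    Random random = new Random(42);
    double[][] centroids = new double[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = points[random.nextInt(points.length)].clone();
    }
    int[] assignment = new int[points.length];
    for (int iter = 0; iter < iterations; iter++) {
      // Assignment step: each point goes to its nearest centroid.
      for (int p = 0; p < points.length; p++) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double dist = 0;
          for (int d = 0; d < points[p].length; d++) {
            double diff = points[p][d] - centroids[c][d];
            dist += diff * diff;
          }
          if (dist < bestDist) {
            bestDist = dist;
            best = c;
          }
        }
        assignment[p] = best;
      }
      // Update step: each centroid becomes the mean of its cluster.
      double[][] sums = new double[k][points[0].length];
      int[] counts = new int[k];
      for (int p = 0; p < points.length; p++) {
        counts[assignment[p]]++;
        for (int d = 0; d < points[p].length; d++) {
          sums[assignment[p]][d] += points[p][d];
        }
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          for (int d = 0; d < sums[c].length; d++) {
            centroids[c][d] = sums[c][d] / counts[c];
          }
        }
      }
    }
    return assignment;
  }
}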
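For layer V, the "most frequent words as topic tags" idea could look roughly like this; class and method names are placeholders, and the length check stands in for a proper stop-word list.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative topic labelling: take the n most frequent terms across all
// entries of a cluster and use them as the cluster's topic tags.
public class FrequentTermLabeler {

  public static List<String> topicTags(List<String> clusterTexts, int n) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String text : clusterTexts) {
      for (String token : text.toLowerCase().split("\\W+")) {
        if (token.length() > 3) { // crude filter against stop words
          Integer old = counts.get(token);
          counts.put(token, old == null ? 1 : old + 1);
        }
      }
    }
    List<Map.Entry<String, Integer>> entries =
        new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
      public int compare(Map.Entry<String, Integer> a,
                         Map.Entry<String, Integer> b) {
        return b.getValue().compareTo(a.getValue()); // most frequent first
      }
    });
    List<String> tags = new ArrayList<String>();
    for (int i = 0; i < Math.min(n, entries.size()); i++) {
      tags.add(entries.get(i).getKey());
    }
    return tags;
  }
}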
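For layer VI, a sketch of feeding one tagged blog entry into Solr via SolrJ. The field names (id, url, text, topic) are placeholders that would have to be defined in the Solr schema, and the URL assumes a local Solr instance; in the real pipeline the documents would of course come from the crawl plus the cluster assignments, not from hard-coded strings.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Illustrative indexing step: each blog entry is stored together with the
// topic tags of the cluster it was assigned to in layer V.
public class BlogIndexer {

  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "blog-entry-42");
    doc.addField("url", "http://example-blog.org/entry-42");
    doc.addField("text", "full text of the preprocessed blog entry ...");
    doc.addField("topic", "election");   // cluster topic tag from layer V
    doc.addField("topic", "parliament"); // multi-valued field

    server.add(doc);
    server.commit();
  }
}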
This is obviously only a first draft of what we think would be a suitable overall architecture, so there is probably lots of room for improvement. For example, we are currently looking into more sophisticated clustering approaches (e.g. spectral clustering, graph-based clustering), into other ways of representing the clustered information (e.g. using hierarchical instead of partitional clustering, so that users can "drill down" into the results by topic), and into architectural changes (e.g. a "feedback loop", so that search results can be fed back for further analysis).

So, if you have any remarks, notes or suggestions, we would be happy to hear from you :)

Cheers and looking forward to discussing this with you,
Max