mahout-user mailing list archives

From Grant Ingersoll
Subject Re: TU Berlin Winter of Code Project
Date Fri, 06 Nov 2009 19:47:00 GMT

On Nov 6, 2009, at 5:06 AM, Max Heimel wrote:

> Hello everybody,
> we are a group of 6 master's students at the Technical University of
> Berlin who are currently working on a winter term project using
> Mahout. Our so-called "Winter of Code" project is mentored by
> Isabel Drost and will run until February 2010. The goal of our project
> is to develop a cloud-based blog search engine - think: "google news
> for beginners ;)". The engine should be highly scalable and use
> Hadoop/Mahout to perform topical clustering and topic discovery for
> crawled blog entries.
> Based on suggestions by Isabel, we currently think of the following
> layered architecture:

I like the layered approach; it should make it easier for others to
adapt and use.

> I. Layer: Web-crawling
> A web-crawler (e.g. Heritrix) is provided with a set of known blog
> URLs to perform web crawls. Heritrix is configured with a simple text
> filter to only crawl URLs containing the word "blog" from a
> prespecified TLD (so we "know" which language the blog entries use).
> We plan on outputting the crawl data directly to HDFS (e.g. via
> hbase-writer).
> II. Layer: Preprocessing
> The data is probably not structured enough to be directly processable
> by a machine, so it has to be preprocessed. This step could e.g.
> consist of extracting the blog full text from the crawl, stemming it,
> finding named entities and tagging them.
> We currently think of using UIMA for this layer.

This could likely be done as M/R jobs too and contributed to Mahout's
utils module if so desired.
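
To make the map side of such a preprocessing job concrete, here is a minimal Python sketch. It only extracts visible text from the crawled HTML and tokenizes it; stemming and named-entity tagging are left out. The function name preprocess_map and the (url, html) record shape are assumptions for illustration, not anything taken from Mahout, Hadoop or UIMA.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def preprocess_map(url, html):
    """Hypothetical map step: (url, raw html) -> (url, lowercase tokens)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Keep runs of ASCII letters and digits; a real job would need a
    # language-aware tokenizer for non-English blogs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return url, tokens
```

Because each page is handled independently, a function of this shape maps naturally onto a map-only job keyed by URL.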

> III. Layer: Feature extraction
> In order to use clustering algorithms, we need to perform a feature
> extraction. This could e.g. consist of generating feature vectors, a
> similarity matrix, a link graph, etc. The goal of this layer is to
> have a representation of the web crawl that can be processed by
> Mahout. The feature extraction will likely be implemented via a
> custom-written Hadoop job.

It will be really useful to hear your feedback on what works and  
doesn't here, especially on noisy web data.
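
For readers who want something concrete, the simplest form of the feature vectors mentioned above is a raw term-frequency vector over a shared vocabulary. The sketch below is plain Python, not Mahout's vector classes, and the helper names build_vocabulary and term_frequency_vector are made up for illustration.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map every distinct token to a fixed column index (sorted for stability)."""
    vocab = sorted({token for tokens in docs for token in tokens})
    return {term: index for index, term in enumerate(vocab)}


def term_frequency_vector(tokens, vocab):
    """Dense vector of raw term counts; tokens outside the vocabulary are ignored."""
    counts = Counter(tokens)
    vector = [0] * len(vocab)
    for term, count in counts.items():
        index = vocab.get(term)
        if index is not None:
            vector[index] = count
    return vector


docs = [["hadoop", "mahout", "hadoop"], ["solr", "mahout"]]
vocab = build_vocabulary(docs)
vectors = [term_frequency_vector(d, vocab) for d in docs]
```

On real crawl data the vocabulary gets large and the vectors are mostly zeros, so a sparse representation would be the more realistic choice.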

> IV. Layer: Clustering
> This step consists of using a given Mahout clustering algorithm (or a
> newly implemented algorithm) to cluster the blogs based on the
> extracted features. For now we are probably going to use a very simple
> k-means clustering of word frequency. We plan to switch to a more
> sophisticated approach once the basic infrastructure is sound :)

It should be pretty straightforward to use the different  
implementations in Mahout here.  I'd really love to hear benchmarks,  
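
As a reference point for the "very simple k-means clustering of word frequency", here is a single-machine Python sketch of Lloyd's algorithm. It is not Mahout's implementation; seeding the centers with the first k points and running a fixed number of iterations are simplifying assumptions, and it expects k to be no larger than the number of points.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns (assignments, centers)."""
    # Assumption: seed the centers with the first k points (deterministic).
    centers = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers


points = [[2, 1, 0], [0, 1, 1], [3, 1, 0], [0, 2, 2]]
assignments, centers = kmeans(points, k=2)
```

The two steps inside the loop are the parts that parallelize: assignment is per-point work, and the center update is an aggregation per cluster.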

> V. Layer: Topic Discovery
> Once the blog entries are clustered, each cluster needs to be assigned
> a topic. This topic should be automatically determined from the blog
> entries inside the cluster. Again, for now we will probably use a
> very simple approach: e.g. use the most frequent words inside the
> cluster (or within the center of the cluster) as topic tags.

See the patch on log-likelihood up in Mahout's JIRA.  Feedback on this  
would be great.
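
To illustrate the idea behind log-likelihood scoring, independently of the JIRA patch, here is a Python sketch of Dunning's log-likelihood ratio (G^2) on a 2x2 table of term counts inside and outside a cluster. The function names llr and topic_tags are hypothetical, and the choice to keep only terms that are over-represented in the cluster is an assumption of this sketch.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: term count inside the cluster,  k12: other tokens inside,
    k21: term count outside the cluster, k22: other tokens outside.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for k, r, c in cells:
        if k > 0:  # empty cells contribute nothing
            g2 += k * math.log(k * total / (rows[r] * cols[c]))
    return 2.0 * g2


def topic_tags(cluster_docs, other_docs, top_n=3):
    """Rank terms that are over-represented in the cluster by LLR score."""
    in_counts = Counter(t for doc in cluster_docs for t in doc)
    out_counts = Counter(t for doc in other_docs for t in doc)
    in_total = sum(in_counts.values())
    out_total = sum(out_counts.values())
    scored = []
    for term, k11 in in_counts.items():
        k21 = out_counts.get(term, 0)
        # Skip terms that are not relatively more frequent inside the cluster.
        if k11 * out_total <= k21 * in_total:
            continue
        score = llr(k11, in_total - k11, k21, out_total - k21)
        scored.append((score, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_n]]


cluster = [["hadoop", "hadoop", "mahout"], ["hadoop", "cluster"]]
others = [["solr", "search", "mahout"], ["solr", "index"]]
tags = topic_tags(cluster, others, top_n=2)
```

Compared with picking the most frequent words, this favors words that are distinctive for the cluster, so a term such as "mahout" that is equally common elsewhere drops out.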

> VI. Layer: Search Engine
> In order to search for blogs, the tagged cluster centers and topics
> provided by Mahout need to be recombined with the information from the
> blog crawl. This recombined data should then be fed into a search
> engine, so users can search for a specific entry. We will probably use
> Solr for this step, tagging each blog entry with its respective
> cluster topic tag(s) and creating a search index on those tags.

I'd love to see the clustering stuff worked into Solr in the  
ClusteringComponent.  See contrib/clustering within Solr.  You might  
find moving this layer up closer to the crawl actually makes  
preprocessing and feature extraction a whole lot easier.

> VII. Layer: User Front-End
> This will probably be a simple web page that sends requests to the
> Search Engine layer and returns the results to the user.
> This is obviously only a first draft of what we think would be a
> suitable overall architecture, so there is probably lots of room for
> improvement. We are, for example, currently looking into several more
> sophisticated clustering approaches (e.g. spectral clustering,
> graph-based clustering), ways of representing the clustered
> information (e.g. using hierarchical instead of partitional
> clustering, so the user can "drill down" by topic into the results)
> or architectural changes (e.g. using a "feedback loop", so search
> results can be used for further analysis).
> So, if you have any remarks, notes or suggestions we would be happy to
> hear from you :)

Those all sound really good.  Looking forward to hearing more.

Grant Ingersoll
