From: Max Heimel <mheimel@googlemail.com>
To: mahout-user@lucene.apache.org
Date: Fri, 6 Nov 2009 14:06:50 +0100
Subject: TU Berlin Winter of Code Project

Hello everybody,

we are a group of six master's students at the Technical University of Berlin who are currently working on a winter term project using Mahout. Our so-called "Winter of Code" project is mentored by Isabel Drost and will run until February 2010. The goal of the project is to develop a cloud-based blog search engine - think "Google News for beginners" ;). The engine should be highly scalable and use Hadoop/Mahout to perform topical clustering and topic discovery on crawled blog entries. Based on suggestions by Isabel, we are currently thinking of the following layered architecture:

I. Layer: Web crawling
A web crawler (e.g. Heritrix) is seeded with a set of known blog URLs and performs the web crawls. Heritrix is configured with a simple text filter so that it only crawls URLs containing the word "blog" from a prespecified TLD (so we "know" which language the blog entries are written in). We plan to write the crawl data directly to HDFS (e.g. via hbase-writer).

II. Layer: Preprocessing
The crawled data is probably not structured enough to be directly processable by a machine, so it has to be preprocessed. This step could, for example, consist of extracting the blog full text from the crawl, stemming it, and finding and tagging named entities. We are currently thinking of using UIMA for this layer.

III. Layer: Feature extraction
In order to use clustering algorithms, we need to perform feature extraction. This could, for example, consist of generating feature vectors, a similarity matrix, a link graph, etc. The goal of this layer is a representation of the web crawl that Mahout can process. The feature extraction will likely be implemented as a custom-written Hadoop job.

IV. Layer: Clustering
This step uses an existing Mahout clustering algorithm (or a newly implemented one) to cluster the blogs based on the extracted features. For now we will probably start with a very simple k-means clustering of word frequencies and switch to a more sophisticated approach once the basic infrastructure is sound :)

V. Layer: Topic Discovery
Once the blog entries are clustered, each cluster needs to be assigned a topic. This topic should be determined automatically from the blog entries inside the cluster. Again, for now we will probably use a very simple approach, e.g. taking the most frequent words inside the cluster (or near its center) as topic tags.

VI. Layer: Search Engine
In order to search for blogs, the tagged cluster centers and topics provided by Mahout need to be recombined with the information from the blog crawl. This recombined data is then fed into a search engine so that users can search for a specific entry. We will probably use Solr for this step, tagging each blog entry with its respective cluster topic tag(s) and creating a search index on those tags.

VII. Layer: User Front-End
This will probably be a simple web page that sends requests to the Search Engine layer and presents the results to the user.

Very rough code sketches of what layers III to VI could look like follow below.
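For layer III, here is a minimal sketch of the kind of map-only Hadoop job we have in mind. It assumes each preprocessed blog entry arrives as one text line of the form "<blogId><TAB><preprocessed text>"; the class name BlogTermFrequency, the input layout, and the "term:count" output encoding are just placeholders for illustration.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical feature-extraction job: one input line per blog entry,
// "<blogId>\t<preprocessed text>"; one output record per entry holding its
// sparse term-frequency vector as "term:count" pairs.
public class BlogTermFrequency {

  public static class TfMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length < 2) {
        return; // skip malformed records
      }
      Map<String, Integer> counts = new HashMap<String, Integer>();
      for (String token : parts[1].toLowerCase().split("\\W+")) {
        if (token.length() > 2) { // crude short-token filter
          Integer old = counts.get(token);
          counts.put(token, old == null ? 1 : old + 1);
        }
      }
      StringBuilder vector = new StringBuilder();
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        vector.append(e.getKey()).append(':').append(e.getValue()).append(' ');
      }
      // key = blog id, value = sparse term-frequency vector
      context.write(new Text(parts[0]), new Text(vector.toString().trim()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "blog term frequencies");
    job.setJarByClass(BlogTermFrequency.class);
    job.setMapperClass(TfMapper.class);
    job.setNumReduceTasks(0); // map-only: one vector per blog entry
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}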
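For layer IV we would of course use Mahout's distributed k-means rather than write our own. Purely to illustrate the "k-means over word-frequency vectors" idea, here is a stripped-down in-memory version (this is not the Mahout API):

import java.util.Random;

// Purely illustrative in-memory k-means over dense word-frequency vectors.
// The actual system would use Mahout's distributed k-means implementation.
public class SimpleKMeans {

  // points: one word-frequency vector per blog entry; returns the cluster
  // index assigned to each entry after the given number of iterations.
  public static int[] cluster(double[][] points, int k, int iterations) {
    Random random = new Random(42);
    double[][] centroids = new double[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = points[random.nextInt(points.length)].clone();
    }
    int[] assignment = new int[points.length];
    for (int iter = 0; iter < iterations; iter++) {
      // Assignment step: each point goes to its nearest centroid.
      for (int p = 0; p < points.length; p++) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double dist = 0;
          for (int d = 0; d < points[p].length; d++) {
            double diff = points[p][d] - centroids[c][d];
            dist += diff * diff;
          }
          if (dist < bestDist) {
            bestDist = dist;
            best = c;
          }
        }
        assignment[p] = best;
      }
      // Update step: each centroid becomes the mean of its cluster.
      double[][] sums = new double[k][points[0].length];
      int[] counts = new int[k];
      for (int p = 0; p < points.length; p++) {
        counts[assignment[p]]++;
        for (int d = 0; d < points[p].length; d++) {
          sums[assignment[p]][d] += points[p][d];
        }
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          for (int d = 0; d < sums[c].length; d++) {
            centroids[c][d] = sums[c][d] / counts[c];
          }
        }
      }
    }
    return assignment;
  }
}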
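For layer V, the "most frequent words as topic tags" idea could look roughly like this; class and method names are placeholders, and the length check stands in for a proper stop-word list.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative topic labelling: take the n most frequent terms across all
// entries of a cluster and use them as the cluster's topic tags.
public class FrequentTermLabeler {

  public static List<String> topicTags(List<String> clusterTexts, int n) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String text : clusterTexts) {
      for (String token : text.toLowerCase().split("\\W+")) {
        if (token.length() > 3) { // crude filter against stop words
          Integer old = counts.get(token);
          counts.put(token, old == null ? 1 : old + 1);
        }
      }
    }
    List<Map.Entry<String, Integer>> entries =
        new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
      public int compare(Map.Entry<String, Integer> a,
                         Map.Entry<String, Integer> b) {
        return b.getValue().compareTo(a.getValue()); // most frequent first
      }
    });
    List<String> tags = new ArrayList<String>();
    for (int i = 0; i < Math.min(n, entries.size()); i++) {
      tags.add(entries.get(i).getKey());
    }
    return tags;
  }
}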
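For layer VI, a sketch of feeding one tagged blog entry into Solr via SolrJ. The field names (id, url, text, topic) are placeholders that would have to be defined in the Solr schema, and the URL assumes a local Solr instance; in the real pipeline the documents would of course come from the crawl plus the cluster assignments, not from hard-coded strings.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Illustrative indexing step: each blog entry is stored together with the
// topic tags of the cluster it was assigned to in layer V.
public class BlogIndexer {

  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "blog-entry-42");
    doc.addField("url", "http://example-blog.org/entry-42");
    doc.addField("text", "full text of the preprocessed blog entry ...");
    doc.addField("topic", "election");   // cluster topic tag from layer V
    doc.addField("topic", "parliament"); // multi-valued field

    server.add(doc);
    server.commit();
  }
}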
This is obviously only a first draft of what we think would be a suitable overall architecture, so there is probably lots of room for improvement. For example, we are currently looking into more sophisticated clustering approaches (e.g. spectral clustering, graph-based clustering), into other ways of representing the clustered information (e.g. using hierarchical instead of partitional clustering, so that users can "drill down" into the results by topic), and into architectural changes (e.g. a "feedback loop", so that search results can be fed back for further analysis).

So, if you have any remarks, notes or suggestions, we would be happy to hear from you :)

Cheers and looking forward to discussing this with you,
Max