mahout-user mailing list archives

From Isabel Drost <isa...@apache.org>
Subject Re: TU Berlin Winter of Code Project
Date Wed, 11 Nov 2009 00:52:30 GMT
On Friday 06 November 2009 20:47:00 Grant Ingersoll wrote:
> On Nov 6, 2009, at 5:06 AM, Max Heimel wrote:
> > II. Layer: Preprocessing
> > The data is probably not structured enough to be directly processable
> > by a machine, so it has to be preprocessed. This
> > step could e.g. consist of extracting the blog fulltext from the
> > crawl, stemming it, finding named entities and tagging them.
> > We currently think of using UIMA for this layer.
>
> This could likely be done as M/R jobs too and contributed to Mahout
> utils module if so desired.

+1 

Though I know of code* at TU for retrieving blog URLs via Yahoo! BOSS 
and "guessing" the RSS feed URL. In a first iteration this might be a nice 
way of getting around the problem of having to parse the HTML code and 
separate blog postings from comments and navigational code.
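
To illustrate the "guessing" idea, here is a minimal sketch in Python. It is
not the TU code mentioned above: the function name candidate_feed_urls and the
list of feed paths are assumptions made up for this example, based only on
locations where blog software commonly exposes feeds.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical list of common feed locations (an assumption for this sketch,
# not taken from the TU code).
COMMON_FEED_PATHS = ("feed", "rss", "rss.xml", "atom.xml", "index.xml")


def candidate_feed_urls(blog_url):
    """Return guessed feed URLs for a blog URL, blog-level guesses first."""
    parts = urlsplit(blog_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("expected an absolute http(s) URL: %r" % blog_url)

    # Site root, e.g. http://example.org/
    root = "%s://%s/" % (parts.scheme, parts.netloc)

    # Treat the blog URL itself as a directory so that urljoin appends to it
    # instead of replacing its last path segment.
    base = root + parts.path.lstrip("/")
    if not base.endswith("/"):
        base += "/"

    candidates = []
    for prefix in (base, root):
        for path in COMMON_FEED_PATHS:
            url = urljoin(prefix, path)
            if url not in candidates:  # skip duplicates when base == root
                candidates.append(url)
    return candidates


if __name__ == "__main__":
    for url in candidate_feed_urls("http://example.org/blog"):
        print(url)
```

A crawler would then request each candidate in order and keep the first one
that actually returns feed content; that network step is left out here so the
sketch stays self-contained.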

Isabel


* That code is fine to publish under the Apache Software License, according to 
the guys at the research group.

-- 
QOTD: If you lose a son you can always get another, but there's only one 
Maltese Falcon.   -- Sidney Greenstreet, "The Maltese Falcon" 
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_  
 |,4-  ) )-,_..;\ (  `'-' 
'---''(_/--'  `-'\_) (fL)  IM:  <xmpp://MaineC.@spaceboyz.net>

