mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: TU Berlin Winter of Code Project
Date Fri, 06 Nov 2009 19:54:43 GMT
Named entity extraction and feature extraction are both going to be very
challenging in the web environment.

On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <gsingers@apache.org> wrote:

>
>> II. Layer: Preprocessing
>> The data is probably not structured enough to be directly processable
>> by a machine, so it has to be preprocessed. This
>> step could e.g. consist of extracting the blog full text from the
>> crawl, stemming it, and finding and tagging named entities.
>> We currently think of using UIMA for this layer.
>>
>
> This could likely be done as M/R jobs too and contributed to the Mahout
> utils module if so desired.
>
>
>
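The preprocessing described above (tokenization and stemming) could be sketched roughly as follows. This is a minimal, hypothetical illustration, not Mahout or UIMA code: the class and method names are invented, and the toy suffix-stripper stands in for a real stemmer such as the Porter stemmer (and a real pipeline would also handle named-entity tagging, which UIMA annotators provide).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the preprocessing layer: lowercase, tokenize,
// and apply a crude suffix-stripping "stemmer". A real pipeline would
// use UIMA annotators and a proper Porter stemmer instead.
public class Preprocess {

    // Split raw text into lowercase alphabetic tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Toy stemmer: strips a few common English suffixes.
    static String stem(String token) {
        for (String suffix : new String[] {"ing", "ies", "ed", "s"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    public static void main(String[] args) {
        for (String tok : tokenize("Clustering blogs, finding named entities!")) {
            System.out.println(stem(tok));
        }
    }
}
```

Each stage here maps naturally to an M/R job: the mapper takes a crawled page as input and emits the stemmed token stream keyed by document ID.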
>> III. Layer: Feature extraction
>> In order to use clustering algorithms, we need to perform feature
>> extraction. This could e.g. consist of generating feature vectors, a
>> similarity matrix, a link graph, etc. The goal of this layer is to
>> have a representation of the web crawl that can be processed by
>> Mahout. The feature extraction will likely be implemented via a
>> custom-written Hadoop job.
>>
>
> It would be really useful to hear your feedback on what works and what
> doesn't here, especially on noisy web data.
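To make the feature-extraction idea concrete, here is a minimal, self-contained sketch of one of the representations mentioned (term-frequency vectors plus a pairwise cosine similarity, from which a similarity matrix could be built). The class and method names are hypothetical; the actual layer would be a Hadoop job emitting Mahout Vector instances rather than plain Java maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the feature-extraction layer: turn documents
// into term-frequency vectors and compare them by cosine similarity.
// Illustrative only -- not the Mahout API.
public class Features {

    // Build a sparse term-frequency vector from a whitespace-tokenized document.
    static Map<String, Integer> termFrequencies(String doc) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String term : doc.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity between two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
            normA += e.getValue() * (double) e.getValue();
        }
        for (int v : b.values()) {
            normB += v * (double) v;
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("mahout clustering on web data");
        Map<String, Integer> d2 = termFrequencies("clustering noisy web data");
        System.out.printf("similarity = %.3f%n", cosine(d1, d2));
    }
}
```

On noisy web data the raw term frequencies would likely need TF-IDF weighting and boilerplate removal before the similarities become meaningful, which is exactly the kind of feedback asked for above.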




-- 
Ted Dunning, CTO
DeepDyve
