mahout-user mailing list archives

From Drew Farris <drew.far...@gmail.com>
Subject Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing
Date Sat, 28 Nov 2009 20:08:41 GMT
Hi Marc,

How are you planning on cleaning up the HTML documents?

Perhaps something like this would be useful: I came across an
interesting approach a few days ago, and it would be great to hear
from anyone who has tried something like it:
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

It's described further, with Java implementations, here:
http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
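
The rough idea, as I understand it, is to score each chunk of the page by
how much plain text it carries relative to markup, and keep only the dense
chunks. A minimal Java sketch of that idea (the scoring and the threshold
here are just illustrative guesses, not the algorithm from the articles):

public class TextDensityExtractor {

    /**
     * Score each line by the ratio of plain text to markup and keep only
     * the dense lines. Tags spanning multiple lines are not handled; this
     * is only meant to show the shape of the approach.
     */
    public static String extract(String html, double threshold) {
        StringBuilder result = new StringBuilder();
        for (String line : html.split("\n")) {
            StringBuilder text = new StringBuilder();
            int tagChars = 0;
            boolean inTag = false;
            for (char c : line.toCharArray()) {
                if (c == '<') {
                    inTag = true;
                }
                if (inTag) {
                    tagChars++;
                } else {
                    text.append(c);
                }
                if (c == '>') {
                    inTag = false;
                }
            }
            int textChars = text.toString().trim().length();
            double density = (double) textChars / (textChars + tagChars + 1);
            if (textChars > 0 && density >= threshold) {
                result.append(text.toString().trim()).append('\n');
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"nav\"><a href=\"/\">Home</a></div>\n"
                + "<p>This sentence is mostly plain text and should survive.</p>";
        System.out.println(extract(html, 0.5));
    }
}

I believe the articles also smooth the scores over neighbouring blocks
rather than deciding line by line, which should make it less brittle.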

Drew

On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <mail@marc-hofer.de> wrote:
> Hello everybody,
>
> having already presented the draft of our architecture, I would now like to
> discuss the second layer in more detail. As mentioned before, we have chosen
> UIMA for this layer. The main aggregate currently consists of the Whitespace
> Tokenizer Annotator, the Snowball Annotator (stemming) and a list-based
> StopwordFilter. Before running this aggregate in a map-only Hadoop job, we
> want to filter out all HTML tags and forward only the preprocessed text to
> the aggregate. The reason for this is that it is difficult to modify a
> document during processing in UIMA, and it is impractical to work the whole
> time on documents that still contain HTML tags.
>
> Furthermore, we are planning to add the Tagger Annotator, which implements a
> Hidden Markov Model tagger. Here we aren't sure which tokens, based on their
> part-of-speech tags, should be dropped and which should be kept for feature
> extraction. One option could be to start with only nouns and verbs.
>
> We are very interested in your comments and remarks and it would be nice to
> hear from you.
>
> Cheers,
> Marc
>
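
Regarding the part-of-speech filtering: once the Tagger Annotator has
assigned tags, the keep/drop decision itself is simple. A plain-Java sketch
of filtering down to nouns and verbs (Penn Treebank style tags assumed here;
this is not the UIMA annotator itself, just the selection rule):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PosFilter {

    /** Keep nouns (NN, NNS, NNP, ...) and verbs (VB, VBD, VBZ, ...). */
    static boolean keep(String posTag) {
        return posTag.startsWith("NN") || posTag.startsWith("VB");
    }

    /** tokens and tags are parallel lists, as a tagger would produce them. */
    static List<String> filterTokens(List<String> tokens, List<String> tags) {
        List<String> kept = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            if (keep(tags.get(i))) {
                kept.add(tokens.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "tokenizer", "splits", "raw", "text");
        List<String> tags   = Arrays.asList("DT", "NN", "VBZ", "JJ", "NN");
        // prints [tokenizer, splits, text]
        System.out.println(filterTokens(tokens, tags));
    }
}

Starting with only nouns and verbs and then checking what the feature
vectors look like sounds like a reasonable first experiment.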
