mahout-user mailing list archives

From Drew Farris <>
Subject Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing
Date Sat, 28 Nov 2009 20:08:41 GMT
Hi Marc,

How are you planning on cleaning up the HTML documents?

I came across an interesting approach a few days ago; perhaps something
like it would be useful here, and it would be good to hear from someone
who has tried it:

Described further, with Java implementations, here:


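As a starting point before any boilerplate detection, a naive first pass could just strip the markup itself. A minimal sketch (class and method names are hypothetical, and a real pipeline would also decode entities and drop navigation boilerplate):

```java
// Hedged sketch: a regex-based tag stripper for a first-pass cleanup.
// This is not the approach referenced above, just a simple baseline.
public class HtmlStripper {

    /**
     * Remove script/style blocks (including their contents), then all
     * remaining tags, then collapse runs of whitespace.
     */
    public static String strip(String html) {
        // Drop <script>...</script> and <style>...</style> wholesale,
        // since their contents are not document text.
        String noScripts =
            html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        // Replace any remaining tag with a space so adjacent words
        // don't run together.
        String noTags = noScripts.replaceAll("(?s)<[^>]+>", " ");
        // Collapse whitespace left behind by the removed tags.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String doc = "<html><head><style>p{color:red}</style></head>"
                   + "<body><p>Hello <b>Mahout</b> world</p></body></html>";
        System.out.println(strip(doc)); // prints "Hello Mahout world"
    }
}
```

A regex pass like this is obviously fragile on malformed HTML; an HTML parser would be more robust, but this shows the shape of the cleanup step.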
On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <> wrote:
> Hello everybody,
> Having already presented the draft of our architecture, I would now like to
> discuss the second layer in more detail. As mentioned before, we have chosen
> UIMA for this layer. The main aggregate currently consists of the Whitespace
> Tokenizer Annotator, the Snowball Annotator (Stemming) and a list-based
> StopwordFilter. Before running this aggregate in a map-only Hadoop job, we
> want to filter out all HTML tags and forward only the preprocessed data to
> the aggregate. The reason is that it is difficult to modify a document
> during processing in UIMA, and it is impractical to work throughout on
> documents that still contain HTML tags.
> Furthermore, we are planning to add the Tagger Annotator, which implements a
> Hidden Markov Model tagger. Here we aren't sure which tokens, based on their
> part-of-speech tags, should be discarded and which should be kept for
> feature extraction. One option could be to start with only nouns and verbs.
> We are very interested in your comments and remarks and would be glad to
> hear from you.
> Cheers,
> Marc
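
Regarding the part-of-speech filtering: if the HMM tagger emits Penn Treebank tags, keeping only nouns and verbs could be as simple as a prefix check on the tag (the class and method names below are hypothetical, not actual UIMA API):

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: filter a tagged token stream down to nouns and verbs,
// assuming Penn Treebank tags (NN, NNS, NNP, ... / VB, VBD, VBZ, ...).
public class PosFilter {

    /**
     * Keep tokens whose tag starts with "NN" (nouns) or "VB" (verbs).
     * tokens and tags are parallel lists, one tag per token.
     */
    public static List<String> keepNounsAndVerbs(List<String> tokens,
                                                 List<String> tags) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tag = tags.get(i);
            if (tag.startsWith("NN") || tag.startsWith("VB")) {
                kept.add(tokens.get(i));
            }
        }
        return kept;
    }
}
```

In a UIMA aggregate this would more naturally be an annotator that consumes the Tagger Annotator's POS annotations, but the filtering decision itself reduces to a check like the one above.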
