mahout-user mailing list archives

From Marc Hofer <>
Subject Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing
Date Sat, 28 Nov 2009 20:30:27 GMT
Hi Drew,

currently we are using an HTML filter module from the University of 
Duisburg-Essen, which can be found here:

Another idea was to try Jericho or NekoHTML.
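For illustration, here is a rough plain-Java sketch of the kind of tag stripping such libraries perform. It is not the actual Jericho or NekoHTML API; real parsers handle entities, comments, and malformed markup far more robustly, so this is only a stand-in to show the intended transformation:

```java
// Rough sketch of HTML tag stripping, for illustration only.
// Parsers like Jericho or NekoHTML handle entities, comments,
// and broken markup much more robustly than these regexes.
public class HtmlStripper {
    public static String strip(String html) {
        // Drop <script> and <style> blocks entirely, then remove all tags.
        String noScripts = html
            .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        String noTags = noScripts.replaceAll("(?s)<[^>]+>", " ");
        // Collapse the whitespace runs left behind by removed markup.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<h1>Hello</h1><p>World</p>")); // Hello World
    }
}
```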

Thanks for your advice; we will test it and let you know whether it 
works well.
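For reference, the tokenize and stop-word steps of the pipeline described below could be sketched in plain Java roughly as follows. In the actual project these run as UIMA annotators (Whitespace Tokenizer Annotator, list-based StopwordFilter), and the Snowball stemming step is omitted here for brevity; the stop-word list is a made-up sample:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Plain-Java sketch of whitespace tokenization plus list-based
// stop-word filtering; the real pipeline wires these up as UIMA
// annotators and adds Snowball stemming. Stop-word list is illustrative.
public class Preprocess {
    private static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to"));

    public static List<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
            .filter(t -> !t.isEmpty())
            .filter(t -> !STOPWORDS.contains(t))   // drop stop words
            .collect(Collectors.toList());
    }
}
```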


Drew Farris schrieb:
> Hi Marc,
> How are you planning on cleaning up the HTML documents?
> Perhaps something like this would be useful: I came across an
> interesting approach a few days ago, it would be interesting to hear
> more from someone who has tried something like this:
> Described further, with java implementations here:
> Drew
> On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <> wrote:
>> Hello everybody,
>> Having already presented the draft of our architecture, I would now like to
>> discuss the second layer in more detail. As mentioned before, we have chosen
>> UIMA for this layer. The main aggregate currently consists of the Whitespace
>> Tokenizer Annotator, the Snowball Annotator (stemming) and a list-based
>> StopwordFilter. Before running this aggregate in a map-only Hadoop job, we
>> want to strip all HTML tags and forward only the preprocessed data to the
>> aggregate. The reason is that it is difficult to modify a document during
>> processing in UIMA, and it is impractical to keep working on documents that
>> still contain HTML tags.
>> Furthermore, we are planning to add the Tagger Annotator, which implements a
>> Hidden Markov Model tagger. Here we are not yet sure which tokens, based on
>> their part-of-speech tags, should be discarded and which should be kept for
>> feature extraction. One option could be to start by using only nouns and
>> verbs.
>> We are very interested in your comments and remarks and it would be nice to
>> hear from you.
>> Cheers,
>> Marc
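The "only nouns and verbs" idea from the quoted message could be sketched as a simple tag-based filter. This assumes Penn Treebank style tags (NN* for nouns, VB* for verbs); the token/tag pairs here are illustrative, not actual Tagger Annotator output:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of filtering tagged tokens down to nouns and verbs, assuming
// Penn Treebank style tags (NN* = nouns, VB* = verbs). Input is an
// ordered map of token -> tag; real tagger output may differ.
public class PosFilter {
    public static List<String> keepNounsAndVerbs(Map<String, String> tagged) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : tagged.entrySet()) {
            String tag = e.getValue();
            if (tag.startsWith("NN") || tag.startsWith("VB")) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```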
