mahout-user mailing list archives

From Marc Hofer <m...@marc-hofer.de>
Subject Re: TU Berlin Winter of Code Project - II. Layer: Preprocessing
Date Sat, 28 Nov 2009 20:30:27 GMT
Hi Drew,

currently we are using an HTML filter module from the University of 
Duisburg-Essen, which can be found here: 
http://www.is.informatik.uni-duisburg.de/projects/java-unidu/filter.html

Another idea was to try Jericho or NekoHTML.
http://www.java2s.com/Product/Java/Development/HTML-Parser.htm

Thanks for your advice; we will test it and let you know whether it 
works well.
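For reference, here is a minimal tag-stripping sketch using only the JDK. It is a regex-based illustration, not what any of the libraries above actually do internally; a real parser such as Jericho or NekoHTML handles malformed markup, nested scripts, and entities far more robustly.

```java
// Minimal HTML-to-text sketch using only the JDK (regex-based, illustration only).
// A proper parser (Jericho, NekoHTML) should be preferred for real documents.
public class HtmlStrip {
    public static String strip(String html) {
        return html
            // Drop script/style blocks including their contents.
            .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
            // Drop all remaining tags.
            .replaceAll("(?s)<[^>]+>", " ")
            // Collapse runs of whitespace left behind by removed tags.
            .replaceAll("\\s+", " ")
            .trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                    + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(strip(html)); // prints "Hello world"
    }
}
```

The output of such a step would then be handed to the UIMA aggregate as plain text.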

Marc

Drew Farris schrieb:
> Hi Marc,
> 
> How are you planning on cleaning up the HTML documents?
> 
> Perhaps something like this would be useful: I came across an
> interesting approach a few days ago; it would be interesting to hear
> more from anyone who has tried something like it:
> http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
> 
> Described further, with Java implementations, here:
> http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
> 
> Drew
> 
> On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <mail@marc-hofer.de> wrote:
>> Hello everybody,
>>
>> having already presented the draft of our architecture, I would now like to
>> discuss the second layer in more detail. As mentioned before, we have chosen
>> UIMA for this layer. The main aggregate currently consists of the Whitespace
>> Tokenizer Annotator, the Snowball Annotator (stemming) and a list-based
>> stopword filter. Before running this aggregate in a map-only Hadoop job, we
>> want to strip all HTML tags and forward only the preprocessed data to the
>> aggregate. The reason is that it is difficult to change a document during
>> processing in UIMA, and it is impractical to work the whole time on
>> documents containing HTML tags.
>>
>> Furthermore, we are planning to add the Tagger Annotator, which implements a
>> Hidden Markov Model tagger. Here we are not sure which tokens to keep or
>> discard, based on their part-of-speech tags, when using them for feature
>> extraction. One approach could be to start with only nouns and verbs.
>>
>> We are very interested in your comments and remarks and it would be nice to
>> hear from you.
>>
>> Cheers,
>> Marc
>>
> 
> 
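The part-of-speech filtering mentioned in the quoted mail could be sketched roughly as below. This assumes the tagger emits Penn Treebank-style tags (NN*, VB*, DT, RB, ...), which is a common convention but should be checked against what the Tagger Annotator actually produces; the class and method names here are purely illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Sketch: keep only nouns and verbs by Penn Treebank tag prefix.
// Assumes parallel token/tag lists, as a tokenizer + tagger aggregate would yield.
public class PosFilter {
    static boolean keep(String posTag) {
        // NN, NNS, NNP... are nouns; VB, VBD, VBG... are verbs.
        return posTag.startsWith("NN") || posTag.startsWith("VB");
    }

    static List<String> filter(List<String> tokens, List<String> tags) {
        return IntStream.range(0, tokens.size())
            .filter(i -> keep(tags.get(i)))
            .mapToObj(tokens::get)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "sat", "quickly");
        List<String> tags   = Arrays.asList("DT",  "NN",  "VBD", "RB");
        System.out.println(filter(tokens, tags)); // prints [cat, sat]
    }
}
```

Whether dropping everything else helps or hurts feature quality would need to be measured on the actual corpus.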

