Hi Drew,
We are currently using an HTML filter module from the University of
Duisburg-Essen, which can be found here:
http://www.is.informatik.uni-duisburg.de/projects/java-unidu/filter.html
Another idea was to try Jericho or NekoHTML.
http://www.java2s.com/Product/Java/Development/HTML-Parser.htm
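
To make clear what kind of tag stripping we need, here is a minimal sketch
using only the Java standard library. It is a naive regex-based approach and
not how the unidu filter, Jericho or NekoHTML work; the class name
NaiveHtmlStripper and the short entity list are assumptions made for this
example. A real parser will cope with malformed markup much better.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: strips markup with regular
// expressions instead of building a real HTML parse tree.
class NaiveHtmlStripper {
    // (?is): case-insensitive, and '.' also matches line breaks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String strip(String html) {
        // Drop script/style elements together with their content.
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        // Replace every remaining tag with a space so that words separated
        // only by markup do not run together.
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; "&amp;" goes last so that
        // "&amp;lt;" becomes "&lt;" and is not decoded twice.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

For example, NaiveHtmlStripper.strip("<p>Hello <b>world</b></p>") returns
"Hello world". Because every tag becomes a space, inline markup inside a
word would split that word, which is one more reason to prefer a real parser.
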
Thanks for your advice. We will test it and let you know whether it
works well.
Marc
Drew Farris wrote:
> Hi Marc,
>
> How are you planning on cleaning up the HTML documents?
>
> Perhaps something like this would be useful. I came across an
> interesting approach a few days ago, and it would be interesting to hear
> from someone who has tried it:
> http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
>
> It is described further, with Java implementations, here:
> http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
>
> Drew
>
> On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <mail@marc-hofer.de> wrote:
>> Hello everybody,
>>
>> Having already presented the draft of our architecture, I would now like to
>> discuss the second layer in more detail. As mentioned before, we have chosen
>> UIMA for this layer. The main aggregate currently consists of the Whitespace
>> Tokenizer Annotator, the Snowball Annotator (stemming) and a list-based
>> StopwordFilter. Before processing this aggregate in a map-only job in
>> Hadoop, we want to strip all HTML tags and forward only the preprocessed
>> text to the aggregate. The reason is that it is difficult to change the
>> document during processing in UIMA, and it is impractical to work on
>> documents containing HTML tags all the time.
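
The token flow through the aggregate described above can be sketched in plain
Java. This is only an illustration of the order of the steps, not the UIMA
annotators themselves: the class name PreprocessingSketch and the tiny
stopword list are assumptions, and stemming is left out because it needs the
Snowball library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the tokenizer and stopword steps of the aggregate.
class PreprocessingSketch {
    // Assumed example list; a real list-based filter would load it from a file.
    static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "and", "of", "is", "in", "to"));

    // Splits on runs of whitespace, similar in spirit to a whitespace tokenizer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Drops every token whose lower-cased form is in the stopword list.
    static List<String> removeStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOPWORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Calling removeStopwords(tokenize(text)) on the already tag-free text gives the
token list that a stemmer would then receive.
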
>>
>> Furthermore, we are planning to add the Tagger Annotator, which implements a
>> Hidden Markov Model tagger. Here we aren't sure which tokens to keep for
>> feature extraction and which to delete, based on their part-of-speech tags.
>> One option would be to use only nouns and verbs at the very beginning.
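
Keeping only nouns and verbs could look roughly like the sketch below. It
assumes Penn Treebank style tags (NN..., VB...), which may not match the tag
set of the model actually used by the Tagger Annotator; the class and method
names are hypothetical as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical filter over (token, tag) pairs produced by some POS tagger.
class PosFilterSketch {
    static final class TaggedToken {
        final String text;
        final String tag;

        TaggedToken(String text, String tag) {
            this.text = text;
            this.tag = tag;
        }
    }

    // Assumption: noun tags start with "NN" and verb tags with "VB",
    // as in the Penn Treebank tag set.
    static boolean isNounOrVerb(String tag) {
        return tag.startsWith("NN") || tag.startsWith("VB");
    }

    // Returns the surface forms of all tokens tagged as noun or verb.
    static List<String> keepNounsAndVerbs(List<TaggedToken> tokens) {
        List<String> kept = new ArrayList<>();
        for (TaggedToken token : tokens) {
            if (isNounOrVerb(token.tag)) {
                kept.add(token.text);
            }
        }
        return kept;
    }
}
```

Other tag classes, such as adjectives, could be added later by extending
isNounOrVerb, so the first experiments can start with the smallest feature set.
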
>>
>> We are very interested in your comments and remarks and it would be nice to
>> hear from you.
>>
>> Cheers,
>> Marc
>>
>
>