uima-user mailing list archives

From Sam Fisher <safi...@gmail.com>
Subject parsing html as a string from getDocumentText
Date Wed, 12 Mar 2008 17:29:32 GMT
Hi All,

Having played around with plain text files in UIMA, I'm now feeding an 
HTML file to the Document Analyzer.  The JCas holds the contents of this 
file, both markup and text, as a single text string.  After reading 
through the MarkMail archives, I decided to try the Jericho HTML parser 
to extract the plain text content from the HTML string (e.g. String 
theHtml = jcas.getDocumentText()).  I'm probably not using Jericho 
correctly, because the parser's output is the same as its input (not 
stripped down to only the text content).

So that I bark up the right tree: does the CAS force some kind of 
encoding, like UTF-8, that might blind the parser to the markup tags in 
the HTML string?  This seems ridiculous, but I thought I'd ask.
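
One way to narrow this down is a sanity check that uses only the JDK, 
with no Jericho or UIMA involved.  The sketch below is an assumption 
about how such a check could look, not Jericho's API: HtmlTextSketch, 
containsMarkup and stripTags are hypothetical names, and the hard-coded 
string stands in for the value returned by jcas.getDocumentText().  A 
Java String is a sequence of UTF-16 characters whatever encoding the 
file had on disk, so if containsMarkup reports true for the real 
document text, the tags are visible to any parser handed that string.

```java
import java.util.regex.Pattern;

// Hypothetical helper for a sanity check; not part of UIMA or Jericho.
final class HtmlTextSketch {

    // Crude patterns: good enough for a quick check, not a real HTML parser.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // True if the string contains something that looks like a tag.
    static boolean containsMarkup(String text) {
        return TAG.matcher(text).find();
    }

    // Drops script/style blocks and tags, then collapses whitespace.
    // Character entities such as &amp; are left undecoded.
    static String stripTags(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String noTags = TAG.matcher(noScripts).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for: String theHtml = jcas.getDocumentText();
        String theHtml =
                "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(containsMarkup(theHtml));
        System.out.println(stripTags(theHtml));
    }
}
```

If the tags show up here but the Jericho output still contains them, 
the likely suspect is the way the parser is being called rather than 
anything the CAS does to the string.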

Has anyone had success using Jericho with UIMA?

Many thanks,

