uima-user mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: UIMA- Support for HTML, PDF, Doc files
Date Thu, 29 Sep 2011 08:53:27 GMT

UIMA itself is just a framework for building analysis pipelines. To analyze
HTML, PDF, or Word documents, you need a component that can extract the
text from these formats before the analysis runs.

You can use Apache Tika together with our Tika integration in the addons 
to extract text from various data formats.
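
To make the "extract first, then analyze" idea concrete, here is a minimal sketch in plain Java. The class ExtractThenAnalyze and its extractText method are hypothetical names made up for this illustration; they are not part of UIMA or Tika. The regex-based tag stripping is only a naive stand-in for a real extractor and handles HTML only. For real HTML, PDF, or Word input you would let Tika do this step instead.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of UIMA or Tika.
class ExtractThenAnalyze {

    // Drop <script> and <style> elements together with their content.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag, e.g. <p>, </b>, <br/>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Naive HTML-to-text conversion (assumption: small, well-formed input).
    static String extractText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a handful of common entities; "&amp;" goes last so that
        // "&amp;lt;" is not decoded twice.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>UIMA&nbsp;analyzes <b>plain</b> text.</p></body></html>";
        String text = extractText(html);
        // In a UIMA pipeline, this extracted string is what you would hand
        // to the CAS as the document text before running the analysis engine.
        System.out.println(text);
    }
}
```

The point of the sketch is only the ordering: the markup is reduced to plain text first, and only that plain text is given to the analysis pipeline.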


On 9/29/11 8:28 AM, abhishek wrote:
> Hi,
> While reading the documentation of UIMA, I found out that UIMA supports HTML.
>
> However, when I run the org.apache.uima.tools.docanalyzer.DocumentAnalyzer class,
> it fails to understand the text.
>
> Kindly let me know the correct way to read these types of files.
