uima-user mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: UIMA- Support for HTML, PDF, Doc files
Date Thu, 29 Sep 2011 08:53:27 GMT
Hello,

UIMA itself is just a framework for building analysis pipelines. To analyze
HTML, PDF, or Word documents, you need a component that can extract the
text from these formats.

You can use Apache Tika, together with our Tika integration in the addons
project, to extract text from various data formats.
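
To illustrate why the extraction step matters, here is a minimal sketch in
plain Java. It is not the Tika integration from the addons project: the
TextExtractor interface and the NaiveHtmlExtractor class are hypothetical
names made up for this example, and the regex-based tag stripping is only a
crude stand-in for a real parser. In an actual setup, Tika's facade method
Tika.parseToString would take the place of extract(), and the resulting
string would be handed to the CAS with setDocumentText before the analysis
engine runs.

```java
import java.util.regex.Pattern;

public class PlainTextExtraction {

    /** Hypothetical extraction step: turns a raw document into plain text. */
    interface TextExtractor {
        String extract(String rawDocument);
    }

    /**
     * Naive stand-in for a real extractor such as Apache Tika (assumption:
     * illustration only). It drops script/style blocks, strips tags, decodes
     * a handful of entities, and collapses whitespace. It does not handle
     * PDF or Word files, comments, or malformed markup.
     */
    static final class NaiveHtmlExtractor implements TextExtractor {
        private static final Pattern SCRIPT_OR_STYLE =
                Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
        private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
        private static final Pattern WHITESPACE = Pattern.compile("\\s+");

        @Override
        public String extract(String rawDocument) {
            // Remove script and style blocks, including their content.
            String text = SCRIPT_OR_STYLE.matcher(rawDocument).replaceAll(" ");
            // Replace every remaining tag with a space so words stay separated.
            text = TAG.matcher(text).replaceAll(" ");
            // Decode a few common entities; "&amp;" goes last to avoid double decoding.
            text = text.replace("&nbsp;", " ")
                       .replace("&lt;", "<")
                       .replace("&gt;", ">")
                       .replace("&amp;", "&");
            return WHITESPACE.matcher(text).replaceAll(" ").trim();
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>UIMA</h1><p>Plain&nbsp;text only.</p></body></html>";
        String text = new NaiveHtmlExtractor().extract(html);
        // This plain string is what the analysis pipeline should receive,
        // rather than the raw markup.
        System.out.println(text);
    }
}
```

Keeping extraction behind a small interface means the naive implementation
can later be swapped for a Tika-backed one without touching the rest of the
pipeline.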

Jörn

On 9/29/11 8:28 AM, abhishek wrote:
> Hi,
> While reading the documentation of UIMA, I found out that UIMA supports HTML files.
>
> However, when I run the org.apache.uima.tools.docanalyzer.DocumentAnalyzer class, it fails to understand the text.
>
> Kindly let me know the correct way to read these types of files.

