Hi,
We use something like JODConverter, or a simple bridge to Microsoft Word or
OpenOffice, to convert the document (RTF or DOC/DOCX) to HTML. We then apply
the HTMLAnnotator and HTMLConverter of UIMA Ruta to get plain text with
annotations for the HTML tags. However, we do not have a (publicly available)
analysis engine for this complete process.
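
To illustrate the second step (HTML in, plain text plus tag annotations out),
here is a minimal sketch using only the Python standard library. It is not the
UIMA Ruta HTMLAnnotator/HTMLConverter, just a stand-in for the same idea; the
names TagSpanExtractor and html_to_annotated_text are made up for this example,
and it assumes the RTF has already been converted to HTML.

```python
from html.parser import HTMLParser


class TagSpanExtractor(HTMLParser):
    """Collects plain text and records (tag, begin, end) spans over it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []  # text chunks seen so far
        self.length = 0       # length of the plain text so far
        self.open_tags = []   # stack of (tag, begin offset)
        self.spans = []       # finished (tag, begin, end) annotations

    def handle_starttag(self, tag, attrs):
        # Remember where in the plain text this element starts.
        self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the most recent matching open tag; ignore stray end tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, begin = self.open_tags.pop(i)
                self.spans.append((tag, begin, self.length))
                break

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def html_to_annotated_text(html):
    """Return (plain_text, spans); offsets in spans index into plain_text."""
    parser = TagSpanExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.text_parts), parser.spans


if __name__ == "__main__":
    text, spans = html_to_annotated_text("<p>Dose: <b>5 mg</b> daily</p>")
    for tag, begin, end in spans:
        print(tag, begin, end, repr(text[begin:end]))
```

Elements that are never closed (for example a bare br tag) produce no span in
this sketch. In a UIMA pipeline the spans would become annotations over the
document text rather than tuples.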
Best,
Peter
On 01.09.2013 23:42, Dave Kincaid wrote:
> Before I embark on building an RTF annotator I thought I'd ask around a bit to
> see if anyone had built such a thing. Most of the documents I have to handle
> are in RTF format. I can pretty easily extract the text only using something
> like Apache Tika, but there is important information in the formatting as well
> (bold, italic, font sizes, centering, tables, etc.) that I'd like to use. Is
> anyone aware of a UIMA annotator that does this already?
>
> Thanks,
>
> Dave Kincaid
>