uima-dev mailing list archives

From "Julien Nioche" <lists.digitalpeb...@gmail.com>
Subject Re: Donate TikaAnnotator to Sandbox
Date Fri, 10 Oct 2008 13:10:55 GMT
Hi Thomas

> I gave it a first try. I have just used it and did not seriously look at the
> code yet. Here is some initial, unsorted user feedback:
> - Having a binary TIKA jar would speed things up (needed help to get that
> built)
> - It worked fine for me once I got the jar

The Tika jar will be available in the sandbox (if the TikaAnnotator ever
gets there). I can't put binaries in a diff file.

> - In my initial trial setup I added both the Tika CollectionReader and the
>  TIKA MarkupAnnotator to a CPE flow assuming that's what's needed. Only after
>  overcoming some confusion about the resulting CASes I realized that they are
>  intended to be used either/or. A word in the README may spare other people
>  the confusion.

Have added a line on this in the README

> - MarkupAnnotator.xml states <outputsNewCASes>true</outputsNewCASes>. CVD will
>  not show any results for annotators with that setting. And in fact the
>  annotator runs just fine with that setting changed to false. From what I
>  could see in the code it just creates a new view not a new CAS. But maybe I
>  am missing something here.

I changed it to false - can't remember why it was set to true in the first
place.

> - It returned reasonable results on a few HTML, MS-Word and PPT files I tried.
>  It silently refused to convert one PDF file (others worked). But I guess
>  these are just limitations of the current PDF parser.

do you get the extracted text at least?

> - As I understand it TIKA maps all document markup to the XHTML tagset. Since
>  that is a closed set it should be possible to use a more explicit typesystem
>  modeling, where the known XHTML elements like title, body, p etc. are
>  modeled as explicit subtypes instead of having only one generic type
>  MarkupAnnotation. Is that assumption correct?

Indeed, but I think there is no strong constraint on the names of the
elements and most of it relies on convention (at each parser's level). This
means that the values can change. People could develop alternative parsers, or
parsers for new formats, which would not follow the XHTML conventions. I
would rather not make too many assumptions as to what is returned by Tika
and return generic annotations.

>  Which typesystem representation to use depends on use case (and taste :-)
>  but finding and iterating over the different parts of the markup would be
>  easier with explicit types.

I suppose one could easily write a custom resource for converting the
annotation types returned by the TikaAnnotator into more explicit types if
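
To illustrate the conversion idea, here is a rough sketch in plain Java. It
does not use the UIMA or Tika APIs: GenericMarkup, TypedMarkup, Title,
Paragraph, Other and MarkupTypeMapper are all hypothetical names made up for
this example, and the element names "title" and "p" are only assumed to follow
the XHTML convention. Unknown names fall back to a generic type so that
nothing is lost when a parser returns something unexpected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins for illustration only; NOT UIMA or Tika classes.
final class MarkupTypeMapper {

    /** Generic annotation: one catch-all type carrying the element name. */
    static final class GenericMarkup {
        final String elementName;
        final int begin;
        final int end;

        GenericMarkup(String elementName, int begin, int end) {
            this.elementName = elementName;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Base class for the explicit types. */
    abstract static class TypedMarkup {
        final int begin;
        final int end;

        TypedMarkup(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    static final class Title extends TypedMarkup {
        Title(int begin, int end) { super(begin, end); }
    }

    static final class Paragraph extends TypedMarkup {
        Paragraph(int begin, int end) { super(begin, end); }
    }

    /** Fallback that keeps the original element name for unknown elements. */
    static final class Other extends TypedMarkup {
        final String elementName;

        Other(String elementName, int begin, int end) {
            super(begin, end);
            this.elementName = elementName;
        }
    }

    /** Maps one generic annotation to an explicit type, by element name. */
    static TypedMarkup convert(GenericMarkup m) {
        // Assumption: names are compared case-insensitively.
        switch (m.elementName.toLowerCase(Locale.ROOT)) {
            case "title":
                return new Title(m.begin, m.end);
            case "p":
                return new Paragraph(m.begin, m.end);
            default:
                return new Other(m.elementName, m.begin, m.end);
        }
    }

    /** Converts a whole list, preserving order. */
    static List<TypedMarkup> convertAll(List<GenericMarkup> generic) {
        List<TypedMarkup> typed = new ArrayList<>(generic.size());
        for (GenericMarkup m : generic) {
            typed.add(convert(m));
        }
        return typed;
    }
}
```

With something like this, the TikaAnnotator can keep returning generic
annotations, and the mapping to explicit types stays in a separate step that
each user can adapt to the parsers they actually use.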

Thank you for your feedback
DigitalPebble Ltd
