uima-dev mailing list archives

From Thomas Hampp <thomas.ha...@de.ibm.com>
Subject Re: Donate TIkaAnnotator to Sandbox
Date Fri, 10 Oct 2008 11:36:49 GMT
Julien Nioche <lists.digitalpebble@...> writes:

> Hi guys,
> Did anyone give https://issues.apache.org/jira/browse/UIMA-1095 a try? Any
> thoughts on it?
> Best,
> J.

Hi Julien,

Thanks for that contribution. I think that kind of functionality is important
for UIMA.

I gave it a first try. So far I have only used it and have not looked
seriously at the code yet. Here is some initial, unsorted user feedback:
- Having a binary TIKA jar would speed things up (I needed help to get it built)
- It worked fine for me once I got the jar
- In my initial trial setup I added both the Tika CollectionReader and the
  TIKA MarkupAnnotator to a CPE flow, assuming that's what's needed. Only after
  overcoming some confusion about the resulting CASes did I realize that they
  are intended to be used either/or. A word in the README may spare other
  people the confusion.
- MarkupAnnotator.xml states <outputsNewCASes>true</outputsNewCASes>. CVD will
  not show any results for annotators with that setting. And in fact the
  annotator runs just fine with that setting changed to false. From what I
  could see in the code, it just creates a new view, not a new CAS. But maybe I
  am missing something here.
- It returned reasonable results on a few HTML, MS-Word and PPT files I tried.
  It silently refused to convert one PDF file (others worked). But I guess
  these are just limitations of the current PDF parser.
- The typesystem does have the necessary information needed for further 
- As I understand it TIKA maps all document markup to the XHTML tagset. Since 
  that is a closed set it should be possible to use a more explicit typesystem 
  modeling, where the known XHTML elements like title, body, p etc. are 
  modeled as explicit subtypes instead of having only one generic type 
  MarkupAnnotation. Is that assumption correct?
  Which typesystem representation to use depends on use case (and taste :-) 
  but finding and iterating over the different parts of the markup would be 
  easier with explicit types.
- I think for document-level metadata attributes the situation is
  different since that set is open (but there may be a core set as well).
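
To illustrate the typesystem point above, here is a rough sketch in plain
Java (no UIMA classes involved). Everything in it is hypothetical: the
"name" field on MarkupAnnotation, the Title and Paragraph subtypes, and the
byName/byType helpers are made up for illustration and are not taken from
the contributed code. It only shows why selecting by type is a bit more
convenient than comparing an element-name string.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkupTypesSketch {

    // Generic modeling: a single type; the XHTML element name is kept
    // in a string field (assumed here, not checked against the contribution).
    static class MarkupAnnotation {
        final String name;
        final int begin;
        final int end;

        MarkupAnnotation(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }
    }

    // Explicit modeling: one (hypothetical) subtype per known XHTML element.
    static class Title extends MarkupAnnotation {
        Title(int begin, int end) { super("title", begin, end); }
    }

    static class Paragraph extends MarkupAnnotation {
        Paragraph(int begin, int end) { super("p", begin, end); }
    }

    // Generic approach: callers must know and compare element-name strings.
    static List<MarkupAnnotation> byName(List<MarkupAnnotation> all, String name) {
        List<MarkupAnnotation> out = new ArrayList<>();
        for (MarkupAnnotation a : all) {
            if (a.name.equals(name)) {
                out.add(a);
            }
        }
        return out;
    }

    // Explicit approach: callers select by type and get typed results back.
    static <T extends MarkupAnnotation> List<T> byType(List<MarkupAnnotation> all, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (MarkupAnnotation a : all) {
            if (type.isInstance(a)) {
                out.add(type.cast(a));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<MarkupAnnotation> annotations = new ArrayList<>();
        annotations.add(new Title(0, 5));
        annotations.add(new Paragraph(6, 20));
        annotations.add(new Paragraph(21, 40));

        System.out.println(byName(annotations, "p").size());            // prints 2
        System.out.println(byType(annotations, Paragraph.class).size()); // prints 2
    }
}
```

In a real typesystem the subtypes would be declared in the type system
descriptor rather than as hand-written classes; the sketch is only meant to
show the difference in how client code finds the parts of the markup.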

So much for my first impressions. Good work.

- Thomas