uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Question regarding the encoding of footnotes, marginal notes and images
Date Wed, 22 Mar 2017 20:45:06 GMT

Here are some thoughts.

* You have main-text, images, margin notes, and for the latter two, "position on
the page" information.

You should put the main-text into a sofa, like you say.

You may put the images and margin notes into either additional sofas or feature
structures in the main sofa.

The decision for where to put these depends on what kind of analysis you plan to
do with the images and margin notes.  They should be in sofas if you plan to run
some unstructured analytics annotators over them, for example some image
recognition or classification analytics.  But if you just need to keep these as
artifacts, with no particular kind of analytics for these parts, just put them
in additional feature structures in the main sofa.
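To make that rule of thumb concrete, here is a minimal sketch in plain Java. It is only a model of the decision, not UIMA code: the classes PageBox, Artifact and Document, the needsAnalytics flag, and the example ids and coordinates are all made up for illustration. In both cases the "position on the page" information stays attached to the artifact.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the two placements; all names are hypothetical, not UIMA types.
class PlacementSketch {

    // "Position on the page" for an image or margin note (units are an assumption).
    static final class PageBox {
        final int x, y, width, height;

        PageBox(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // One image or margin note produced by OCR / image segmentation.
    static final class Artifact {
        final String id;
        final PageBox box;
        final boolean needsAnalytics; // will some annotator run over its content?

        Artifact(String id, PageBox box, boolean needsAnalytics) {
            this.id = id;
            this.box = box;
            this.needsAnalytics = needsAnalytics;
        }
    }

    // Stands in for a CAS: artifacts given their own sofa vs. kept in the main sofa.
    static final class Document {
        final List<Artifact> ownSofa = new ArrayList<>();
        final List<Artifact> inMainSofa = new ArrayList<>();

        // Own sofa only when analytics will run over the artifact's content.
        void place(Artifact a) {
            if (a.needsAnalytics) {
                ownSofa.add(a);
            } else {
                inMainSofa.add(a);
            }
        }
    }

    static Document example() {
        Document doc = new Document();
        doc.place(new Artifact("image1", new PageBox(40, 120, 300, 200), true));
        doc.place(new Artifact("marginNote1", new PageBox(500, 130, 80, 60), false));
        return doc;
    }

    public static void main(String[] args) {
        Document doc = example();
        System.out.println("own sofa: " + doc.ownSofa.size());
        System.out.println("kept in main sofa: " + doc.inMainSofa.size());
    }
}
```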

Re:  can UIMA handle sofas with different kinds of data:  yes it can.  Each sofa
can be a text string or a byte array (local or remote); see:

Re: can annotations refer to feature structures in other sofas: yes they can.
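The two answers above can be pictured with a small plain-Java model: one text sofa for the main text, one text sofa for a margin note, one byte-array sofa for an image, and spans in the main text that point at spans over the other sofas. This is a sketch of the data layout only; the Sofa and Span classes, the sofa names and the sample text are invented for illustration and are not UIMA types. In real code the CAS methods to look at would be createView, setDocumentText, setSofaDataArray and setSofaDataURI, with the cross-sofa link held in a feature whose range type you define in your type system.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of cross-sofa references; all names are hypothetical, not UIMA types.
class CrossSofaSketch {

    // A sofa carries either a text string or a byte array, plus a MIME type.
    static final class Sofa {
        final String name;
        final String mimeType;
        final String text;  // non-null only for text sofas
        final byte[] bytes; // non-null only for binary sofas

        private Sofa(String name, String mimeType, String text, byte[] bytes) {
            this.name = name;
            this.mimeType = mimeType;
            this.text = text;
            this.bytes = bytes;
        }

        static Sofa ofText(String name, String text) {
            return new Sofa(name, "text/plain", text, null);
        }

        static Sofa ofBytes(String name, String mimeType, byte[] bytes) {
            return new Sofa(name, mimeType, null, bytes);
        }
    }

    // A span over one sofa that may point at a span over a different sofa.
    static final class Span {
        final Sofa sofa;
        final int begin;
        final int end;
        final Span target; // null, or a span that may belong to another sofa

        Span(Sofa sofa, int begin, int end, Span target) {
            this.sofa = sofa;
            this.begin = begin;
            this.end = end;
            this.target = target;
        }

        // Covered text for text sofas; null for binary sofas.
        String coveredText() {
            return sofa.text == null ? null : sofa.text.substring(begin, end);
        }
    }

    static List<Span> build() {
        Sofa main = Sofa.ofText("mainText", "See the note here and the plate there.");
        Sofa note = Sofa.ofText("marginNote1", "added in a later hand");
        Sofa image = Sofa.ofBytes("image1", "image/png", new byte[] {1, 2, 3, 4});

        Span wholeNote = new Span(note, 0, note.text.length(), null);
        Span wholeImage = new Span(image, 0, image.bytes.length, null);

        // Anchors live in the main text; their targets live in the other sofas.
        List<Span> anchors = new ArrayList<>();
        int here = main.text.indexOf("here");
        anchors.add(new Span(main, here, here + "here".length(), wholeNote));
        int there = main.text.indexOf("there");
        anchors.add(new Span(main, there, there + "there".length(), wholeImage));
        return anchors;
    }

    public static void main(String[] args) {
        for (Span anchor : build()) {
            System.out.println(anchor.coveredText() + " -> " + anchor.target.sofa.name
                    + " (" + anchor.target.sofa.mimeType + ")");
        }
    }
}
```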



On 3/22/2017 10:32 AM, Markus Krug wrote:
> Dear UIMA-users,
> we are currently facing the issue that the documents we are processing
> using UIMA contain more than just "linear text".
> On top of the text we have images and marginal notes that should be encoded
> at the correct positions (output of OCR and image segmentation).
> So far I do not know whether UIMA is capable of handling sofas with different
> types of material (e.g. text and images).
> We came up with a concept like this (please comment if this is stupid or
> if better ways to handle this have already been found):
> 1. Store the main text in the primary sofa
> 2. For each image/marginal note, use a different sofa and store the
> content in there
> 3. In the main text, refer to annotations in different sofas (is this
> possible? - I have never needed this before) at the corresponding position
> If there are any best practices for this kind of problem, I would be
> glad if you would let me know.
> Thanks in advance
> Markus Krug
