uima-user mailing list archives

From "Adam Lally" <ala...@alum.rpi.edu>
Subject Re: UIMA document loading strategy
Date Wed, 27 Jun 2007 13:43:11 GMT
On 6/25/07, Arthit Suriyawongkul <arthit@gmail.com> wrote:
> Hi,
>
> How does UIMA load a document into memory?
> Does it load the whole document at once, or does it read the document
> only partially (sometimes stream-like)?
>
> I'm currently using GATE and sometimes run into problems when my document
> is very large, because GATE tries to load the whole document into memory
> first and convert it to its own representation.
> My application doesn't need knowledge of the whole document (like a DOM);
> it only takes data from a small window (e.g. less than 100
> characters) at a time.
>
> cheers,
> Art
>

Hi Art,

UIMA is flexible with respect to this.  You can provide a
CollectionReader that populates a CAS with however much text is
appropriate for your application.  So a single document could be split
across many CASes in order to decrease the overall memory
requirements.
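
As a rough illustration of the chunking idea, here is a minimal sketch in
plain Java with no UIMA dependency. ChunkSource and readAll are hypothetical
names invented for this example, not UIMA classes. In a real CollectionReader
the same hasNext()/next() logic would sit behind hasNext() and getNext(CAS),
with each chunk passed to cas.setDocumentText(...).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper (not part of UIMA): hands out a document one
// fixed-size chunk at a time, so only one chunk is held in memory.
final class ChunkSource {
    private final Reader reader;
    private final char[] buffer;
    private String pending; // next chunk to return, or null at end of input

    ChunkSource(Reader reader, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.reader = reader;
        this.buffer = new char[chunkSize];
        this.pending = readChunk();
    }

    boolean hasNext() {
        return pending != null;
    }

    String next() throws IOException {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        String current = pending;
        pending = readChunk();
        return current;
    }

    // Fills the buffer as far as the input allows; returns null when empty.
    private String readChunk() throws IOException {
        int filled = 0;
        while (filled < buffer.length) {
            int n = reader.read(buffer, filled, buffer.length - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }
        return filled == 0 ? null : new String(buffer, 0, filled);
    }

    // Convenience for trying the class out on an in-memory string.
    static List<String> readAll(String text, int chunkSize) {
        try {
            ChunkSource source = new ChunkSource(new StringReader(text), chunkSize);
            List<String> chunks = new ArrayList<>();
            while (source.hasNext()) {
                chunks.add(source.next());
            }
            return chunks;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that cutting at a fixed character count can split a token in two. A
real reader would more likely cut at whitespace or a sentence boundary, or
let consecutive chunks overlap by the window size the application needs.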

It's also possible to split a CAS into smaller CASes, do annotation on
each, and then merge the results.  The kind of component that does the
split and merge is called a "CAS Multiplier".  There's an example of
this in the uimaj-examples project that comes with the download - see
descriptors/cas_multiplier/Segment_Annotate_Merge_AE.  This is
described in the "CAS Multiplier Developer's Guide" section of the
documentation.
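
The segment/annotate/merge pattern itself can be sketched without UIMA. The
code below is only an illustration of the idea, not the code from the
uimaj-examples project: the class and method names are made up, segments are
simply lines, and the "annotator" is a regular expression that marks runs of
digits. The point it shows is the merge step, where offsets found inside a
segment are shifted back into the coordinates of the original document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration (not UIMA code) of split, annotate, merge.
final class SegmentAnnotateMerge {
    // A span [begin, end) in the coordinates of the ORIGINAL document.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Stand-in "annotator": marks every run of digits.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static List<Span> annotate(String document) {
        List<Span> merged = new ArrayList<>();
        int segStart = 0;
        while (segStart < document.length()) {
            // Split: one segment per line, keeping the trailing newline.
            int newline = document.indexOf('\n', segStart);
            int segEnd = (newline < 0) ? document.length() : newline + 1;
            String segment = document.substring(segStart, segEnd);

            // Annotate: the matcher only ever sees the small segment.
            Matcher m = NUMBER.matcher(segment);
            while (m.find()) {
                // Merge: shift segment-relative offsets by the segment start.
                merged.add(new Span(segStart + m.start(), segStart + m.end()));
            }
            segStart = segEnd;
        }
        return merged;
    }
}
```

For example, annotate("a 12\nbb 345\n7") yields the spans [2,4), [8,11) and
[12,13), which index directly into the original string.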

Another option is to consider using a "remote Sofa" (Sofa = subject of
analysis).  In this case the CAS just contains a URL to where the
actual document lives, not the document text itself.  See the
"Annotations, Artifacts, and Sofas" section of the documentation.
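
To make the "reference instead of content" idea concrete, here is a small
plain-Java sketch. RemoteDocument and its methods are hypothetical names for
this example and are not UIMA classes. As far as I recall, the corresponding
UIMA calls are CAS.setSofaDataURI(uri, mimeType) to store the reference and
CAS.getSofaDataStream() to read the data, but please check the documentation
section above for the details. The sketch holds only a URI and streams the
content on demand, assuming UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration (not UIMA code): the object stores where the
// document lives, never the document text itself.
final class RemoteDocument {
    private final URI location;

    RemoteDocument(URI location) {
        this.location = location;
    }

    // Streams the document and counts one character, reading through a
    // buffered reader so the full text is never held in memory.
    long countChar(char target) {
        try (Reader in = new BufferedReader(new InputStreamReader(
                location.toURL().openStream(), StandardCharsets.UTF_8))) {
            long count = 0;
            int c;
            while ((c = in.read()) != -1) {
                if (c == target) {
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes the content to a temporary file and processes it
    // through its file: URI, as a stand-in for a genuinely remote document.
    static long demo(String content, char target) {
        try {
            Path file = Files.createTempFile("remote-doc", ".txt");
            try {
                Files.write(file, content.getBytes(StandardCharsets.UTF_8));
                return new RemoteDocument(file.toUri()).countChar(target);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```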

-Adam
