uima-user mailing list archives

From "Eddie Epstein" <eaepst...@gmail.com>
Subject Re: Content segmentation
Date Mon, 16 Jun 2008 21:33:49 GMT
Hi Yaakov,

> I wanted to find out if UIMA has any concept of content segmentation.
> Some of the analysis processing is very memory and CPU intensive and
> if the content happens to be huge (like a book), it will bring the
> server to a crawl.
> So, I was wondering if the UIMA framework has any notion of breaking
> up the content into smaller segments.

Content segmentation is a core concept in UIMA, with each CAS typically
considered to contain an "artifact" to be analyzed. Something has to segment
the input corpus into discrete artifacts.

In the most common scenario, a "collection reader" at the front of the UIMA
pipeline segments the input and initializes each CAS. For other scenarios,
the "CAS Multiplier", a more general segmentation component, is used to
create and initialize new CASes. A CAS Multiplier (CM) can be called at any
point in a UIMA pipeline; indeed, multiple CM components can be used in the
same pipeline.

Consider a scenario where a CM is given an input CAS with a pointer to a
large audio file. The CM could read the audio file, segment at boundaries
appropriate for subsequent analysis, and create new CASes with just the
audio content for each segment.
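
The segmentation step a CM performs can be sketched without any UIMA
dependency. The class below is only an illustration under stated
assumptions: it uses text rather than audio, and the names TextSegmenter,
segment and maxChars are hypothetical, not part of UIMA. In a real CM
(for example one extending JCasMultiplier_ImplBase), each returned segment
would become the content of a new CAS handed out by next().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a UIMA class: splits a large text into segments.
class TextSegmenter {

    /**
     * Splits text into consecutive segments of at most maxChars characters.
     * When a newline falls inside the current window, the cut is made just
     * after the last such newline; otherwise the cut is made at maxChars.
     */
    static List<String> segment(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> segments = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Look backwards from the window end for a newline boundary.
                int lastBreak = text.lastIndexOf('\n', end - 1);
                if (lastBreak >= start) {
                    end = lastBreak + 1; // keep the newline with its segment
                }
            }
            segments.add(text.substring(start, end));
            start = end; // end > start always holds, so the loop terminates
        }
        return segments;
    }

    public static void main(String[] args) {
        for (String s : segment("aaa\nbbb\nccc", 5)) {
            System.out.println("[" + s.trim() + "]");
        }
    }
}
```

Concatenating the segments in order reproduces the input, so downstream
results can be mapped back to offsets in the original artifact if the CM
also records where each segment started.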

Note that the artifact to be analyzed, called the Subject of analysis
(Sofa), does not have to reside in the CAS itself. UIMA supports the notion
of "remote Sofas", represented in the CAS by a URI. UIMA also provides stream
access methods for remote Sofa content, which in Java simply map to URI
stream reading.
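
As a rough illustration of what "URI stream reading" means in plain Java,
the sketch below opens a URI as a stream and reads it in fixed-size chunks,
so the whole artifact never has to sit in memory. The class and method
names (RemoteContentReader, countBytes) are made up for this example. In
UIMA itself, the corresponding CAS calls are, as far as I recall,
setSofaDataURI and getSofaDataStream; check the CAS Javadoc before relying
on that.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not a UIMA class: streams content referenced by a URI.
class RemoteContentReader {

    /** Reads the content behind the URI in 8 KB chunks and returns its size in bytes. */
    static long countBytes(URI uri) throws IOException {
        try (InputStream in = uri.toURL().openStream()) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n; // a real annotator would process the chunk here
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        // Use a temporary local file as a stand-in for a remote artifact.
        Path tmp = Files.createTempFile("sofa-demo", ".txt");
        try {
            Files.write(tmp, "hello".getBytes("UTF-8"));
            System.out.println(countBytes(tmp.toUri()) + " bytes");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```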

Hoping this actually addresses your question,
