incubator-general mailing list archives

From Ian Holsman <>
Subject Re: Proposal for a new incubation project: Unstructured Information Management Architecture - UIMA
Date Fri, 25 Aug 2006 22:08:42 GMT
Hi Thilo
your explanation attracted me ;-)

Is UIMA just an interface specification (i.e., a standard for the  
unstructured text-processing world so that other people can plug and  
play), or does UIMA also provide tools for each component?

I'm interested and, time permitting, could help as a mentor. I'm  
not a Java expert (compared to others on this list) or a  
text-processing expert, but I know a bit about the processes around  
the incubator.


On 26/08/2006, at 2:04 AM, Thilo Goetz wrote:

> Leo Simons wrote:
> <snip/>
>> What does it *do*? How does it *work*? I understand there's a  
>> runtime and
>> a framework and a standardization process and a component-based
>> interoperability goal, but what I don't understand is what they  
>> are *for*.
> The unstructured content we're talking about is mainly plain text  
> today.  There is also some work going on analyzing video streams,  
> as well as multi-modal streams (e.g., video + closed captioning).   
> I'm not really competent to talk about those, so I'll stick to  
> text.  A typical processing chain for text analysis starts out  
> something like this:
> "language identification" -> "language specific segmentation" ->  
> "sentence boundary detection" -> "entity detection (person/place  
> names etc.)" -> ...
> So you start by identifying the language the text is in (Chinese,  
> English etc.).  Then you do token segmentation based on that  
> information (it's completely different for Chinese than for  
> English).  Based on the tokens you discovered, you may want to do  
> sentence boundary detection, so you know what entities occur in the  
> same sentence.  Then, again based on the tokens you've found, you  
> can do so-called named entity detection, such as place names,  
> person names etc.  After that, you may have another module that can  
> discover relations between the entities that you have found.  And  
> so on.
> UIMA at its core is a component architecture that allows you to  
> create analysis applications like the one described above.  It  
> provides facilities for creating meta-information on documents, as  
> in the example above.  That is, the original artifact (i.e., the  
> text) is not modified and the derived information is kept separately.
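
A minimal sketch of the chain described above, written in plain Python rather than against UIMA itself. Every name here (Annotation, Document, analyze, and the four component functions) is hypothetical and chosen for illustration only; none of it is UIMA's API, and the detection rules are deliberately naive. It only shows the two ideas from the text: components run in sequence, each building on what earlier ones found, and the results are kept as separate annotations that point into the unmodified text.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """Stand-off metadata: a typed span of offsets into the original text."""
    kind: str
    begin: int
    end: int
    value: str = ""


@dataclass
class Document:
    """The original text plus the annotations derived from it."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        return [a for a in self.annotations if a.kind == kind]

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


def identify_language(doc):
    # Toy heuristic (assumption): any CJK ideograph means Chinese, else English.
    lang = "zh" if re.search(r"[\u4e00-\u9fff]", doc.text) else "en"
    doc.annotations.append(Annotation("language", 0, len(doc.text), lang))


def segment_tokens(doc):
    # Segmentation depends on the language found by the previous component.
    lang = doc.select("language")[0].value
    pattern = r"[\u4e00-\u9fff]" if lang == "zh" else r"\w+"
    for m in re.finditer(pattern, doc.text):
        doc.annotations.append(Annotation("token", m.start(), m.end()))


def detect_sentences(doc):
    # Naive rule: a sentence runs up to the next '.', '!' or '?'.
    for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", doc.text):
        doc.annotations.append(Annotation("sentence", m.start(), m.end()))


def detect_entities(doc):
    # Naive rule: every capitalized token is treated as a named entity.
    for tok in doc.select("token"):
        if doc.covered_text(tok)[:1].isupper():
            doc.annotations.append(Annotation("entity", tok.begin, tok.end))


def analyze(text, components):
    """Run each component in order over one shared Document."""
    doc = Document(text)
    for component in components:
        component(doc)
    return doc


PIPELINE = [identify_language, segment_tokens, detect_sentences, detect_entities]
doc = analyze("John lives in Paris. Mary lives in Rome.", PIPELINE)
print([doc.covered_text(a) for a in doc.select("entity")])
```

Because each component only reads and appends annotations, replacing one language identifier with another means swapping a single entry in PIPELINE; the text itself is never rewritten.
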
> UIMA is mostly a framework, not an application.  So it is not  
> concerned with fetching documents, like the crawler of a search  
> engine.  Nor does UIMA provide facilities to do very much with the  
> information you have extracted from the text (or other artifact).   
> Rather, the use case is that you have an application that has a  
> need for the processing of unstructured information.  This  
> application will provide the input data, and it will know what to  
> do with the results.  The value of UIMA derives from the component  
> model: it is easy to reuse existing analysis components that other  
> people have written, and it's easy to exchange, say, one language  
> identifier for another.
> One standard application scenario is to use UIMA to extract some  
> named entities from text, feed the results into a relational  
> database, and use the database's mining capabilities to do, e.g.,  
> association analysis. Another area of application is enhanced text  
> search, where in addition to regular free-form text search, you can  
> search for documents containing certain entities.  Trivial standard  
> example: you're looking for John's phone number in your email, so  
> you use semantic search to look for documents that contain John's  
> name and a phone number.  You'll use a UIMA component that knows  
> that a pattern like 123-456-7890 is a phone number and will create  
> a phone number entity.
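
The phone number example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a UIMA component: the function names annotate_phone_numbers and semantic_search, the sample emails, and the choice of a single NNN-NNN-NNNN regular expression are all hypothetical.

```python
import re

# Assumption: only the NNN-NNN-NNNN layout from the example counts as a phone number.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def annotate_phone_numbers(text):
    """Return (begin, end) offsets of every phone number found in the text."""
    return [(m.start(), m.end()) for m in PHONE.finditer(text)]


def semantic_search(documents, name):
    """Find documents that mention `name` and contain a phone number entity."""
    name_pattern = re.compile(rf"\b{re.escape(name)}\b")
    hits = []
    for doc_id, text in documents.items():
        phones = annotate_phone_numbers(text)
        if phones and name_pattern.search(text):
            hits.append((doc_id, [text[b:e] for b, e in phones]))
    return hits


emails = {
    "mail-1": "Call John at 123-456-7890 tomorrow.",
    "mail-2": "John says hi.",
    "mail-3": "Support line: 555-000-1111",
}
print(semantic_search(emails, "John"))
```

Only the first message matches, because it is the only one that contains both the name and a phone number entity; a plain free-text search for "John" would also have returned the second.
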
> I hope this gives you a better idea of what UIMA is about.
> --Thilo
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Ian Holsman
