incubator-general mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: Proposal for a new incubation project: Unstructured Information Management Architecture - UIMA
Date Fri, 25 Aug 2006 16:04:04 GMT
Leo Simons wrote:

<snip/>

> What does it *do*? How does it *work*? I understand there's a runtime and
> a framework and a standardization process and a component-based
> interoperability goal, but what I don't understand is what they are *for*.

The unstructured content we're talking about is mainly plain text today.
There is also some work going on analyzing video streams, as well as
multi-modal streams (e.g., video + closed captioning).  I'm not really
competent to talk about those, so I'll stick to text.  A typical
processing chain for text analysis starts out something like this:

"language identification" -> "language specific segmentation" -> 
"sentence boundary detection" -> "entity detection (person/place names 
etc.)" -> ...

So you start by identifying the language the text is in (Chinese, 
English etc.).  Then you do token segmentation based on that information 
(it's completely different for Chinese than for English).  Based on the 
tokens you discovered, you may want to do sentence boundary detection,
so you know which entities occur in the same sentence.  Then, again
based on the tokens you've found, you can do so-called named entity
detection, which finds place names, person names, etc.  After that, you
may have another
module that can discover relations between the entities that you have 
found.  And so on.
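
As a rough illustration, the chain above might be sketched in Python as
follows.  This is not UIMA code: every function name, the CJK-based
language check, and the tiny gazetteer are assumptions made up for the
example, and real components are far more sophisticated.

```python
import re

# Hypothetical lookup table standing in for a real entity detector.
GAZETTEER = {"John": "person", "Paris": "place"}
SENTENCE_END = {".", "!", "?", "。"}


def identify_language(text):
    # Toy heuristic: any CJK ideograph means Chinese, otherwise English.
    return "zh" if re.search(r"[\u4e00-\u9fff]", text) else "en"


def segment(text, language):
    # Language-specific segmentation, returned as (begin, end) offsets:
    # one token per ideograph for Chinese, runs of word characters otherwise.
    pattern = r"[\u4e00-\u9fff]" if language == "zh" else r"\w+"
    return [m.span() for m in re.finditer(pattern, text)]


def detect_sentences(text, tokens):
    # A sentence ends after a token that is directly followed by
    # sentence-final punctuation.
    sentences, start = [], None
    for begin, end in tokens:
        if start is None:
            start = begin
        if text[end:end + 1] in SENTENCE_END:
            sentences.append((start, end + 1))
            start = None
    if start is not None:
        sentences.append((start, tokens[-1][1]))
    return sentences


def detect_entities(text, tokens):
    # Tokens found in the gazetteer become typed entities.
    return [(b, e, GAZETTEER[text[b:e]]) for b, e in tokens
            if text[b:e] in GAZETTEER]


def analyze(text):
    # Each step consumes the results of the steps before it.
    language = identify_language(text)
    tokens = segment(text, language)
    return {
        "language": language,
        "tokens": tokens,
        "sentences": detect_sentences(text, tokens),
        "entities": detect_entities(text, tokens),
    }


result = analyze("John lives in Paris. He likes it.")
print(result["entities"])
```

Note that both later steps work from the token offsets rather than
re-scanning the text, which mirrors the dependency between the stages
described above.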

UIMA is at its core a component architecture that allows you to create
analysis applications like the one described above.  It provides
facilities for attaching meta-information to documents, such as the
language, tokens, sentences and entities in the example above.  The
original artifact (i.e., the text) is not modified, and the derived
information is kept separately.
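
To make the idea of keeping derived information separate from the
artifact concrete, here is a minimal sketch of such "stand-off"
annotations.  The class and method names (Annotation, AnnotatedDocument,
annotate, select, covered_text) are hypothetical and chosen for this
example; they are not UIMA's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    # Offsets into the original text plus a type label.
    begin: int
    end: int
    type: str


@dataclass
class AnnotatedDocument:
    # The text is only ever read; all results go into the separate list.
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, begin, end, type):
        self.annotations.append(Annotation(begin, end, type))

    def select(self, type):
        return [a for a in self.annotations if a.type == type]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


doc = AnnotatedDocument("Thilo wrote to the list.")
doc.annotate(0, 5, "person")
doc.annotate(0, 24, "sentence")

for person in doc.select("person"):
    print(doc.covered_text(person))
```

Because annotations only point into the text, several components can
add overlapping results (a person inside a sentence, say) without
interfering with each other.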

UIMA is mostly a framework, not an application.  So it is not concerned 
with fetching documents, like the crawler of a search engine.  Nor does 
UIMA provide facilities to do very much with the information you have 
extracted from the text (or other artifact).  Rather, the use case is
an application that needs to process unstructured information.  This
application will provide the input data,
and it will know what to do with the results.  The value of UIMA derives 
from the component model: it is easy to reuse existing analysis 
components that other people have written, and it's easy to exchange, 
say, one language identifier for another.
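
The point about exchanging components can be illustrated with a small
sketch in which two language identifiers share the same calling
convention, so the surrounding application does not care which one it
is given.  Both identifiers and the run() function are invented for this
example and use deliberately naive heuristics.

```python
import re

# Tiny stop-word lists, assumed for illustration only.
STOPWORDS = {
    "en": {"the", "and", "of"},
    "de": {"der", "die", "und"},
}


def identify_by_script(text):
    # Component A: looks only at the script (CJK ideographs or not).
    return "zh" if re.search(r"[\u4e00-\u9fff]", text) else "en"


def identify_by_stopwords(text):
    # Component B: picks the language whose stop words occur most often.
    words = text.lower().split()
    counts = {lang: sum(w in stop for w in words)
              for lang, stop in STOPWORDS.items()}
    return max(counts, key=counts.get)


def run(text, language_identifier):
    # The application depends only on "text in, language code out".
    return {"language": language_identifier(text),
            "token_count": len(text.split())}


text = "der Hund und die Katze"
print(run(text, identify_by_script))
print(run(text, identify_by_stopwords))
```

Swapping the identifier changes the result but not the application
code, which is the kind of reuse the component model is meant to make
easy.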

One standard application scenario is to use UIMA to extract some named 
entities from text, feed the results into a relational database, and use 
the database's mining capabilities to do, e.g., association analysis. 
Another area of application is enhanced text search, where in addition 
to regular free-form text search, you can search for documents 
containing certain entities.  Trivial standard example: you're looking 
for John's phone number in your email, so you use semantic search to 
look for documents that contain John's name and a phone number.  You'll
use a UIMA component that knows that a pattern like 123-456-7890 is a
phone number and will create a phone number entity.
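
A minimal sketch of the phone number example, assuming the only format
of interest is the 123-456-7890 pattern mentioned above.  The regular
expression, the function names and the sample emails are all made up
for illustration; a real annotator would handle many more formats.

```python
import re

# Matches exactly the 3-3-4 digit pattern from the example.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def annotate_phone_numbers(text):
    # Create one (begin, end, type) entity per match; the text is untouched.
    return [(m.start(), m.end(), "phone_number")
            for m in PHONE_PATTERN.finditer(text)]


def semantic_search(documents, name):
    # Keep documents that mention the name AND contain a phone number entity.
    return [doc for doc in documents
            if name in doc and annotate_phone_numbers(doc)]


emails = [
    "John's new number is 123-456-7890.",
    "Lunch with John on Friday?",
    "Call the help desk at 555-010-9999.",
]

for hit in semantic_search(emails, "John"):
    begin, end, _ = annotate_phone_numbers(hit)[0]
    print(hit[begin:end])
```

A plain keyword search for "John" would also return the second email;
the phone number entity is what narrows the result to the first one.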

I hope this gives you a better idea what UIMA is about.

--Thilo

