incubator-general mailing list archives

From Ian Holsman <li...@holsman.net>
Subject Re: Proposal for a new incubation project: Unstructured Information Management Architecture - UIMA
Date Fri, 25 Aug 2006 22:08:42 GMT
Hi Thilo,
Your explanation caught my interest ;-)

Is UIMA only the interface specification (i.e., a standard for the
unstructured text-processing world so that other people can plug and
play), or does UIMA also provide tools for each component?

I'm interested and, time permitting, could help as a mentor. I'm
not a Java expert (compared to others on this list), or a text
processing expert, but I know a bit about the processes around the
incubator.

regards
Ian


On 26/08/2006, at 2:04 AM, Thilo Goetz wrote:

> Leo Simons wrote:
>
> <snip/>
>
>> What does it *do*? How does it *work*? I understand there's a  
>> runtime and
>> a framework and a standardization process and a component-based
>> interoperability goal, but what I don't understand is what they  
>> are *for*.
>
> The unstructured content we're talking about is mainly plain text  
> today.  There is also some work going on analyzing video streams,  
> as well as multi-modal streams (e.g., video + closed captioning).   
> I'm not really competent to talk about those, so I'll stick to  
> text.  A typical processing chain for text analysis starts out  
> something like this:
>
> "language identification" -> "language specific segmentation" ->  
> "sentence boundary detection" -> "entity detection (person/place  
> names etc.)" -> ...
>
> So you start by identifying the language the text is in (Chinese,  
> English etc.).  Then you do token segmentation based on that  
> information (it's completely different for Chinese than for  
> English).  Based on the tokens you discovered, you may want to do
> sentence boundary detection, so you know which entities occur in the
> same sentence.  Then, again based on the tokens you've found, you
> can do so-called named entity detection, which finds place names,
> person names etc.  After that, you may have another module that can
> discover relations between the entities that you have found.  And  
> so on.
>
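
As a rough sketch of such a chain (Python, with made-up function names and toy heuristics; this is not the UIMA API), each step could read the text plus whatever the earlier steps produced:

```python
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not part of UIMA.
Analysis = Dict[str, object]
Step = Callable[[str, Analysis], None]


def identify_language(text: str, analysis: Analysis) -> None:
    # Toy heuristic: any CJK ideograph means "zh", otherwise "en".
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    analysis["language"] = "zh" if has_cjk else "en"


def segment_tokens(text: str, analysis: Analysis) -> None:
    # Segmentation depends on the language found by the previous step.
    if analysis["language"] == "zh":
        analysis["tokens"] = [ch for ch in text if not ch.isspace()]
    else:
        analysis["tokens"] = text.split()


def detect_sentences(text: str, analysis: Analysis) -> None:
    # Toy rule: a token ending in ".", "!" or "?" closes a sentence.
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in analysis["tokens"]:
        current.append(token)
        if token[-1] in ".!?":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    analysis["sentences"] = sentences


def run_pipeline(text: str, steps: List[Step]) -> Analysis:
    # Run the steps in order; each one adds its results to the analysis.
    analysis: Analysis = {}
    for step in steps:
        step(text, analysis)
    return analysis


result = run_pipeline(
    "John lives in Paris. He likes it.",
    [identify_language, segment_tokens, detect_sentences],
)
```

The order of the list matters here: segmentation needs the language, and sentence detection needs the tokens.
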
> UIMA at its core is a component architecture that allows you to
> create analysis applications like the one described above.  It  
> provides facilities for creating meta-information on documents like  
> in the example above.  That is, the original artifact (i.e., the  
> text) is not modified and the derived information is kept separately.
>
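
One way to picture "derived information kept separately" is a list of offset-based annotations next to the untouched text. The sketch below is only an illustration under that assumption; the `Annotation` class and its field names are invented here and are not UIMA types:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    # Hypothetical record: character offsets into the text plus a type label.
    begin: int
    end: int
    type: str


def covered_text(text: str, annotation: Annotation) -> str:
    # Recover the annotated span from the original, unmodified text.
    return text[annotation.begin:annotation.end]


text = "John lives in Paris."

# The meta-information lives beside the text and never changes it.
annotations: List[Annotation] = [
    Annotation(0, 4, "Person"),
    Annotation(14, 19, "Place"),
]

spans = [(a.type, covered_text(text, a)) for a in annotations]
```

Because the annotations only point into the text, several components can add their own results without stepping on each other or on the original document.
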
> UIMA is mostly a framework, not an application.  So it is not  
> concerned with fetching documents, like the crawler of a search  
> engine.  Nor does UIMA provide facilities to do very much with the  
> information you have extracted from the text (or other artifact).   
> Rather, the use case is that you have an application that has a  
> need for the processing of unstructured information.  This  
> application will provide the input data, and it will know what to  
> do with the results.  The value of UIMA derives from the component  
> model: it is easy to reuse existing analysis components that other  
> people have written, and it's easy to exchange, say, one language  
> identifier for another.
>
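
To illustrate the "exchange one language identifier for another" point, here is a small sketch in which the application depends only on an interface. All names (`LanguageIdentifier`, `analyze`, both implementations) are hypothetical and the heuristics are deliberately crude:

```python
from typing import Dict, Protocol


class LanguageIdentifier(Protocol):
    # Hypothetical component interface; anything with identify() fits.
    def identify(self, text: str) -> str: ...


class AsciiHeuristicIdentifier:
    def identify(self, text: str) -> str:
        # Crude guess: pure ASCII text is treated as English.
        return "en" if text.isascii() else "unknown"


class StopwordIdentifier:
    STOPWORDS = {
        "en": {"the", "and", "is"},
        "de": {"der", "und", "ist"},
    }

    def identify(self, text: str) -> str:
        # Pick the language whose stopwords overlap most with the text.
        words = set(text.lower().split())
        best = max(self.STOPWORDS, key=lambda lang: len(words & self.STOPWORDS[lang]))
        return best if words & self.STOPWORDS[best] else "unknown"


def analyze(text: str, identifier: LanguageIdentifier) -> Dict[str, object]:
    # The application supplies the input and decides what to do with the result;
    # it does not care which identifier implementation it was given.
    return {"language": identifier.identify(text), "length": len(text)}


first = analyze("Der Hund ist müde", AsciiHeuristicIdentifier())
second = analyze("Der Hund ist müde", StopwordIdentifier())
```

Swapping the component is a one-argument change in the calling application; nothing else has to be touched.
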
> One standard application scenario is to use UIMA to extract some  
> named entities from text, feed the results into a relational  
> database, and use the database's mining capabilities to do, e.g.,  
> association analysis. Another area of application is enhanced text  
> search, where in addition to regular free-form text search, you can  
> search for documents containing certain entities.  Trivial standard  
> example: you're looking for John's phone number in your email, so  
> you use semantic search to look for documents that contain John's  
> name and a phone number.  You'll use a UIMA component that knows  
> that a pattern 123-456-7890 is a phone number and will create a  
> phone number entity.
>
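
The phone number example could look roughly like this. It is a sketch that assumes only the 123-456-7890 pattern mentioned above; the function names, the "PhoneNumber" label, and the sample emails are made up for illustration:

```python
import re
from typing import Dict, List, Tuple

# Matches only the 123-456-7890 shape; real phone formats vary far more.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def find_phone_numbers(text: str) -> List[Tuple[int, int, str]]:
    # Return (begin, end, type) entries for each match; the text is not changed.
    return [(m.start(), m.end(), "PhoneNumber") for m in PHONE_PATTERN.finditer(text)]


def search(documents: Dict[str, str], name: str) -> List[str]:
    # Keep documents that mention the name and contain at least one phone number.
    return [
        doc_id
        for doc_id, text in documents.items()
        if name in text and find_phone_numbers(text)
    ]


emails = {
    "a": "John's number is 123-456-7890.",
    "b": "Lunch with John tomorrow?",
    "c": "Call 555-123-4567 about the invoice.",
}

hits = search(emails, "John")
```

With these sample emails, only the first one mentions John and also contains a phone number, so it is the only hit.
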
> I hope this gives you a better idea of what UIMA is about.
>
> --Thilo
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

--
Ian Holsman
Ian@Zilbo.com
http://personalinjuryfocus.com/





