incubator-general mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Proposal for a new incubation project: Unstructured Information Management Architecture - UIMA
Date Fri, 25 Aug 2006 17:25:11 GMT
Hi Leo,

Here's a response to your good questions; apologies to you and others 
(David Welton also commented) that we were not clearer, initially.

 >
 > <snip> I understand there's a runtime and
 > a framework and a standardization process and a component-based
 > interoperability goal, but what I don't understand is what they are *for*.
 >
 > <snip> outline what problem this UIMA thing is meant to solve

We want to make it easy for the people who write software that analyzes 
unstructured information to put their pieces together.  Here are some 
examples of the kind of software we mean: 

* code that works with text and identifies words or phrases as 
particular kinds of entities, such as persons, places, organizations, 
chemical names, times, dates, or telephone numbers, or that detects 
sentiment (e.g. likes, hates), etc.

* code that works with audio samples and extracts text (think of speech 
recognition)

* code that translates text in one language to another

* code that finds similarities among images, or computes a similarity 
score for images

Today, these components are typically bundled up inside various 
solutions.  Our goal is to make it possible to put these parts together 
into interesting new kinds of applications.  For instance:

* an application that takes several approaches to speech recognition, or 
machine translation, and combines them (in the hope of getting a better 
quality result)

* a search application that wants to search for "concepts" as well as 
key-words, where these concepts have been added by pre-processing the 
data being searched with a set of these components.  You might imagine 
the set of components used would depend on the kind of searching - if 
you were working with genetics, you might have textual information about 
"genes" identified, or you might be searching for a kind of image, along 
with some text.

* a "business intelligence" application (not an oxymoron :-), and sorry 
for the buzzword) that looks for trends in various kinds of messages, 
using components that might tag texts with concepts like "positive 
remark" or "negative remark".

---------------

 > What does it *do*?

It runs components that are written to conform to its architecture 
(remote or local, in several computer languages) in a flow.  The flow is 
configured in files that are kept separate from the components 
themselves. 
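
As a rough illustration of a flow that is configured outside the 
components, here is a minimal Python sketch.  It is not the UIMA API: 
the component names, the registry, and the JSON layout are all 
assumptions made up for this example, and a JSON string stands in for a 
configuration file.

```python
import json

# Hypothetical components: each takes the shared data and returns it.
def tokenizer(data):
    data["tokens"] = data["text"].split()
    return data

def token_counter(data):
    data["token_count"] = len(data.get("tokens", []))
    return data

# Maps the names used in the configuration to component code.
REGISTRY = {"tokenizer": tokenizer, "token_counter": token_counter}

# The flow is described outside the components (assumed JSON layout).
FLOW_CONFIG = '{"flow": ["tokenizer", "token_counter"]}'

def run_flow(config_text, data):
    """Run the components in the order the configuration lists them."""
    for name in json.loads(config_text)["flow"]:
        data = REGISTRY[name](data)
    return data

result = run_flow(FLOW_CONFIG, {"text": "UIMA runs components in a flow"})
print(result["token_count"])
```

Reordering or swapping components then means editing the configuration 
rather than the component code.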

 > How does it *work*?
 > <snip> outline what the approach is to solving that problem

Components work with one another by adopting a model where data is 
passed from component to component.  Each component examines the data, 
applies its particular analysis capability, and adds its results to the 
data.  Components are required to supply descriptive information, which 
UIMA uses in its common tooling and at run time.  UIMA also provides the 
common services needed by components and by solutions built from them.
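
To make the "examine the data and add to it" model concrete, here is a 
minimal Python sketch.  It is only an illustration, not UIMA's actual 
interfaces: the Document and Annotation classes, the two annotators, and 
the regular expressions are hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Annotation:
    kind: str    # e.g. "Date"
    begin: int   # start offset in the text
    end: int     # end offset (exclusive)

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]

# Each hypothetical component reads the document and only adds to it.
def date_annotator(doc):
    for m in re.finditer(r"\b\d{4}-\d{2}-\d{2}\b", doc.text):
        doc.annotations.append(Annotation("Date", m.start(), m.end()))

def phone_annotator(doc):
    for m in re.finditer(r"\b\d{3}-\d{4}\b", doc.text):
        doc.annotations.append(Annotation("PhoneNumber", m.start(), m.end()))

doc = Document("Call 555-0142 before 2006-08-25.")
for component in (date_annotator, phone_annotator):
    component(doc)

found = [(a.kind, doc.covered_text(a)) for a in doc.annotations]
print(found)
```

Because the components never modify the original text and only append 
their results, a later component can build on what earlier ones found.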

---------------

 > <snip> outline how this turns into software

The data passed from part to part is described with a single-inheritance 
type system.  Support is provided for parts written in Java, C++, and 
some scripting languages (Perl, Python, Tcl).  A variety of tradeoffs in 
programming styles are supported, from Java-centric object-oriented 
styles to styles that go for very high performance and avoid creating 
Java "objects".  Common services to process very large collections of 
things through the flow of components, with pragmatic error handling and 
recovery, are provided.
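
A single-inheritance type system can be pictured as a tree in which 
every type has exactly one parent.  The Python sketch below shows the 
idea only; the TypeSystem class and the type names ("Top", "Annotation", 
"NamedEntity", and so on) are assumptions for illustration, not UIMA's 
actual classes.

```python
class TypeSystem:
    """Toy single-inheritance type system: one parent per type."""

    def __init__(self):
        self._parent = {"Top": None}  # "Top" is the assumed root type

    def add_type(self, name, parent="Top"):
        if parent not in self._parent:
            raise ValueError(f"unknown parent type: {parent}")
        if name in self._parent:
            raise ValueError(f"type already defined: {name}")
        self._parent[name] = parent

    def is_subtype(self, name, ancestor):
        # Walk the single chain of parents up to the root.
        current = name
        while current is not None:
            if current == ancestor:
                return True
            current = self._parent[current]
        return False

ts = TypeSystem()
ts.add_type("Annotation")
ts.add_type("NamedEntity", parent="Annotation")
ts.add_type("Person", parent="NamedEntity")
ts.add_type("Place", parent="NamedEntity")

print(ts.is_subtype("Person", "Annotation"))  # True
print(ts.is_subtype("Person", "Place"))       # False
```

With one parent per type, a component that asks for every "NamedEntity" 
also receives the "Person" and "Place" results produced by other 
components.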

---------------

 > <snip> give an example or even two of such software in use in the real world to
 >    solve some kind of tangible problem

As part of a company's monitoring of its outgoing email (say, as part 
of efforts to comply with the Sarbanes-Oxley Act), it could deploy 
components that detect a variety of named entities (persons, places, 
organizations, etc.) and relationships among these.  The company could 
deploy commercially available components, plus some customized for its 
particular domain. (As an example, a recent announcement of some 
commercially available components can be seen at 
http://be.sys-con.com/read/262873.htm )

Another example might involve an insurance brokerage whose employees 
need access to information from insurance policies, notes from field 
adjusters, emails from customers, etc. To enable more focussed 
searching, these data could be augmented with the results of 
unstructured information analysis, and the resulting "structured" 
information could be used in searching.  The company might want to 
integrate commercial, more generic named entity detectors with specific 
recognizers for its particular needs.  Search could be performed with 
engines capable of searching for traditional key-words, for concepts 
(added by the various UIMA parts), and for key-words contained within 
the span of particular concepts.  [Note: these search engines already 
exist].  In addition to search, the broker might process this 
information, looking for early indications of potential issues by 
noticing trends in the various kinds of things being reported. 
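
Here is a small Python sketch of what "key-words contained within the 
span of particular concepts" could mean in practice.  The sample text, 
the concept names, and the helper functions are invented for 
illustration; a real search engine would work from an index rather than 
scanning the text.

```python
text = ("Water damage reported at the Smith residence. "
        "Claim filed by Dana Smith.")

def span_of(phrase, kind):
    """Build a (kind, begin, end) concept span for a phrase in the text."""
    begin = text.index(phrase)
    return (kind, begin, begin + len(phrase))

# Stand-in for what the analysis components would have added.
concepts = [
    span_of("Water damage", "DamageType"),
    span_of("the Smith residence", "Location"),
    span_of("Dana Smith", "Person"),
]

def keyword_in_concept(text, concepts, keyword, kind):
    """Return concept spans of the given kind that contain the keyword."""
    hits = []
    for concept_kind, begin, end in concepts:
        covered = text[begin:end]
        if concept_kind == kind and keyword.lower() in covered.lower():
            hits.append(covered)
    return hits

plain_hits = text.lower().count("smith")
person_hits = keyword_in_concept(text, concepts, "smith", "Person")
print(plain_hits, person_hits)
```

A plain key-word search for "smith" matches both sentences, while 
restricting the search to the "Person" concept returns only the 
claimant.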

---------------------

We're not trying to re-invent the semantic web movement.  However, we 
think UIMA might enable some aspects of it, by allowing a rich set of 
unstructured information analysis components to flourish within the 
community.

-Marshall

Leo Simons wrote:
> Hi Marshall!
>
> I'm sure all this is potentially interesting, but you're going to have
> to help us understand why.
>
> On Wed, Aug 23, 2006 at 03:21:55PM -0400, Marshall Schor wrote:
>   
>> Proposal for Incubation Project: Unstructured Information Management 
>> Architecture - UIMA
>>
>> The Unstructured Information Management Architecture (UIMA) is an 
>> architecture and software framework for creating, discovering, composing 
>> and deploying a broad range of multi-modal analysis capabilities.  We 
>> propose a project to develop, implement, support and enhance UIMA 
>> framework implementations that comply with the UIMA standard (being put 
>> forward concurrently for standardization within OASIS 
>> http://www.oasis-open.org - not yet submitted, but we plan to do this 
>> early in September.). 
>>     
> <snip/>
>   
>> Motivation for UIMA: Databases are core components of nearly all 
>> applications; they store information in structured tables.  But more and 
>> more of the available digital data is unstructured (e.g. email, web 
>> documents, images, audio clips, video streams) with little information 
>> (metadata) attached to explain its content or context.  Although many 
>> applications have been built to process unstructured data, they have 
>> either managed it as a BLOB or they have developed isolated applications 
>> for analyzing the content.  In the absence of a standardized means for 
>> analytical applications to share insights extracted from the content, 
>> analytical applications cannot build upon one another. As a result, the 
>> industry has barely begun to tap the value locked in unstructured 
>> information.
>>     
> <snip/>
>
> What does it *do*? How does it *work*? I understand there's a runtime and
> a framework and a standardization process and a component-based
> interoperability goal, but what I don't understand is what they are *for*.
>
> Can you please write a paragraph or two, that
>
> 1) doesn't mention "what the industry is doing" or needs to do
> 2) doesn't mention frameworks, standards, or current problematic
>    industry practices, SOA, SOAP, DARPA, OASIS, or other acronyms
> 3) outlines what problem this UIMA thing is meant to solve
> 4) outlines what the approach is to solving that problem
> 5) outlines how this turns into software
> 6) gives an example or even two of such software in use in the real world to
>    solve some kind of tangible problem
>
> For example, one kind of "unstructured information" is "the web", and one
> way to process that is "as plain text, indexing it, create a keyword-based search
> engine", and then there's also fancier ways such as all the things that google
> does. And then there's also various ways to make the unstructured mess that is the
> web more structured by attaching metadata, eg dublin core metadata or the whole
> semantic web thing, so right now I might walk away with the understanding that
> you're devising a way for google and yahoo to interop (which I doubt they really
> want) by re-inventing the semantic web movement (which I doubt is really
> productive). Enlighten me, please. If it helps, imagine I'm 12 and write PHP and 
> have difficulty with words such as interoperability since English is not my first
> language.
>
> cheers,
>
> LSD
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>
>
>   

