Date: Tue, 23 Apr 2013 11:23:31 -0400
From: Marshall Schor <msa@schor.com>
To: user@uima.apache.org
Subject: Re: Using PEAR in a application based on Uima framework

Combining annotators written by different groups at different times does
require some level of conformity in the type system, or the creation of type
converters or mappers.

Some use cases:  Let's presume you have some annotators that work off of
"tokens".  Let's now presume you have an upstream annotator that produces
tokens.  Let's now say you build a system using both of these.  If the tokenizer
produces annotations of type x.y.z.Token (I'm fully qualifying the name of the
type), then the user of these tokens would need to iterate over the type
"x.y.z.Token" to get the tokens to work on.

If you later find a better tokenizer (let's imagine it properly handles tokens
for non-western languages, or handles multi-word tokens, etc.), there are two
sub-cases.  In the first case, that tokenizer also produces tokens of type
x.y.z.Token (unlikely if it comes from another source, but likely if it is
version 2 of the original tokenizer), and you can just "plug it in".  This is
true even if the new definition of x.y.z.Token adds some new features (which
your downstream annotator may not define), such as "multi_word".  This is OK
because UIMA, before starting processing, collects all the type systems and
merges the types - so that if annotator 1 defines type x.y.z.Token as having a
"multi_word" feature, but annotator 2 doesn't, then the merged type definition
will in any case have a slot for that feature.
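
To make the "union" idea concrete, here is a rough sketch in plain Java.  It is
only a conceptual model and not the UIMA API: the class name TypeMergeSketch,
the representation of a type system as a map from type name to feature names,
and the "begin"/"end" features are all assumptions made for illustration.  It
also ignores things a real merge has to care about, such as feature range types
and supertypes.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual model only: this is NOT the UIMA API.
final class TypeMergeSketch {

    // Each "type system" maps a fully qualified type name to its feature names.
    // The merged result holds, for every type, the union of all declared features.
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> typeSystems) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (Map<String, Set<String>> typeSystem : typeSystems) {
            for (Map.Entry<String, Set<String>> entry : typeSystem.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new LinkedHashSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Annotator 1 (the new tokenizer) declares the extra "multi_word" feature.
        Map<String, Set<String>> tokenizerV2 =
                Map.of("x.y.z.Token", Set.of("begin", "end", "multi_word"));
        // Annotator 2 (the downstream user) only knows about the older definition.
        Map<String, Set<String>> downstream =
                Map.of("x.y.z.Token", Set.of("begin", "end"));

        // The merged x.y.z.Token has a slot for "multi_word" either way.
        System.out.println(merge(List.of(tokenizerV2, downstream)));
    }
}
```

The downstream annotator never has to mention "multi_word"; it simply sees a
Token type that happens to carry one more feature than it asked for.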

On the other hand, if the new tokenizer is from another company, and it
produces token annotations of type a.b.c.TTT, then your downstream annotator,
which is looking for tokens of type x.y.z.Token, won't find any.  There, you
have to either rewrite your annotator to use the new kind of token, or insert
some kind of type-mapping annotator in between.  Sometimes the type mapping can
be trivial, and other times it can be arbitrarily complex.
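
For the trivial end of that range, the mapping step might look roughly like the
sketch below.  Again this is plain Java standing in for the real thing, not
UIMA code: the Ttt and Token classes and their fields (start/length versus
begin/end) are hypothetical, chosen only to show a one-to-one conversion.  A
real mapping annotator would read the a.b.c.TTT annotations from the CAS and
add x.y.z.Token annotations to it.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model only: simple classes stand in for CAS annotations.
final class TokenMapperSketch {

    // Hypothetical stand-in for the other vendor's a.b.c.TTT annotation.
    static final class Ttt {
        final int start;
        final int length;

        Ttt(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    // Hypothetical stand-in for x.y.z.Token, which the downstream annotator expects.
    static final class Token {
        final int begin;
        final int end;

        Token(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "Token(" + begin + "," + end + ")";
        }
    }

    // The trivial case: every TTT becomes exactly one Token, and only the
    // offset convention (start + length versus begin/end) has to be converted.
    static List<Token> map(List<Ttt> source) {
        List<Token> tokens = new ArrayList<>();
        for (Ttt t : source) {
            tokens.add(new Token(t.start, t.start + t.length));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(map(List.of(new Ttt(0, 5), new Ttt(6, 5))));
    }
}
```

The complex cases are the ones where there is no one-to-one correspondence,
for example when one tokenizer splits text where the other does not.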

When UIMA was first being conceived, there was some thought given to trying to
"standardize" on type systems, to minimize these kinds of issues, but looking at
the vast and diverse community of people and projects working in this area, it
was felt that this was too difficult to accomplish.  So UIMA settles for
something of a compromise - an ability to "merge" different type systems,
effectively creating a union of all the types and features.

HTH

 -Marshall

On 4/22/2013 9:44 PM, swirl wrote:
> swirl <swirlobt@...> writes:
>
>> I am currently developing a Tomcat application that wraps around Uima to run
>> text mining processes.
>> I am confused over what PEAR can be used for and how it can be used in a
>> Uima-wrapped application.
>>
>> The application is to be deployed as an installed web application at our
>> client's location and it is meant to be more or less a black box to our
>> client. That is, our client should not need to know about the intricacies of
>> Uima or the various analysis engines to perform text mining processes.
>> Our application presents them a simple facade that takes in input from them,
>> runs the input through an analysis pipeline (consisting of annotators, cas
>> consumers, etc) and returns an analysed, annotated document to them.
>>
>> But we also want our application to be easily extensible and changeable: in
>> case we have a better version of an analysis engine, we want to deploy just
>> the engine to the client without having to re-compile and re-deploy the whole
>> application.
>>
>> Can we make use of PEAR to do the deployment?
>> If so, what about the types used in the analysis engines in the PEAR: how does
>> the deployed application know about the new or modified types in the PEAR?
>>
>>
>
>
> Erhmmm, has anybody done something like this before?
> I really am interested to know how you can do it.
>
> To clarify, I am very interested in how you can mix-match different PEARs, 
> possibly from different open source projects, with different type systems, 
> and run them in a pipeline as a coherent whole.
>
> How do you resolve the issue that all their type systems are of different
> Java types, and still use each other's analysis results in the pipeline?
>
> Thanks!