Mailing-List: contact uima-user-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: uima-user@incubator.apache.org
Subject: RE: Asynchronous UIMA (workflow) ?
Date: Fri, 12 Oct 2007 15:32:52 +0200
Message-ID: <F3BED77B64E715419071BBFC02F9F1BC0241D8CA@EXV01001.GlobalSP.local>
References: 
 <101120071717.22259.470E5AC50007889B000056F32200734076C0C0CFCD099D0A0D03040108@comcast.net>
From: "Pascal Coupet" <pascal.coupet@temis.com>
To: "greg@holmberg.name" <holmberg2066@comcast.net>,
	<uima-user@incubator.apache.org>

Hi Greg,

I agree with you that human intervention is often needed in NLP-related
applications. In an editorial system, for example, you may want someone
to review and validate categories that were assigned automatically.
However, I'm not sure that this should be done within a UIMA pipeline.
The UIMA framework is middleware, not the whole application. It looks
difficult to me to manage a situation where 10 docs go into a pipeline,
9 go through at a normal pace, and one gets stuck somewhere for 1 or 2
days waiting for manual intervention. The framework is distributed and
relies on timeouts to detect errors. You will have to do something
special so that this document does not fail with an error, and hope that
the user does not forget to do the job.

If I go back to the editorial system example, it may be necessary to get
the document into the system quickly even if some annotators failed on
it. The application can then make decisions depending on the missing
parts (ask an editor to complete them, hide the document ...).

One way to handle errors is simply to store them within the CAS.
Subsequent annotators can make decisions depending on previous errors.
In your example, no entity extraction will be done because no category
is available, and the annotator will log an error "unable to extract
entities...", which is different from finding no entities. The
application receiving the CAS at the end of the workflow will propose it
to an editor, who will select categories and then resubmit the document
to the annotation workflow to get it completed.
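
To make the idea concrete, here is a minimal sketch in plain Python. It
is not UIMA code: a dict stands in for the CAS, and every name in it
(new_cas, classifier, entity_extractor, NAME_CATALOGS, resubmit) is
hypothetical. It only shows errors being recorded next to the results,
a later annotator reacting to the missing category, and the application
deciding afterwards whether an editor is needed.

```python
# Illustrative stand-in only: a plain dict plays the role of the CAS.
# All names below are hypothetical and not part of any UIMA API.

NAME_CATALOGS = {
    "biology": ["cell", "enzyme"],
    "chemistry": ["benzene", "ion"],
}


def new_cas(text):
    return {"text": text, "category": None, "entities": [], "errors": []}


def classifier(cas):
    # Keep a category that an editor has already supplied.
    if cas["category"] is not None:
        return
    words = set(cas["text"].lower().split())
    for category, catalog in NAME_CATALOGS.items():
        if words & set(catalog):
            cas["category"] = category
            return
    # Record the failure in the CAS instead of raising.
    cas["errors"].append(
        {"annotator": "classifier", "message": "no category assigned"})


def entity_extractor(cas):
    catalog = NAME_CATALOGS.get(cas["category"])
    if catalog is None:
        # "Could not run" is recorded explicitly, so it is not
        # confused with "ran and found no entities".
        cas["errors"].append(
            {"annotator": "entity_extractor",
             "message": "unable to extract entities: no usable category"})
        return
    cas["entities"] = [w for w in cas["text"].lower().split() if w in catalog]


def run_pipeline(cas):
    for annotator in (classifier, entity_extractor):
        annotator(cas)
    return cas


def needs_editor(cas):
    # Application-side decision, made outside the pipeline.
    return bool(cas["errors"])


def resubmit(cas, category):
    # The editor picks a category; errors are cleared and the same
    # workflow is run again on the same document.
    cas["category"] = category
    cas["errors"] = []
    return run_pipeline(cas)
```

The pipeline itself never waits for anyone: a document that lacks a
category still comes out the other end, and it is the application that
calls needs_editor() and, later, resubmit().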

I think that the whole purpose of the UIMA framework is to glue together
various annotation engines and to properly manage the distribution of
the work across machines. One workflow can be seen as a meta-annotator
that has business meaning for your company or research center. Ideally,
it can be reused by different applications. So I would try to avoid, as
much as possible, encoding application-specific actions directly within
it.

Pascal

Pascal Coupet
Chief Technology Officer & Co-founder

TEMIS INC
1518 Walnut Street, suite 1702, Philadelphia, PA 19102, USA
Tel:   +1 215 732 2549 ext 112
Mob:   +1 215 609 2514
Fax:   +1 215 732 0490
www.temis.com

-----Original Message-----
From: greg@holmberg.name [mailto:holmberg2066@comcast.net]
Sent: Thursday, October 11, 2007 1:18 PM
To: uima-user@incubator.apache.org; uima-user@incubator.apache.org
Cc: Pascal Coupet
Subject: RE: Asynchronous UIMA (workflow) ?

Pascal--


I was thinking essentially the same thing: serialize the CAS to a file
or database, do your human interaction (possibly including the CAS
Editor), then reload it and resume processing.

It would be nice to generalize it, rather than have two explicit
analysis engines.  So a nice enhancement to UIMA would be the ability to
persist not just the CAS but the state of the engine along with it, so
that it could be stopped and restarted at any point.
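
As a rough sketch of what such a checkpoint could look like, here is a
toy version in plain Python. It is not UIMA code and all names (run,
resume, NeedsHuman, STEPS) are hypothetical. The "engine state" is
reduced to the index of the step to retry, which is a large
simplification of what a real engine would have to save.

```python
import json
import os
import tempfile

# Hypothetical stand-ins: a dict for the CAS, plain functions for steps.


class NeedsHuman(Exception):
    """Raised by a step that cannot continue without manual input."""


def classify(cas):
    if cas.get("category") is None:
        raise NeedsHuman("no category assigned")


def extract_entities(cas):
    # Toy extractor: treat capitalized words as entities.
    cas["entities"] = [w for w in cas["text"].split() if w.istitle()]


STEPS = [classify, extract_entities]


def run(cas, checkpoint_path, start=0):
    """Run steps from index `start`. If a step needs a human, save the
    CAS and the step index to `checkpoint_path` and return False."""
    for i in range(start, len(STEPS)):
        try:
            STEPS[i](cas)
        except NeedsHuman as exc:
            state = {"cas": cas, "next_step": i, "reason": str(exc)}
            with open(checkpoint_path, "w", encoding="utf-8") as f:
                json.dump(state, f)
            return False
    return True


def resume(checkpoint_path, patch):
    """Reload a checkpoint, apply the user's manual additions, and
    continue from the step that stopped."""
    with open(checkpoint_path, encoding="utf-8") as f:
        state = json.load(f)
    cas = state["cas"]
    cas.update(patch)
    finished = run(cas, checkpoint_path, start=state["next_step"])
    return cas, finished
```

A caller would invoke run() once, notify the user when it returns False,
and later call resume(path, {"category": "chemistry"}) with whatever the
user supplied. Because the stopping point is saved along with the CAS,
the step that fails does not have to be known in advance.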

For my purposes, this would be useful if, say, one annotator depended on
finding certain data in the CAS from another annotator, but that earlier
one failed or didn't produce the right data, and I need a user to
produce the data manually.

For example, suppose a taxonomy classifier runs first and a named entity
extractor runs second, and the entity extractor wants to select a name
catalog (NC) to use based on the classification ("if classified biology,
use biology NC, else if classified chemistry use chemistry NC"). If the
classifier doesn't classify at all, or doesn't classify into the right
category (not biology or chemistry), then I would want the user to
classify the document manually.  So I would persist that document and
engine state and notify the user, who would classify it; I would then
restart the engine, which would move on to run the entity extractor with
an NC based on the user's classification.

Not knowing in advance where in the engine the failure will occur
(failure to classify being only one possibility), I can't create two
explicit engines.  Having a general mechanism to persist the state of
the engine would let me handle any failure or missing dependency.  NLP
being generally an imprecise process, I foresee human intervention in
the pipeline as a not-infrequent occurrence.  So having a mechanism to
deal with that in a general way would be helpful.

This is not a high-priority enhancement for me at the moment, just an
idea for us to kick around.


Greg Holmberg


 -------------- Original message ----------------------
From: "Pascal Coupet" <pascal.coupet@temis.com>
> Hi Thomas,
>
> I think a way to do it is to split this process across 2 workflows. The
> first consumer will get the CAS and possibly store it as XML somewhere
> (file, database ...). A small application will manage the interaction
> with the user (sending mail, reminders ...), watch a return-address
> mailbox, update the XCAS and make it available. The source of the second
> workflow will watch for available updated XCAS and continue from there.
> You can in theory make the consumer of the first workflow send the mail
> and the source of the second watch for incoming emails, but I think it
> will be more difficult to properly manage the interaction with users
> (reminders to respond, statistics, routing configuration ...).
>
> Just some thoughts,
>
> Pascal
>
> ________________________________
>
> From: Thomas Francart [mailto:thomas.francart@mondeca.com]
> Sent: Thursday, October 11, 2007 7:01 AM
> To: uima-user@incubator.apache.org
> Subject: Asynchronous UIMA (workflow) ?
>
> Hi all -
>
> I'm wondering whether it would be possible to add an asynchronous step
> to a UIMA pipeline. For example, an analysis engine that would ask for
> user input or a user review of a CAS, or something like that. My point
> is that at some point in the pipeline, I would like a user to review the
> state of the CAS, maybe add some more information, delete some, and so
> on; the rest of the pipeline would then continue upon user validation.
> (By "user" here I don't mean someone who sits in front of a computer and
> watches the UIMA processing take place, but maybe someone receiving an
> email saying "hey, you should have a look and validate that".)
>
> I know this is a generic workflow question, but I was just wondering if
> other people had the same question/requirements with a UIMA integration,
> and if you had some ideas on how it could be addressed/solved.
>
> Best,
> Thomas
>
> --
>
> Thomas Francart
> Mondeca
> 3, cité Nollez 75018 Paris France
> Tel: +33 (0)1 44 92 35 04 - Fax: +33 (0)1 44 92 02 59
> Blog: mondeca.wordpress.com
> Web: www.mondeca.com
> Mail: thomas.francart@mondeca.com