uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: UIMAj3 ideas
Date Fri, 10 Jul 2015 17:37:27 GMT
On 7/9/2015 6:52 PM, Petr Baudis wrote:
<snip...>

https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3

>   I didn't figure out how to edit that wiki page, 
Due to spammers, we had to turn off public editing.  However, I can add you to a
list (to do this, you have to "register" for a user id on the wiki, and then
send me that id offline).  Even without being on the list, there's a comment
button which (I think) lets you add comments at the bottom.
> but a mental summary
> of the things I find currently irritating about UIMA and would love to
> see changed formed in my mind, so I thought I could contribute it for
> discussion.
Great!
>
>   * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
>     UIMA.  It seems to me that UIMA-AS is doing things a bit differently
>     than what the original UIMA idea of doing scaleout was.  The two
>     things don't play well together.  I'd love a way to easily take
>     my plain UIMA pipeline and scale it out, ideally without any code
>     changes, *and* avoid the terrible XML config files.
Any specifics of what to change here would be helpful.  UIMA-AS was designed to
enable scale-out without changing the core UIMA pipeline or its XML
descriptor.  The additional information for UIMA-AS scale-out was put into a
separate XML descriptor which "embeds" the original plain UIMA one.

>
>   * Speaking of avoiding the config files, it'd be nice if I could avoid
>     them for type systems as well.  A radical idea: In the end, I treat
>     UIMA essentially as a storage for Java objects; I suspect many others
>     do the same.  I'd love a way to turn JCasGen on its head and write
>     the Java classes (possibly with some restrictions) that I could
>     store in UIMA, with the backend figuring out the low-level UIMA
>     representation on its own.  This would radically reduce some aspects
>     of the engineering overhead for me and maybe many other users.
Interesting idea.  I'll add it to the list.
>   * The JCas UIMA interface should be more transparent in other ways
>     too.  Working with arrays (and absence of lists) is a huge pain.
>     I just want to work with feature structures as if they were normal
>     Java objects, without major restrictions.
This is one of the version 3 ideas: see
https://cwiki.apache.org/confluence/display/UIMA/Supporting+Java+Collections+and+Maps+as+UIMA+Feature+Structures
>
>   * Connected with the above - I'd love .addToIndexes() to just
>     disappear.  Right now, the paradigm is that you build an annotation
>     in an annotator, and the moment it gets saved in a CAS, it becomes
>     basically read-only.  
You certainly can modify any of an Annotation's features subsequently.  I'm
guessing you're referring to another idea - adding additional features that were
not initially defined in the UIMA type system.  UIMA sets up the types and
features once at the start of the pipeline run (from a merge of all the
components' type systems), and locks down the type system.  Other frameworks
sometimes allow an unlocked type system, where you could add (after a Feature
Structure is created) additional features.  This is usually done by keeping a
list of feature-name <-> feature-value pairs (such as your code snippet does,
below).  We're thinking of including this capability in version 3, with a
bit of a twist - the intent would be to keep the "compilable" aspect of
"locked-down" type/features (for high performance), while adding (for those use
cases that want it) the other style of dynamically added additional features (at
some cost in performance).  
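
To make the two styles concrete, here is a minimal sketch in plain Java.  It is
only an illustration of the idea, not UIMA API and not the version 3 design:
the class AnswerInfoSketch, its "confidence" field, and the setExtra/getExtra
methods are all hypothetical names made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java illustration only; none of these names are UIMA API. */
public class DynamicFeatureSketch {

    /** Hypothetical stand-in for a feature structure with one declared feature. */
    static final class AnswerInfoSketch {
        // Declared ("locked-down") feature: a compiled field, cheap to access.
        private double confidence;

        // Dynamically added features kept as name -> value pairs.  The map is
        // created lazily, so instances that never use it pay almost nothing.
        private Map<String, Double> extras;

        double getConfidence() {
            return confidence;
        }

        void setConfidence(double value) {
            confidence = value;
        }

        void setExtra(String name, double value) {
            if (extras == null) {
                extras = new LinkedHashMap<>();
            }
            extras.put(name, value);
        }

        /** Returns the stored value, or fallback if the feature was never set. */
        double getExtra(String name, double fallback) {
            if (extras == null) {
                return fallback;
            }
            Double v = extras.get(name);
            return v == null ? fallback : v;
        }
    }

    public static void main(String[] args) {
        AnswerInfoSketch ai = new AnswerInfoSketch();
        ai.setConfidence(0.75);               // declared feature
        ai.setExtra("hypotheticalFeature", 1.0); // added later, e.g. by another annotator
        System.out.println(ai.getConfidence() + " "
                + ai.getExtra("hypotheticalFeature", 0.0));
    }
}
```

The declared feature stays a plain field, while the extras go through a map
lookup, which is where the performance cost mentioned above would come from.
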
> But if I want e.g. to build up a set of
>     features across multiple annotators, things again become very
>     painful.  Because arrays are also fixed-size, I need awful boilerplate
>     code like
>
>                 AnswerInfo ai = JCasUtil.selectSingle(jcas, AnswerInfo.class);
>                 AnswerFV fv = new AnswerFV(ai);
>                 fv.setFeature(f, 1.0);
>
>                 for (FeatureStructure af : ai.getFeatures().toArray())
>                         ((AnswerFeature) af).removeFromIndexes();
>                 ai.removeFromIndexes();
>
>                 ai.setFeatures(fv.toFSArray(jcas));
>                 ai.addToIndexes();
>
>     simply to add a feature.  (Note the AnswerFV class, which is the
>     actual thing I want to store in a JCas - a dynamic list of
>     (feature_label, feature_value) pairs - but to do that it ends
>     up being instead a complex factory of JCas FSes with a lot more
>     boilerplate code inside.  Also note the typecast.)
>
>   * I wondered about storing (arbitrary) graphs in the CAS, but the
>     issues above make this really impractical.  If you also think about
>     integrating microformats, you need to think about how to do this.
We have had users store arbitrary graphs in the CAS, but, yes, it is not so
efficient.  The main elements UIMA has for collections of references (to
FeatureStructures) are the FSArray and FSList.  As you point out, the FSArray is
fixed length.  The FSList supports dynamic adding/removing etc. using the
standard linked-list technique.  However, because UIMA data in the CAS
(currently) is not garbage collected, you have to be careful when using this
technique.
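
The difference between the two collection styles can be sketched in plain Java.
This is not the FSArray or FSList API; Cell, append, push and length are
hypothetical names, and a String stands in for a reference to a feature
structure.

```java
import java.util.Arrays;

/** Plain-Java illustration only; none of these names are UIMA API. */
public class FsCollectionSketch {

    /** One cell of a singly linked list of references. */
    static final class Cell {
        final String head;
        final Cell tail; // null marks the end of the list

        Cell(String head, Cell tail) {
            this.head = head;
            this.tail = tail;
        }
    }

    /** Fixed-length style: adding an element allocates a new, larger array. */
    static String[] append(String[] fixed, String value) {
        String[] bigger = Arrays.copyOf(fixed, fixed.length + 1);
        bigger[fixed.length] = value;
        return bigger;
    }

    /** Linked style: adding at the front allocates exactly one new cell. */
    static Cell push(Cell list, String value) {
        return new Cell(value, list);
    }

    /** Counts the cells by walking the list to its end. */
    static int length(Cell list) {
        int n = 0;
        for (Cell c = list; c != null; c = c.tail) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String[] edges = append(new String[] {"n1"}, "n2");
        Cell list = push(push(null, "n1"), "n2");
        System.out.println(edges.length + " " + length(list));
    }
}
```

In ordinary Java the superseded array, or any cell that gets unlinked, is
reclaimed by the garbage collector.  Without garbage collection in the CAS,
that leftover storage is what you would need to watch out for.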

The above proposal to allow the common Java Collection objects (like ArrayList
and Map) as things in the CAS, plus garbage collection, should make it much more
convenient to store and work with graphs in the CAS.
>
>   * Complex pipelines are a bit clumsy.  I think the biggest obvious
>     problem is lack of signalling to CAS merger that input CASes have
>     been exhausted.  Having an "isLast" barrier sounds simple as long
>     as you have only a single CAS multiplier paired with the CAS merger,
>     but when this assumption breaks down, things start to deteriorate.
>     However, I realize complex pipelines are a niche area.
It would be nice to hear some ideas here.
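
To picture the problem, here is a small plain-Java sketch of an "isLast"
barrier.  It is an illustration under assumptions, not UIMA API and not a
proposal: CountingMerger and accept are hypothetical names, Strings stand in
for CASes, and the merger is assumed to know up front how many upstream sources
feed it.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration only; none of these names are UIMA API. */
public class IsLastBarrierSketch {

    /** Hypothetical merger that is done once every source has sent its last item. */
    static final class CountingMerger {
        private final int expectedSources;
        private int finishedSources = 0;
        private final List<String> collected = new ArrayList<>();

        CountingMerger(int expectedSources) {
            this.expectedSources = expectedSources;
        }

        /** Stores the item; returns true when this item completes the merge. */
        boolean accept(String item, boolean isLast) {
            collected.add(item);
            if (isLast) {
                finishedSources++;
            }
            return finishedSources == expectedSources;
        }

        List<String> collected() {
            return collected;
        }
    }

    public static void main(String[] args) {
        // Two sources, A and B, with interleaved output.
        CountingMerger merger = new CountingMerger(2);
        merger.accept("a1", false);
        boolean doneAfterA = merger.accept("a2", true); // A finished, B still pending
        boolean doneAfterB = merger.accept("b1", true); // now both have finished
        System.out.println(doneAfterA + " " + doneAfterB + " " + merger.collected());
    }
}
```

A merger built with expectedSources = 1 would report completion at "a2", while
B still has output pending; that is the single-multiplier assumption breaking
down.  Counting works only while the number of sources is known in advance,
which is not a given in a complex pipeline.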
>
>   I think these are my main concerns.  I guess another way to phrase it:
> I came to UIMA looking for a way to generate, store and organize
> my+3rdparty Java object annotations of various text-based entities.
> It sort of delivers, but if I did this again, I'd seriously hesitate over
> whether the steep learning curve and incredible engineering overhead are worth
> the deal.  I want to suggest that UIMAj3 would make me not hesitate, and
> get out of my way! :)
Some of the other things we're thinking about are ways to get more out of the
way and integrate with other "popular" systems.  Any constructive thoughts here
are appreciated!

Thanks for your input.

-Marshall
