Date: Thu, 16 Jul 2015 18:52:41 +0200
From: Petr Baudis
To: user@uima.apache.org
Subject: 
Re: UIMAj3 ideas
Message-ID: <20150716165241.GK2760@machine.or.cz>
References: <20150709225200.GL2760@machine.or.cz> <55A002D7.7050808@schor.com>
In-Reply-To: <55A002D7.7050808@schor.com>

On Fri, Jul 10, 2015 at 01:37:27PM -0400, Marshall Schor wrote:
> On 7/9/2015 6:52 PM, Petr Baudis wrote:
> > > https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3
> > > I didn't figure out how to edit that wiki page,
> Due to spammers, we had to turn off public editing.  However, I can add you to a
> list ( to do this, you have to "register" for a user id on the wiki, and then
> send me offline what that Id is ), but even without being on the list, there's a
> comment button which (I think) lets you add comments at the bottom.
> > but a mental summary
> > of the things I find currently irritating about UIMA and would love to
> > see changed formed in my mind, so I thought I could contribute it for
> > discussion.
> Great!
> >
> >   * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
> >     UIMA.  It seems to me that UIMA-AS is doing things a bit differently
> >     than what the original UIMA idea of doing scaleout was.  The two
> >     things don't play well together.  I'd love a way to easily take
> >     my plain UIMA pipeline and scale it out, ideally without any code
> >     changes, *and* avoid the terrible XML config files.
> Any specifics of what to change here would be helpful.  UIMA-AS was designed to
> enable scale-out without changing the core UIMA pipeline or its XML
> descriptor.  The additional information for UIMA-AS scaleout was put into a
> separate xml descriptor which "embeds" the original plain UIMA one.

I'm sure Richard would be able to explain this better, but I think one
of the core issues is that UIMA-AS embeds the XML descriptor instead of
the AnalysisEngineDescription.
So when I want to use it together with an AnalysisEngineDescription
built with UIMAfit instead, it's time to start making crazy workarounds
like

	https://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.uimaas/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/uimaas/component/SimpleService.java?name=14aeba50c8c1&r=14aeba50c8c18ea4d14c0d099f43c049f806d9db

> >   * Connected with the above - I'd love .addToIndexes() to just
> >     disappear.  Right now, the paradigm is that you build an annotation
> >     in an annotator, and the moment it gets saved in a CAS, it becomes
> >     basically read-only.
> You certainly can modify any of an Annotation's features subsequently.
> I'm guessing you're referring to another idea - adding additional features that were
> not initially defined in the UIMA type system.

Sorry for the confusion, but that's not quite what I had in mind.
I literally believe that right now, in order to modify the value of
a feature, you need to first remove the annotation from the index,
change the value, then re-add it.  Is that a misconception?

> UIMA sets up the types and
> features once at the start of the pipeline run (from a merge of all the
> component's type systems), and locks down the type system.  Other frameworks
> sometimes allow an unlocked type system, where you could add (after a Feature
> Structure is created) additional features.  This is usually done by keeping a
> list of feature-name <-> feature-value pairs (such as your code snippet does,
> below).  We're thinking of including this capability in the version 3, with a
> bit of a twist - the intent would be to keep the "compilable" aspect of
> "locked-down" type/features (for high performance), while adding (for those use
> cases that want it) the other style of dynamically added additional features (at
> some cost in performance).

Still, this would be awesome and I'd totally make use of it!
(The code in my original email I guess conflates demonstration of two
issues - the addToIndex and lack of variable-sized lists, i.e. the java
collection support issue.  Even if you decide generic collection / map
support would be too tricky, at least supporting variable-sized lists
would help a lot...)

> >   * I wondered about storing (arbitrary) graphs in the CAS, but the
> >     issues above make this really impractical.  If you also think about
> >     integrating microformats, you need to think about how to do this.
> We have had users store arbitrary graphs in the CAS, but, yes, it is not so
> efficient.  The main element UIMA has for collections of references (to
> FeatureStructures) are the FSArray and FSList.  As you point out the FSArray is
> fixed length.  The FSList supports dynamic adding/removing etc. using the
> standard link-list technology.  However, because UIMA data in the CAS
> (currently) is not garbage collected, you have to be careful when using this
> technique.

...oh, never mind.  After using UIMA heavily for well over a year,
I managed not to learn that FSList exists at all!  Thanks for this
pointer.  I think that's a bug for the UIMA Tutorial, which mentions
FSArray but not FSList. :-)

(Another pain point here - I always ache when I need to work with
FSArray or I guess FSList, since it does not carry the type information
that is in the typesystem - I need to manually typecast all the time
and hope I don't make a mistake.)

> The above proposal to allow the common Java Collection objects (like ArrayList,
> and Maps) as things in the CAS, plus garbage collection, should make it much more
> convenient to store and work with graphs in the CAS.
> >
> >   * Complex pipelines are a bit clumsy.  I think the biggest obvious
> >     problem is lack of signalling to CAS merger that input CASes have
> >     been exhausted.
> >     Having an "isLast" barrier sounds simple as long
> >     as you have only a single CAS multiplier paired with the CAS merger,
> >     but when this assumption breaks down, things start to deteriorate.
> >     However, I realize complex pipelines are a niche area.
> It would be nice to hear some ideas here.

(After reading Eddie Epstein's email and coming back to some more of his
emails to me, I realize that the isLast hack I'm using is needless if
I would instead use the "process-parent-last" flag of CASMultiplier.
I'm learning a lot from interacting here!  I guess that shows we could
always make use of more good UIMA code examples...)

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton
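
On the remove / modify / re-add question raised in this message: as far
as I understand it, the constraint concerns features that a sorted index
uses as keys, not every feature.  Below is a minimal sketch in plain
Java, not UIMA API, of why a sorted index wants that pattern.  The class
Span, the TreeSet standing in for an index, and all method names are
hypothetical and chosen only for illustration.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class SortedIndexSketch {

    /** Hypothetical stand-in for an annotation; not a UIMA class. */
    static final class Span {
        int begin;      // plays the role of a feature used as an index key
        String label;   // plays the role of a feature that is not a key

        Span(int begin, String label) {
            this.begin = begin;
            this.label = label;
        }
    }

    /** Hypothetical stand-in for a sorted index keyed on 'begin'. */
    static TreeSet<Span> newIndex() {
        return new TreeSet<>(Comparator.comparingInt((Span s) -> s.begin));
    }

    /** Remove, change the key, re-add, so the entry is sorted again. */
    static void updateBegin(TreeSet<Span> index, Span span, int newBegin) {
        boolean wasIndexed = index.remove(span);
        span.begin = newBegin;
        if (wasIndexed) {
            index.add(span);
        }
    }

    /** Returns true if the index is still consistent after the updates. */
    static boolean safeUpdateKeepsIndexConsistent() {
        TreeSet<Span> index = newIndex();
        Span a = new Span(0, "a");
        Span b = new Span(10, "b");
        Span c = new Span(20, "c");
        index.add(a);
        index.add(b);
        index.add(c);

        a.label = "renamed";        // non-key change: no re-indexing needed
        updateBegin(index, a, 30);  // key change: remove, modify, re-add

        return index.size() == 3
                && index.contains(a)
                && index.first() == b
                && index.last() == a;
    }

    public static void main(String[] args) {
        System.out.println("consistent after safe update: "
                + safeUpdateKeepsIndexConsistent());
    }
}
```

Assigning a.begin directly while the object is still in the set would
leave the entry at its old position in the tree, so later lookups and
iteration order can no longer be trusted.  Changing label needs no such
care because the comparator never looks at it.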