From: Nicolas Hernandez
Reply-To: nicolas.hernandez@univ-nantes.fr
Date: Fri, 29 Jul 2011 11:41:50 +0200
Subject: New features for the Apache UIMA Regular Expression Annotator
To: user@uima.apache.org

Hi everyone,

I tested the Apache UIMA Regular Expression Annotator to see how well it can express recognition rules, using named entity recognition as the task. Bearing in mind that it works only on text characters, I ran into two main limitations. I'd like to know what you think of them, and whether future versions of the annotator could address them.

Roughly speaking, my problems started when I tried to handle several concepts and when my rules reached a high level of complexity.

First, since the regex variables are themselves regexes, I used them as a dictionary of elements for my rules (e.g. ). The elements are also regexes, which has some advantages (e.g. ). The major drawback appears when the dictionary has several hundred or several thousand lexical entries: it is tedious to keep the dictionary up to date, or even to open and edit the file. It would be great if the variable values could also be defined in external files (one entry per line). This would also make it possible to define a variable once and reuse it in as many distinct rule files as you want, which makes the rules easier to keep up to date as well.

Second, it is possible to set a priority order between the rules of a single concept, but not between concepts. In practice, distinct concepts may have similar rules (e.g. person entity and location entity), and you may wish to set a priority between them so that the ambiguity does not have to be handled outside of the annotator. Currently, the only way to avoid this situation is to define the recognition rules of the person and the location entities in the same concept, which is not conceptually acceptable.
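To make the second request concrete, here is a minimal Java sketch of what a priority between concepts would amount to. It is not part of the annotator's API: the `ConceptPriority` class, the `Match` type and the `resolve` method are hypothetical names, and the sketch assumes that matches from all concepts are available as (concept, begin, end) spans. Among overlapping spans, only the one whose concept comes first in the priority list is kept.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of the Regular Expression Annotator.
final class ConceptPriority {

    /** A span matched by some rule of some concept (illustrative type). */
    static final class Match {
        public final String concept;
        public final int begin;
        public final int end; // exclusive

        Match(String concept, int begin, int end) {
            this.concept = concept;
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Among overlapping matches, keeps only the one whose concept appears
     * first in priorityOrder. Concepts missing from the list rank last.
     */
    public static List<Match> resolve(List<Match> matches, List<String> priorityOrder) {
        List<Match> sorted = new ArrayList<>(matches);
        // Visit high-priority concepts first, then by text position.
        sorted.sort(Comparator
                .comparingInt((Match m) -> rank(m, priorityOrder))
                .thenComparingInt((Match m) -> m.begin));

        List<Match> kept = new ArrayList<>();
        for (Match candidate : sorted) {
            boolean clash = false;
            for (Match k : kept) {
                // Two half-open spans overlap if each starts before the other ends.
                if (candidate.begin < k.end && k.begin < candidate.end) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                kept.add(candidate);
            }
        }
        // Return the surviving matches in text order.
        kept.sort(Comparator.comparingInt((Match m) -> m.begin));
        return kept;
    }

    private static int rank(Match m, List<String> order) {
        int i = order.indexOf(m.concept);
        return i < 0 ? Integer.MAX_VALUE : i;
    }
}
```

With the priority list ("Person", "Location"), a Location span that overlaps a Person span is dropped, while non-overlapping Location spans are kept.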
Offering a way to set priorities between concepts raises the question of how to do so when the concepts are defined in distinct files. I agree that the ambiguity problem can also be handled by downstream annotators.

Regards,
/Nicolas
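P.S. On the first point, the external-file idea could look roughly like the Java sketch below. It only illustrates the behaviour I have in mind and is not how the annotator works today: the `VariableFileLoader` class and its methods are hypothetical, and the file format (UTF-8, one entry per line, blank lines ignored) is an assumption. Each file is turned into a single alternation that could serve as the value of a regex variable.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Regular Expression Annotator.
final class VariableFileLoader {

    /**
     * Turns dictionary entries into one non-capturing alternation.
     * If entriesAreRegex is false, each entry is quoted and matched literally.
     */
    public static String toAlternation(List<String> entries, boolean entriesAreRegex) {
        List<String> parts = new ArrayList<>();
        for (String raw : entries) {
            String entry = raw.trim();
            if (entry.isEmpty()) {
                continue; // skip blank lines
            }
            parts.add(entriesAreRegex ? entry : Pattern.quote(entry));
        }
        // Longest entries first, so a longer entry is tried before any
        // shorter entry that is a prefix of it.
        parts.sort((a, b) -> Integer.compare(b.length(), a.length()));
        return "(?:" + String.join("|", parts) + ")";
    }

    /** Reads a one-entry-per-line file (assumed UTF-8) and builds the alternation. */
    public static String loadVariable(Path file, boolean entriesAreRegex) throws IOException {
        return toAlternation(Files.readAllLines(file, StandardCharsets.UTF_8), entriesAreRegex);
    }
}
```

The same file could then be loaded by several rule files, so each dictionary would be maintained in one place only.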