uima-user mailing list archives

From "Armando Stellato" <stell...@info.uniroma2.it>
Subject RE: RegexAnnotator ruleExceptions problem
Date Thu, 18 Jun 2009 14:21:40 GMT
Dear Michael,

I finally found some time to delve into the code, with regard to the
problem I reported a few months ago about exceptions not working. It was
not difficult at all, and the code is pretty clear; I only had to find the
time :-)

From what I saw in the code (and from reading the manual again, which
clarifies this point quite well), the exception is not checked against the
annotation that *would be* created (the one matched by the rule regex), but
against the annotation of the matchType specified in the rule. So, for
example, the exception in the following rule would never have been applied:

			<rule ruleId="fullNumberedDate"

This is because the code does not even reach the point where the exception
regex (i.e. ".") is tested. It first looks for an annotation of type Date
that covers the only DocumentAnnotation available. The result of that lookup
is obviously always null, because no annotation can "cover" the
DocumentAnnotation.
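
The behavior described above can be modelled with a small sketch. This is
not the UIMA API and not the RegExAnnotator code; Span, covering and
exception_applies are hypothetical names, and the sample text and offsets
are made up for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation: a type plus text offsets."""
    type: str
    begin: int
    end: int


def covering(spans, target, type_name):
    """Spans of type_name that enclose target (target itself excluded)."""
    return [s for s in spans
            if s is not target and s.type == type_name
            and s.begin <= target.begin and target.end <= s.end]


def exception_applies(text, spans, match_span, exc_type, exc_regex):
    """Assumed model: the exception regex is only tested on exc_type spans
    that cover the rule's matchType span, not the span the rule would create."""
    return any(re.search(exc_regex, text[s.begin:s.end])
               for s in covering(spans, match_span, exc_type))


text = "Meeting on 18/06/2009 in room 42"
doc = Span("DocumentAnnotation", 0, len(text))
date = Span("Date", 11, 21)        # "18/06/2009"
number = Span("Number", 17, 21)    # "2009"
spans = [doc, date, number]

# matchType = DocumentAnnotation: no Date covers it, "." is never tested.
print(exception_applies(text, spans, doc, "Date", "."))     # False
# matchType = Number: the Date covers it, so the exception applies.
print(exception_applies(text, spans, number, "Date", "."))  # True
```

With DocumentAnnotation as the matchType the list of covering spans is
always empty, so the exception can never apply, whatever its regex is.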

So, two things come to my mind:

1) If I really want to use exceptions, I cannot use DocumentAnnotation as
the matchType for the rule; this follows from what I said above. Instead,
the rule's matchType should be some type that is contained in the text I
want to annotate rather than covering it, such as TokenAnnotation. This
seemed strange to me at first, since token annotations are "pieces" of the
text being searched, but your annotator probably assembles them together
and is able to run regexps over the assembled text. This is confirmed by
the example before section 2.4.1, where you use TokenAnnotation instead of
DocumentAnnotation as the matchType, in particular by using normalized
tokens instead of the original strings.
I tried this in an attempt to extract potential years from four-digit
numbers (previously annotated as Number annotations) and it worked.

2) However, this approach has two limitations:
   a) it requires some kind of underlying annotation (Token, or Number, in
the above case); otherwise you must use DocumentAnnotation, which does not
allow the use of exceptions.
   b) even when such an annotation is available, the exception is still
evaluated on the annotation of the exception's type that covers the
annotation of the rule's matchType, whereas it could be important to check
the annotation covering the specific *to be extracted* annotation. For
example, in the original case I submitted months ago, I ran the Date and
GenericNumber annotators with Date preceding GenericNumber, and I was
trying to define an exception on numbers so that numbers covered by an
already annotated date would be skipped. If I had been able to check the
covering annotation of my extracted numbers, I would have used an exception
like the following one:


This would exclude all numbers which were already annotated as Dates (the
"." always matches, so it suffices to have a covering Date annotation).
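
As a rough sketch of the proposed behavior: the rule regex runs over the
whole document text, and the exception is checked against Date annotations
covering each individual match. This is only an illustration, not
RegExAnnotator code; extract_numbers, the \d+ rule regex and the sample
offsets are assumptions.

```python
import re


def extract_numbers(text, dates, exc_regex="."):
    """Keep number matches unless a date span covers the match itself and
    the exception regex matches that date's text (hypothetical behavior)."""
    kept = []
    for m in re.finditer(r"\d+", text):
        excluded = any(
            begin <= m.start() and m.end() <= end
            and re.search(exc_regex, text[begin:end])
            for begin, end in dates
        )
        if not excluded:
            kept.append((m.start(), m.end(), m.group()))
    return kept


text = "Meeting on 18/06/2009 in room 42"
dates = [(11, 21)]  # offsets of the already annotated Date "18/06/2009"

# "18", "06" and "2009" lie inside the Date and are skipped.
print(extract_numbers(text, dates))  # [(30, 32, '42')]
```

Here no Token or Number annotation is needed underneath: the covering
check uses the offsets of the match itself.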

Unfortunately, with the standard behavior of exceptions, the declared
matchType was DocumentAnnotation (I had no tokens), so I was not able to
get a covering annotation for it (nor would that make sense).

Do you think the proposed change would be a nice improvement for a future
version of your RegularExpression Annotator?



> -----Original Message-----
> From: Armando Stellato [mailto:stellato@info.uniroma2.it]
> Sent: Tuesday, April 28, 2009 9:15 PM
> To: uima-user@incubator.apache.org; mba@michael-baessler.de
> Subject: RE: RegexAnnotator ruleExceptions problem
> Hi,
> > you may also know that "." for your match is a predefined character
> > class that matches any character. If you only want to match the dot
> > (.) you have to specify
> the dot was intended exactly to match any character, thus matching any Date
> annotation (any, because the dot always matches) and thus not producing
> (excluding) any Number annotation which would have been taken inside it.
> > If you are able to debug in your env - the interesting piece of code is
> >
> > org.apache.uima.annotator.regex.impl.RegExAnnotator.java at line 487.
> Thanks, I'll check it :-)
