uima-user mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: question on REGEXP in Ruta
Date Mon, 08 Feb 2016 14:19:52 GMT

On 08.02.2016 at 14:49, Bonnie MacKellar wrote:
> Hi,
> Thanks. This is very useful. I did not realize I could use the BLOCK
> construct in this way. It completely did the trick for me. However, I have
> to admit I don't understand your inlined rule
> Line->{".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;};
> I thought that anything after the -> should be a result, not a match. I
> think I do not understand Ruta syntax very well, even though I have read
> the guide a bunch of times. And that leads to my next concern...

The documentation about inlined rules is here:

There are two types of inlined rules: as conditions or as actions.
The inlined rules are placed after the normal condition-action part, and
the type of the inlined rules is indicated by an arrow:
"->" means: treat them as actions
"<-" means: treat them as conditions

The "->" arrow is normally also used to separate conditions from
actions, but here it indicates inlined rules.
The overall typical syntax for an inlined rule (without some other
stuff) looks something like:
Inlined rules as actions are almost the same as a BLOCK construct, but a
bit more powerful since they can be restricted to a rule element in a
larger rule.
The rules within the brackets are applied within the context of the rule
element if the overall rule was able to match.
For your example with line, this means that the rule successfully
matched on a line, then the inlined regexp rule tries to apply within
the given line, which is every line in the document.
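As a rough illustration of what "apply the regexp rule within each line" means, here is a minimal Python sketch. It is not Ruta code: Python's re module stands in for the Java regex engine, and annotate_lines and the tuple layout are hypothetical names chosen for the example. It also shows why the unparenthesized alternation in ".*no|No (.*)" can leave group 1 empty.

```python
import re

# Assumption: Python's `re` is used only as a stand-in for Java regexes in Ruta.
PATTERN = re.compile(r".*(?:no|No) (.*)")


def annotate_lines(text):
    """Apply PATTERN inside each line separately (hypothetical helper).

    Returns tuples (type_name, begin, end, covered_text), with offsets
    relative to the whole document, similar to annotation offsets.
    """
    annotations = []
    offset = 0
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        match = PATTERN.search(body)
        if match:
            # Whole match plays the role of Rule1NoPattern.
            annotations.append(
                ("Rule1NoPattern", offset + match.start(0), offset + match.end(0), match.group(0))
            )
            # Capturing group 1 plays the role of Group1.
            annotations.append(
                ("Group1", offset + match.start(1), offset + match.end(1), match.group(1))
            )
        offset += len(line)
    return annotations


text = (
    "Not pregnant or nursing\n"
    "No other concurrent anticancer therapy\n"
)
annotations = annotate_lines(text)
for annotation in annotations:
    print(annotation)

# Alternation precedence: ".*no|No (.*)" means (".*no") OR ("No (.*)").
line = "No concurrent participation in another clinical trial that would preclude"
loose = re.search(r".*no|No (.*)", line)
grouped = re.search(r".*(?:no|No) (.*)", line)
print(repr(loose.group(0)), loose.group(1))
print(repr(grouped.group(1)))
```

In the last part, the first alternative ".*no" wins on the "no" inside "another", so the capturing group of the second alternative never takes part in the match. Wrapping the alternation in (?:...) avoids that.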

Inlined rules as conditions are evaluated when the conditions of the rule
element are evaluated. The rule element matches only if one of the
inlined rules was able to match.

Right now, there are some restrictions of the debug-explanation of
inlined rules in the Ruta Workbench. The explanation works better for

> I do want to move this to annotations instead of strictly using regular
> expressions. We have an extensive set of regular expressions which were
> published by another group, and what we are trying to do is a) run these
> expressions against a dataset which is different from the original group's
> data, to see if we get the same statistics for matches, and then b) try to
> improve the matches with annotations obtained from the Metamap annotator.
> To do that, I will need to switch to using annotations. I am using the
> regular expressions now, and using them pretty much as they were written
> for better or worse, to make sure we don't change the meaning in any way
> while getting the first set of statistics.
> I have been trying to rewrite these rules using annotations, but I am
> totally failing at it. Everything I try gives me syntax errors. If I wanted
> to change this example rule to something that used annotations, what would
> it look like? This is obviously totally wrong and horrible
> LINE({ANY REGEXP((?:no|No)) ANY {CONTAINS UmlsConcept} ->MARK SomeTag})
> what I am trying to say here is that a line that contains any characters
> followed by the regular expression (no|No) followed by characters that
> have been annotated with UmlsConcept should be marked with SomeTag. I just
> can't figure out where to put the braces and arrows.

Does this work for you:

Line-> {ANY{REGEXP("[Nn]o")} @UmlsConcept{->SomeTag};};

The rule matches on every Line annotation, which is successful for every
visible line in the document. Then, the inlined rule is applied for
every match (=Line): ANY{REGEXP("[Nn]o")} @UmlsConcept{->SomeTag};
The @ says that the rule should not start to match with the first rule
element but with the rule element with the @ in front of it. This is
just an optimization because ANY would match on every token.
This is the second rule element: @UmlsConcept{->SomeTag}
This rule element just matches on an annotation of the type UmlsConcept.
If there is a UmlsConcept annotation, the rule continues the matching
process with the remaining rule elements, that is, the first one. It
checks whether the previous token (ANY) fulfills the regular expression
"[Nn]o". If that is also successful, then the only action of the rule
is executed. The implicit action creates a new annotation of the type
"SomeTag" on the offsets of the UmlsConcept annotation.
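The matching order described above can be sketched in Python as follows. This is only an approximation and not how Ruta is implemented: whitespace-separated tokens stand in for ANY, the concept spans are given by hand, the Line window is ignored, and the check assumes the regular expression has to cover the whole token. The names tokenize and tag_concepts_after_negation are hypothetical.

```python
import re

NEGATION = re.compile(r"[Nn]o")


def tokenize(text):
    """Whitespace-separated tokens as (begin, end, text); a crude stand-in for ANY."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def tag_concepts_after_negation(text, concept_spans):
    """Start at each concept span (the '@' anchor), then look at the previous token.

    concept_spans: list of (begin, end) offsets standing in for UmlsConcept.
    Returns ("SomeTag", begin, end) for each concept preceded by "no" or "No".
    """
    tokens = tokenize(text)
    tagged = []
    for begin, end in concept_spans:
        # Tokens that end before the concept begins; the last one is the neighbor.
        previous = [token for token in tokens if token[1] <= begin]
        # Assumption: the regex must match the entire token text.
        if previous and NEGATION.fullmatch(previous[-1][2]):
            tagged.append(("SomeTag", begin, end))
    return tagged


text = "No fever reported, patient has cough"
fever = (text.index("fever"), text.index("fever") + len("fever"))
cough = (text.index("cough"), text.index("cough") + len("cough"))
result = tag_concepts_after_negation(text, [fever, cough])
print(result)
```

Starting from the concept spans keeps the loop short: only the few concept positions are visited, instead of testing every token against the regular expression first.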

Does this help? Do not hesitate to ask if you have more questions, e.g., about
other rules.

btw, if you have more rules like this, I'd recommend that you use a
BLOCK construct and separately annotate the negation indicators with a
special type.



> Anyway, my regular expressions seem to be working now!
> Bonnie MacKellar
> On Mon, Feb 8, 2016 at 4:48 AM, Peter Klügl <peter.kluegl@averbis.com>
> wrote:
>> Hi,
>> capturing groups are not supported by the REGEXP condition since it is
>> essentially just a boolean function and cannot transfer its internal
>> information to an action which creates annotations. However, there are
>> many other ways to solve it.
>> There is maybe a problem with your regexp. I changed it to ".*(?:no|No)
>> (.*)" in the following.
>> You can, for example, use the simple regexp rule and restrict its
>> matching context to each line:
>> ... with a BLOCK:
>> BLOCK(eachLine) Line{}{
>>     ".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;
>> }
>> ... with an inlined rule:
>> Line->{".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;};
>> Some additional comments:
>> You should mention the type Line in the EXEC action for reindexing, if
>> you want to use these annotations in the following rules:
>> Document{-> EXEC(PlainTextAnnotator, {Line})};
>> For your rules, it does not make a difference, but if you use other
>> conditions like PARTOF, it will not work correctly.
>> From my experience, I'd recommend to work directly with annotations
>> instead of regexes for detecting the target of a negation. Then, you can
>> refactor the rules more easily, e.g., if you have a rule like
>> Line->{PrefixNegationInd #{-> Group1};}; you can replace the wildcard
>> with something better in future like ChunkNP. (I just wanted to mention
>> it. I know that your example was probably just an example to describe
>> the problem with ruta.)
>> Best,
>> Peter
>> On 08.02.2016 at 00:37, Bonnie MacKellar wrote:
>>> Hi,
>>> I am trying to write RUTA rules using regular expressions and capturing
>>> groups. I want the matches to be line by line. I can do this using the
>>> following script
>>> ENGINE utils.PlainTextAnnotator;
>>> TYPESYSTEM utils.PlainTextTypeSystem;
>>> Document{-> RETAINTYPE(BREAK)};
>>> Document{-> EXEC(PlainTextAnnotator)};
>>> DECLARE Rule1NoPattern, Group1, Group2;
>>> Line{REGEXP(".*no|No (.*)") -> Rule1NoPattern};
>>> Given this text
>>> Not pregnant or nursing
>>> Fertile patients must use effective contraception (hormonal contraception
>>> or intra-uterine device [IUD])
>>> No concurrent participation in another clinical trial that would preclude
>>> the interventions or outcome assessment of this clinical trial
>>> No other concurrent anticancer therapy
>>> it correctly matches the last two lines and annotates them with
>>> Rule1NoPattern
>>> The problem is, I want to use the capturing group information as well. I
>>> can do this using the simple regular expression syntax
>>> ".*no|No (.*)\n|S" -> Rule1NoPattern, 1=Group1;
>>> if I just give it one line, say
>>> No other concurrent anticancer therapy
>>> it will correctly annotate the entire line with Rule1NoPattern, and "other
>>> concurrent anticancer therapy" will be annotated with Group1.
>>> Is there a way, using the first rule variant
>>> Line{REGEXP(".*no|No (.*)") -> Rule1NoPattern};
>>> to annotate the text in capturing group?
>>> I have tried all kinds of syntax, but none of it seems to be correct
>>> thanks,
>>> Bonnie MacKellar
