uima-user mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: question on REGEXP in Ruta
Date Tue, 09 Feb 2016 08:40:43 GMT

On 08.02.2016 at 23:35, Bonnie MacKellar wrote:
> Hi,
> Thanks for the explanation of inlined rules. I think I get it. One
> followup: we have a lot of these patterns, probably several hundred. And
> the data files are not small. Is there a difference, performance wise,
> between using the block construct and inserting a bunch of rules, for
> example
> BLOCK(eachLine) Line{}{
>     "(?i).*(?:no) (.*)" -> Rule1NoPattern, 1=Group1;
>     "(?i)(.*)(not (?:allowed|permitted)).*" -> Rule1NoPattern, 1=Group1;
>     "(?i).*(no (?:patients?|subjects?|women|woman|men|man|children|child) with)(.*)" -> Rule1NoPattern, 1=Group1;
>     "(?i).*(no .*eviden(?:ce of|t)) (.*)" -> Rule1NoPattern, 2=Group1;
>     // and so on
> }
> or doing this with the inlined rules?

There are no evaluations, but I'd say that there is no real difference
if you put all inlined rules in one rule element. In contrast to other
rule-based systems, the performance of Ruta mainly depends on how you
write the rules, e.g., the kind of conditions, outsourcing to wordlists,
matching order and so on.

In my experience, a large number of regexes may become slow for large documents.
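
As a rough, untested sketch of the wordlist idea: the negation cues could be
marked once from a word list, and later rules could then refer to that
annotation instead of repeating a regex in every pattern. The file name
NegationCues.txt and the type NegInd are made up for illustration, and Line is
assumed to come from the PlainTextAnnotator as in the script quoted below.
Whether this is actually faster on your data has not been measured.

```
// Assumption: NegationCues.txt is a hypothetical word list, one cue per line
WORDLIST NegationCues = 'NegationCues.txt';
DECLARE NegInd, Group1;

// mark every occurrence of a listed cue; the third argument ignores case
Document{-> MARKFAST(NegInd, NegationCues, true)};

BLOCK(eachLine) Line{}{
    // everything after the cue up to the end of the line becomes Group1
    NegInd #{-> Group1};
}
```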

> On this rule
> Line-> {ANY{REGEXP("[Nn]o")} @UmlsConcept{->SomeTag};};
> If I understand it correctly, I don't think it will do the trick. This
> would match the UMLSconcept first, right? and then match
> ANY{ANY{REGEXP("[Nn]o")},
> right? We care that the UMLSConcept annotation occurs somewhere after the
> NO patterns, because it helps us identify what it is that there can be NO
> of. We will be handling "No total body irradiation in the past" differently
> from "no pregnancy". The annotation won't actually be so generic as
> UMLSConcept, but rather will be based on semantic type (I have an annotator
> I wrote that does that part).

Ah ok. The @ should not change the result in this rule. If the negation
does not need to occur directly before the concept, then you can add
other (optional) types in between or a wildcard.
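
A minimal, untested sketch of the wildcard variant, assuming the same types
as in the rule above. It drops the @, so matching starts at the first rule
element and every token in the line is checked against the regex:

```
Line->{
    // "no"/"No", then anything up to the next UmlsConcept within the line
    ANY{REGEXP("[Nn]o")} # UmlsConcept{-> SomeTag};
};
```

Since the inlined rule is only applied within each Line, the wildcard should
not run past the end of the line.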



> thanks so much for your help. I have been able to move forwards quite a bit
> today because of it.
> Bonnie MacKellar
> On Mon, Feb 8, 2016 at 9:19 AM, Peter Klügl <peter.kluegl@averbis.com>
> wrote:
>> Hi,
>> On 08.02.2016 at 14:49, Bonnie MacKellar wrote:
>>> Hi,
>>> Thanks. This is very useful. I did not realize I could use the BLOCK
>>> construct in this way. It completely did the trick for me. However, I have
>>> to admit I don't understand your inlined rule
>>> Line->{".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;};
>>> I thought that anything after the -> should be a result, not a match. I
>>> think I do not understand Ruta syntax very well, even though I have read
>>> the guide a bunch of times. And that leads to my next concern...
>> The documentation about inlined rules is here:
>> https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.inlined
>> There are two types of inlined rules: as conditions or as actions.
>> The inlined rules are placed after the normal condition-action part, and
>> the type of the inlined rules is indicated by an arrow.
>> "->" means: treat them as actions
>> "<-" means: treat them as conditions
>> The "->" arrow is normally also used to separate conditions from
>> actions, but here it indicates inlined rules.
>> The overall typical syntax for an inlined rule (without some other
>> stuff) looks something like:
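>> A minimal, untested sketch of that shape, using the types from your
>> example as placeholders. The CONTAINS condition is only there to show
>> where the normal condition-action part goes:
>> ```
>> // normal part in the first braces, then the arrow, then the inlined rules
>> Line{CONTAINS(UmlsConcept)}->{
>>     UmlsConcept{-> SomeTag};
>> };
>> ```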
>> Inlined rules as actions are almost the same as a BLOCK construct, but a
>> bit more powerful since they can be restricted to a rule element in a
>> larger rule.
>> The rules within the brackets are applied within the context of the rule
>> element if the overall rule was able to match.
>> For your example with line, this means that the rule successfully
>> matched on a line, then the inlined regexp rule tries to apply within
>> the given line, which is every line in the document.
>> Inlined rules as conditions are evaluated when the conditions of the rule
>> element are evaluated. The rule element only matches if one of the
>> inlined rules was able to match.
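>> As an untested sketch with the types from your example: here the Line
>> should only be annotated with Rule1NoPattern if the inlined rule finds a
>> "no" followed by some UmlsConcept inside that line.
>> ```
>> Line{-> Rule1NoPattern}<-{
>>     // acts as a condition on the Line; creates no annotation itself
>>     ANY{REGEXP("[Nn]o")} # UmlsConcept;
>> };
>> ```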
>> Right now, there are some limitations in the debug explanation of
>> inlined rules in the Ruta Workbench. The explanation works better for
>> blocks.
>>> I do want to move this to annotations instead of strictly using regular
>>> expressions. We have an extensive set of regular expressions which were
>>> published by another group, and what we are trying to do is a) run these
>>> expressions against a dataset which is different from the original group's
>>> data, to see if we get the same statistics for matches, and then b) try to
>>> improve the matches with annotations obtained from the Metamap annotator.
>>> To do that, I will need to switch to using annotations. I am using the
>>> regular expressions now, and using them pretty much as they were written
>>> for better or worse, to make sure we don't change the meaning in any way
>>> while getting the first set of statistics.
>>> I have been trying to rewrite these rules using annotations, but I am
>>> totally failing at it. Everything I try gives me syntax errors. If I wanted
>>> to change this example rule to something that used annotations, what would
>>> it look like? This is obviously totally wrong and horrible
>>> LINE({ANY REGEXP((?:no|No)) ANY {CONTAINS UmlsConcept} ->MARK SomeTag})
>>> what I am trying to say here is that a line that contains any characters
>>> followed by the regular expression (no|No) followed by characters that
>>> have been annotated with UmlsConcept should be marked with SomeTag. I just
>>> can't figure out where to put the braces and arrows.
>> Does this work for you:
>> Line-> {ANY{REGEXP("[Nn]o")} @UmlsConcept{->SomeTag};};
>> The rule matches on every Line annotation, which is successful for every
>> visible line in the document. Then, the inlined rule is applied for
>> every match (=Line): ANY{REGEXP("[Nn]o")} @UmlsConcept{->SomeTag};
>> The @ says that the rule should not start to match with the first rule
>> element but with the rule element with the @ in front of it. This is
>> just an optimization because ANY would match on every token.
>> This is the second rule element: @UmlsConcept{->SomeTag}
>> This rule element just matches on an annotation of the type UmlsConcept.
>> If there was an UmlsConcept annotation, the rule continues the matching
>> process with the remaining rule elements, that is the first one. It
>> checks if the previous token (ANY) fulfills the regular expression
>> "[Nn]o". If that is also successful, then the only action of this rule
>> is executed. The implicit action creates a new annotation of the type
>> "SomeTag" on the offsets of the UmlsConcept annotation.
>> Does this help? Do not hesitate to ask if you have more questions, e.g.,
>> about other rules.
>> btw, if you have more rules like this, I'd recommend that you use a
>> BLOCK construct and separately annotate the negation indicators with a
>> special type.
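>> A rough, untested sketch of that setup; NegInd is a made-up type name,
>> and the other types are the ones from your example:
>> ```
>> DECLARE NegInd;
>> BLOCK(eachLine) Line{}{
>>     // annotate the negation indicators once ...
>>     ANY{REGEXP("[Nn]o") -> NegInd};
>>     // ... and let the other rules refer to the type instead of the regex
>>     NegInd # UmlsConcept{-> SomeTag};
>> }
>> ```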
>> Best,
>> Peter
>>> Anyway, my regular expressions seem to be working now!
>>> Bonnie MacKellar
>>> On Mon, Feb 8, 2016 at 4:48 AM, Peter Klügl <peter.kluegl@averbis.com>
>>> wrote:
>>>> Hi,
>>>> capturing groups are not supported by the REGEXP condition since it is
>>>> essentially just a boolean function and cannot transfer its internal
>>>> information to an action which creates annotations. However, there are
>>>> many other ways to solve it.
>>>> There is maybe a problem with your regexp. I changed it to ".*(?:no|No)
>>>> (.*)" in the following.
>>>> You can, for example, use the simple regexp rule and restrict its
>>>> matching context to each line:
>>>> ... with a BLOCK:
>>>> BLOCK(eachLine) Line{}{
>>>>     ".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;
>>>> }
>>>> ... with an inlined rule:
>>>> Line->{".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;};
>>>> Some additional comments:
>>>> You should mention the type Line in the EXEC action for reindexing, if
>>>> you want to use these annotations in the following rules:
>>>> Document{-> EXEC(PlainTextAnnotator, {Line})};
>>>> For your rules, it does not make a difference, but if you use other
>>>> conditions like PARTOF, it will not work correctly.
>>>> From my experience, I'd recommend working directly with annotations
>>>> instead of regexes for detecting the target of a negation. Then, you can
>>>> refactor the rules more easily, e.g., if you have a rule like
>>>> Line->{PrefixNegationInd #{-> Group1};}; you can replace the wildcard
>>>> with something better in future like ChunkNP. (I just wanted to mention
>>>> it. I know that your example was probably just an example to describe
>>>> the problem with ruta.)
>>>> Best,
>>>> Peter
>>>> On 08.02.2016 at 00:37, Bonnie MacKellar wrote:
>>>>> Hi,
>>>>> I am trying to write RUTA rules using regular expressions and capturing
>>>>> groups. I want the matches to be line by line. I can do this using the
>>>>> following script
>>>>> ENGINE utils.PlainTextAnnotator;
>>>>> TYPESYSTEM utils.PlainTextTypeSystem;
>>>>> Document{-> RETAINTYPE(BREAK)};
>>>>> Document{-> EXEC(PlainTextAnnotator)};
>>>>> DECLARE Rule1NoPattern, Group1, Group2;
>>>>> Line{REGEXP(".*no|No (.*)") -> Rule1NoPattern};
>>>>> Given this text
>>>>> Not pregnant or nursing
>>>>> Fertile patients must use effective contraception (hormonal contraception
>>>>> or intra-uterine device [IUD])
>>>>> No concurrent participation in another clinical trial that would preclude
>>>>> the interventions or outcome assessment of this clinical trial
>>>>> No other concurrent anticancer therapy
>>>>> it correctly matches the last two lines and annotates them with
>>>>> Rule1NoPattern
>>>>> The problem is, I want to use the capturing group information as well. I
>>>>> can do this using the simple regular expression syntax
>>>>> ".*no|No (.*)\n|S" -> Rule1NoPattern, 1=Group1;
>>>>> if I just give it one line, say
>>>>> No other concurrent anticancer therapy
>>>>> it will correctly annotate the entire line with Rule1NoPattern, and "other
>>>>> concurrent anticancer therapy" will be annotated with Group1.
>>>>> Is there a way, using the first rule variant
>>>>> Line{REGEXP(".*no|No (.*)") -> Rule1NoPattern};
>>>>> to annotate the text in capturing group?
>>>>> I have tried all kinds of syntax, but none of it seems to be correct
>>>>> thanks,
>>>>> Bonnie MacKellar
