uima-user mailing list archives

From Bonnie MacKellar <bkmackel...@gmail.com>
Subject Re: question on REGEXP in Ruta
Date Wed, 10 Feb 2016 15:38:41 GMT
OK, thanks for your help!

On Tue, Feb 9, 2016 at 3:40 AM, Peter Klügl <peter.kluegl@averbis.com>
wrote:

> Hi,
>
> On 08.02.2016 at 23:35, Bonnie MacKellar wrote:
> > Hi,
> >
> > Thanks for the explanation of inlined rules. I think I get it. One
> > followup: we have a lot of these patterns, probably several hundred. And
> > the data files are not small. Is there a difference, performance-wise,
> > between using the BLOCK construct and inserting a bunch of rules, for
> > example
> > BLOCK(eachLine) Line{}{
> >     "(?i).*(?:no) (.*)" -> Rule1NoPattern, 1=Group1;
> >     "(?i)(.*)(not (?:allowed|permitted)).*" -> Rule1NoPattern, 1=Group1;
> >     "(?i).*(no (?:patients?|subjects?|women|woman|men|man|children|child) with)(.*)" -> Rule1NoPattern, 1=Group1;
> >     "(?i).*(no .*eviden(?:ce of|t)) (.*)" -> Rule1NoPattern, 2=Group1;
> >     // and so on
> > }
> > or doing this with the inlined rules?
>
> There are no evaluations, but I'd say that there is no real difference
> if you put all inlined rules in one rule element. In contrast to other
> rule-based systems, the performance of Ruta mainly depends on how you
> write the rules, e.g., the kind of conditions, outsourcing to wordlists,
> matching order and so on.
>
> In my experience, a large number of regexes may become slow on large documents.
>
>
> > On this rule
> > Line-> {ANY{REGEXP("[Nn]o")} @UmlsConcept{->SomeTag};};
> > If I understand it correctly, I don't think it will do the trick. This
> > would match the UmlsConcept first, right? And then match
> > ANY{REGEXP("[Nn]o")}, right? We care that the UmlsConcept annotation
> > occurs somewhere after the NO patterns, because it helps us identify
> > what it is that there can be NO of. We will be handling "No total body
> > irradiation in the past" differently from "no pregnancy". The annotation
> > won't actually be as generic as UmlsConcept, but rather will be based on
> > semantic type (I have an annotator I wrote that does that part).
>
> Ah ok. The @ should not change the result in this rule. If the negation
> does not need to occur directly before the concept, then you can add
> other (optional) types in between or a wildcard.
>
> Best,
>
> Peter
>
> > Thanks so much for your help. I have been able to move forward quite a
> > bit today because of it.
> >
> > Bonnie MacKellar
> >
> >
> > On Mon, Feb 8, 2016 at 9:19 AM, Peter Klügl <peter.kluegl@averbis.com>
> > wrote:
> >
> >> Hi,
> >>
> >> On 08.02.2016 at 14:49, Bonnie MacKellar wrote:
> >>> Hi,
> >>>
> >>> Thanks. This is very useful. I did not realize I could use the BLOCK
> >>> construct in this way. It completely did the trick for me. However, I
> >>> have to admit I don't understand your inlined rule
> >>> Line->{".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;};
> >>> I thought that anything after the -> should be a result, not a match.
> >>> I think I do not understand Ruta syntax very well, even though I have
> >>> read the guide a bunch of times. And that leads to my next concern...
> >> The documentation about inlined rules is here:
> >>
> >>
> >> https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.inlined
> >>
> >> There are two types of inlined rules: as conditions or as actions.
> >> The inlined rules are placed after the normal condition-action part, and
> >> the type of the inlined rules is indicated by an arrow.
> >> "->" means: treat them as actions
> >> "<-" means: treat them as conditions
> >>
> >> The "->" arrow is normally also used to separate conditions from
> >> actions, but here it indicates inlined rules.
> >> The overall typical syntax for an inlined rule (without some other
> >> stuff) looks something like:
> >> Type{CONDITIONS -> ACTIONS} -> {INLINED RULE;};
> >>
> >> Inlined rules as actions are almost the same as a BLOCK construct, but a
> >> bit more powerful since they can be restricted to a rule element in a
> >> larger rule.
> >> The rules within the brackets are applied within the context of the rule
> >> element if the overall rule was able to match.
> >> For your example with Line, this means that once the rule has
> >> successfully matched on a line, the inlined regexp rule is applied within
> >> that line. This happens for every line in the document.
> >>
> >> Inlined rules as conditions are evaluated when the conditions of the rule
> >> element are evaluated. The rule element matches only if one of the
> >> inlined rules was able to match.
> >>
> >> Right now, there are some limitations in the debug explanation of
> >> inlined rules in the Ruta Workbench. The explanation works better for
> >> blocks.
> >>
> >>
> >>> I do want to move this to annotations instead of strictly using regular
> >>> expressions. We have an extensive set of regular expressions which were
> >>> published by another group, and what we are trying to do is a) run
> >>> these expressions against a dataset which is different from the original
> >>> group's data, to see if we get the same statistics for matches, and then
> >>> b) try to improve the matches with annotations obtained from the MetaMap
> >>> annotator. To do that, I will need to switch to using annotations. I am
> >>> using the regular expressions now, and using them pretty much as they
> >>> were written, for better or worse, to make sure we don't change the
> >>> meaning in any way while getting the first set of statistics.
> >>>
> >>> I have been trying to rewrite these rules using annotations, but I am
> >>> totally failing at it. Everything I try gives me syntax errors. If I
> >>> wanted to change this example rule to something that used annotations,
> >>> what would it look like? This is obviously totally wrong and horrible:
> >>> LINE({ANY REGEXP((?:no|No)) ANY {CONTAINS UmlsConcept} ->MARK SomeTag})
> >>> What I am trying to say here is that a line that contains any characters
> >>> followed by the regular expression (no|No) followed by characters that
> >>> have been annotated with UmlsConcept should be marked with SomeTag. I
> >>> just can't figure out where to put the braces and arrows.
> >> Does this work for you:
> >>
> >> Line-> {ANY{REGEXP("[Nn]o")} @UmlsConcept{->SomeTag};};
> >>
> >> The rule matches on every Line annotation, which is successful for every
> >> visible line in the document. Then, the inlined rule is applied for
> >> every match (=Line): ANY{REGEXP("[Nn]o")} @UmlsConcept{->SomeTag};
> >> The @ says that the rule should not start to match with the first rule
> >> element but with the rule element with the @ in front of it. This is
> >> just an optimization because ANY would match on every token.
> >> This is the second rule element: @UmlsConcept{->SomeTag}
> >> This rule element just matches on an annotation of the type UmlsConcept.
> >> If there was an UmlsConcept annotation, the rule continues the matching
> >> process with the remaining rule elements, that is the first one. It
> >> checks if the previous token (ANY) fulfills the regular expression
> >> "[Nn]o". If that is also successful, then the only action of these rules
> >> is executed. The implicit action creates a new annotation of the type
> >> "SomeTag" on the offsets of the UmlsConcept annotation.
> >>
> >> Does this help? Do not hesitate to ask if you have more questions, e.g.,
> >> about other rules.
> >>
> >> btw, if you have more rules like this, I'd recommend that you use a
> >> BLOCK construct and separately annotate the negation indicators with a
> >> special type.
> >>
> >> Best,
> >>
> >> Peter
> >>
> >>
> >>> Anyway, my regular expressions seem to be working now!
> >>>
> >>> Bonnie MacKellar
> >>>
> >>> On Mon, Feb 8, 2016 at 4:48 AM, Peter Klügl <peter.kluegl@averbis.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Capturing groups are not supported by the REGEXP condition since it is
> >>>> essentially just a boolean function and cannot transfer its internal
> >>>> information to an action which creates annotations. However, there are
> >>>> many other ways to solve it.
> >>>>
> >>>> There may be a problem with your regexp. I changed it to
> >>>> ".*(?:no|No) (.*)" in the following.
> >>>>
> >>>> You can, for example, use the simple regexp rule and restrict its
> >>>> matching context to each line:
> >>>>
> >>>> ... with a BLOCK:
> >>>> BLOCK(eachLine) Line{}{
> >>>>     ".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;
> >>>> }
> >>>>
> >>>> ... with an inlined rule:
> >>>> Line->{".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;};
> >>>>
> >>>> Some additional comments:
> >>>>
> >>>> You should mention the type Line in the EXEC action for reindexing if
> >>>> you want to use these annotations in the following rules:
> >>>> Document{-> EXEC(PlainTextAnnotator, {Line})};
> >>>> For your rules, it does not make a difference, but if you use other
> >>>> conditions like PARTOF, it will not work correctly.
> >>>>
> >>>> From my experience, I'd recommend working directly with annotations
> >>>> instead of regexes for detecting the target of a negation. Then, you
> >>>> can refactor the rules more easily, e.g., if you have a rule like
> >>>> Line->{PrefixNegationInd #{-> Group1};}; you can replace the wildcard
> >>>> with something better in the future, like ChunkNP. (I just wanted to
> >>>> mention it. I know that your example was probably just an example to
> >>>> describe the problem with Ruta.)
> >>>>
> >>>> Best,
> >>>>
> >>>> Peter
> >>>>
> >>>> On 08.02.2016 at 00:37, Bonnie MacKellar wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I am trying to write Ruta rules using regular expressions and
> >>>>> capturing groups. I want the matches to be line by line. I can do this
> >>>>> using the following script
> >>>>>
> >>>>> ENGINE utils.PlainTextAnnotator;
> >>>>> TYPESYSTEM utils.PlainTextTypeSystem;
> >>>>> Document{-> RETAINTYPE(BREAK)};
> >>>>> Document{-> EXEC(PlainTextAnnotator)};
> >>>>> DECLARE Rule1NoPattern, Group1, Group2;
> >>>>> Line{REGEXP(".*no|No (.*)") -> Rule1NoPattern};
> >>>>>
> >>>>> Given this text
> >>>>> Not pregnant or nursing
> >>>>> Fertile patients must use effective contraception (hormonal contraception or intra-uterine device [IUD])
> >>>>> No concurrent participation in another clinical trial that would preclude the interventions or outcome assessment of this clinical trial
> >>>>> No other concurrent anticancer therapy
> >>>>>
> >>>>> it correctly matches the last two lines and annotates them with
> >>>>> Rule1NoPattern.
> >>>>> The problem is, I want to use the capturing group information as
> >>>>> well. I can do this using the simple regular expression syntax
> >>>>> ".*no|No (.*)\n|S" -> Rule1NoPattern, 1=Group1;
> >>>>>
> >>>>> if I just give it one line, say
> >>>>> No other concurrent anticancer therapy
> >>>>>
> >>>>> it will correctly annotate the entire line with Rule1NoPattern, and
> >>>>> "other concurrent anticancer therapy" will be annotated with Group1.
> >>>>> Is there a way, using the first rule variant
> >>>>> Line{REGEXP(".*no|No (.*)") -> Rule1NoPattern};
> >>>>>
> >>>>> to annotate the text in the capturing group?
> >>>>>
> >>>>> I have tried all kinds of syntax, but none of it seems to be correct.
> >>>>>
> >>>>> Thanks,
> >>>>> Bonnie MacKellar
> >>>>>
> >>
>
>
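
The regexp correction discussed in the thread (".*no|No (.*)" changed to ".*(?:no|No) (.*)") can be reproduced outside Ruta with plain java.util.regex. The following is only a rough sketch under some assumptions: the class name NegationRegexDemo, the helper methods, and the two extra sample lines ("There is no evidence of disease" and "The answer is no") are made up for illustration and do not come from the thread. It also assumes whole-line matching via Matcher.matches(), which may not be exactly how Ruta evaluates a regexp on a Line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegationRegexDemo {
    // Pattern from the first message: '|' splits the WHOLE expression,
    // so it reads as either ".*no" or "No (.*)".
    static final String ORIGINAL = ".*no|No (.*)";
    // Pattern suggested in the reply: the alternation is limited to no/No.
    static final String CORRECTED = ".*(?:no|No) (.*)";

    /** True if the whole line matches the regex (assumed full-match semantics). */
    static boolean matchesLine(String regex, String line) {
        return Pattern.compile(regex).matcher(line).matches();
    }

    /**
     * Text of capturing group 1, or null if the line does not match
     * or group 1 did not take part in the match.
     */
    static String group1(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Not pregnant or nursing",                // from the thread
            "No other concurrent anticancer therapy", // from the thread
            "There is no evidence of disease",        // hypothetical sample
            "The answer is no"                        // hypothetical sample
        };
        for (String line : lines) {
            for (String regex : new String[] {ORIGINAL, CORRECTED}) {
                System.out.printf("%-22s | %-40s | match=%-5b | group1=%s%n",
                        regex, line, matchesLine(regex, line), group1(regex, line));
            }
        }
    }
}
```

Because '|' has the lowest precedence, the original pattern only fills group 1 when its second branch ("No (.*)") is the one that matches. A line that matches through the first branch (".*no") leaves group 1 empty, and a lowercase "no" in the middle of a line is not captured at all. The corrected pattern keeps the capturing group in every match.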
