uima-user mailing list archives

From Bonnie MacKellar <bkmackel...@gmail.com>
Subject Re: question on REGEXP in Ruta
Date Mon, 08 Feb 2016 13:49:26 GMT

Thanks. This is very useful. I did not realize I could use the BLOCK
construct in this way. It completely did the trick for me. However, I have
to admit I don't understand your inlined rule
Line->{".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;};
I thought that anything after the -> should be a result, not a match. I
think I do not understand Ruta syntax very well, even though I have read
the guide a bunch of times. And that leads to my next concern...

I do want to move this to annotations instead of strictly using regular
expressions. We have an extensive set of regular expressions which were
published by another group, and what we are trying to do is a) run these
expressions against a dataset which is different from the original group's
data, to see if we get the same statistics for matches, and then b) try to
improve the matches with annotations obtained from the Metamap annotator.
To do that, I will need to switch to using annotations. I am using the
regular expressions now, and using them pretty much as they were written
for better or worse, to make sure we don't change the meaning in any way
while getting the first set of statistics.

I have been trying to rewrite these rules using annotations, but I am
totally failing at it. Everything I try gives me syntax errors. If I wanted
to change this example rule to something that used annotations, what would
it look like? This is obviously totally wrong and horrible:
LINE({ANY REGEXP((?:no|No)) ANY {CONTAINS UmlsConcept} ->MARK SomeTag})
What I am trying to say here is that a line that contains any characters,
followed by the regular expression (no|No), followed by characters that
have been annotated with UmlsConcept, should be marked with SomeTag. I just
can't figure out where to put the braces and arrows.

Anyway, my regular expressions seem to be working now!

Bonnie MacKellar

On Mon, Feb 8, 2016 at 4:48 AM, Peter Klügl <peter.kluegl@averbis.com>

> Hi,
> capturing groups are not supported by the REGEXP condition since it is
> essentially just a boolean function and cannot transfer its internal
> information to an action which creates annotations. However, there are
> many other ways to solve it.
> There is maybe a problem with your regexp. I changed it to ".*(?:no|No)
> (.*)" in the following.
> You can, for example, use the simple regexp rule and restrict its
> matching context to each line:
> ... with a BLOCK:
> BLOCK(eachLine) Line{}{
>     ".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;
> }
> ... with an inlined rule:
> Line->{".*(?:no|No) (.*)" -> Rule1NoPattern, 1=Group1;};
> Some additional comments:
> You should mention the type Line in the EXEC action for reindexing, if
> you want to use these annotations in the following rules:
> Document{-> EXEC(PlainTextAnnotator, {Line})};
> For your rules, it does not make a difference, but if you use other
> conditions like PARTOF, it will not work correctly.
> From my experience, I'd recommend to work directly with annotations
> instead of regexes for detecting the target of a negation. Then, you can
> refactor the rules more easily, e.g., if you have a rule like
> Line->{PrefixNegationInd #{-> Group1};}; you can replace the wildcard
> with something better in future like ChunkNP. (I just wanted to mention
> it. I know that your example was probably just an example to describe
> the problem with ruta.)
> Best,
> Peter
> Am 08.02.2016 um 00:37 schrieb Bonnie MacKellar:
> > Hi,
> >
> > I am trying to write RUTA rules using regular expressions and capturing
> > groups. I want the matches to be line by line. I can do this using the
> > following script
> >
> > ENGINE utils.PlainTextAnnotator;
> > TYPESYSTEM utils.PlainTextTypeSystem;
> > Document{-> RETAINTYPE(BREAK)};
> > Document{-> EXEC(PlainTextAnnotator)};
> > DECLARE Rule1NoPattern, Group1, Group2;
> > Line{REGEXP(".*no|No (.*)") -> Rule1NoPattern};
> >
> > Given this text
> > Not pregnant or nursing
> > Fertile patients must use effective contraception (hormonal contraception
> > or intra-uterine device [IUD])
> > No concurrent participation in another clinical trial that would preclude
> > the interventions or outcome assessment of this clinical trial
> > No other concurrent anticancer therapy
> >
> > it correctly matches the last two lines and annotates them with
> > Rule1NoPattern
> > The problem is, I want to use the capturing group information as well. I
> > can do this using the simple regular expression syntax
> > ".*no|No (.*)\n|S" -> Rule1NoPattern, 1=Group1;
> >
> > if I just give it one line, say
> > No other concurrent anticancer therapy
> >
> > it will correctly annotate the entire line with Rule1NoPattern, and "other
> > concurrent anticancer therapy" will be annotated with Group1.
> > Is there a way, using the first rule variant
> > Line{REGEXP(".*no|No (.*)") -> Rule1NoPattern};
> >
> > to annotate the text in capturing group?
> >
> > I have tried all kinds of syntax, but none of it seems to be correct
> >
> > thanks,
> > Bonnie MacKellar
> >
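
The regexp change in the quoted reply comes down to alternation precedence: in ".*no|No (.*)" the | splits the whole pattern into ".*no" or "No (.*)", so the capturing group exists only in the second branch. Below is a minimal sketch of the difference. It uses Python's re module as a stand-in (Ruta runs on Java regular expressions, which treat these particular patterns the same way as far as I know). Whole-line matching via fullmatch is an assumption about how the pattern is applied, and the helper name first_group and the sample strings other than the line from the thread are hypothetical.

```python
import re

# Pattern as first posted: top-level alternation between ".*no" and "No (.*)".
ORIGINAL = re.compile(r".*no|No (.*)")

# Pattern from the reply: the alternation is confined to a non-capturing group,
# so the prefix ".*" and the capturing group apply to both "no" and "No".
REVISED = re.compile(r".*(?:no|No) (.*)")


def first_group(pattern, line):
    """Match the whole line; return (matched, text of group 1 or None)."""
    m = pattern.fullmatch(line)
    return (m is not None, m.group(1) if m else None)


SAMPLES = [
    "No other concurrent anticancer therapy",  # line taken from the thread
    "Patients with no prior therapy",          # hypothetical: "no" mid-line
    "piano",                                   # hypothetical: merely ends in "no"
]

if __name__ == "__main__":
    for line in SAMPLES:
        print(repr(line))
        print("  original:", first_group(ORIGINAL, line))
        print("  revised: ", first_group(REVISED, line))
```

With the original pattern, "piano" matches through the ".*no" branch and group 1 stays empty, while "Patients with no prior therapy" does not match at all. With the revised pattern, "piano" is rejected and the mid-line "no" is found with "prior therapy" in group 1. Both patterns agree on the line from the thread.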
