uima-user mailing list archives

From Mario Juric ...@unsilo.ai>
Subject Re: Question about covering annotations in Ruta match semantics
Date Mon, 21 Oct 2019 19:46:09 GMT
Thanks Peter,

No problem with the delay. I was on vacation myself, and sometimes it is just necessary to
pull the plug :)

I am just happy that you take the time to answer my questions, and I think your answers help
make sense of this. I now have some ideas that I can experiment with to see what works.
It does seem possible to use RutaBasic when optional spaces are included in the rules, although
it gets more awkward. I would still prefer to avoid this, and a type-based rule-logic
feature would make sense in our case. Shall I create a feature request for this?

I wouldn’t expect you to do this any time soon, but let me know if there is something I
could help out with when the time comes.

Cheers,
Mario

> On 18 Oct 2019, at 10:10 , Peter Klügl <peter.kluegl@averbis.com> wrote:
> 
> Hi,
> 
> 
> sorry for the delayed reply.
> 
> 
> comments below...
> 
> 
> On 09.10.2019 at 22:19, Mario Juric wrote:
>> Hi Peter,
>> 
>> Thanks a lot for the answer.
>> 
>> I am still trying to wrap my head around this, and I understand the issues at play
>> when dealing with a generic rule engine, since I am looking at an isolated case only. I was
>> just thinking that in my particular case the covering annotation starts before the matched
>> 'Dog Cat', so why would its ending right before Cat prevent the rule from firing? It doesn’t
>> follow Dog, and a rule like "Dog Covering {-> MARK(CHASE)}" therefore wouldn’t match
>> either, but I understand now that the mere presence of something else in the area between
>> the two rule elements is enough for the match to fail. However, as you describe, with SPACE
>> annotations present, a rule like Dog SPACE Cat { -> MARK(CHASE)} would succeed in matching
>> despite the covering annotation.
> 
> 
> The main thing here is probably the requirement that the logic for
> applying the visibility concept should always be symmetric, meaning it
> should be the same regardless if the rule matches from left to right or
> from right to left (or inside out).
> 
> In your example, the rule matches from left to right (I assume), so the
> behavior that the last space is not skipped is not intuitive at all.
> However, if the rule matched from right to left for some reason,
> e.g., because of dynamic anchoring or a manual anchor, then the
> inference would detect a starting Covering annotation as the next
> possible position, which is not invisible (since there is nothing at all
> invisible). So there would actually be something that could be matched,
> but it is not the correct type (Dog).
> 
> I do not know if this explanation makes sense... it's easier with a
> whiteboard ;-)
> 
> 
> 
>> Have you ever described the implementation of the matching in some paper or similar?
>> I would be interested to have a look at it, but maybe it’s better just to have a go at the
>> code? I would certainly prefer reading a high level abstract specification first though :)
> 
> 
> The last paper is the NLE journal article, which contains some high
> level description of the algorithm. However, this is some really
> specific functionality for a specific scenario. So, if I write a new
> paper, it will most likely not cover this.
> 
> 
>> 
>> Generally I cannot just trim the annotations in the real application, since some
>> of these whitespaces are included in the marking for various reasons. I therefore played
>> around with type filtering, hoping that the type filter would allow me to match the rules
>> while ignoring any presence of filtered types. I was again surprised to find that filtering
>> the Covering type while retaining Cat and Dog would in this case just prevent anything from
>> being matched, because it seems to make all text parts invisible where the filtered
>> types appear, no matter if they cover any retained annotation types. So this didn’t seem
>> to solve my problem either, although I could of course try to mark those areas I would
>> otherwise consider trimming and include them in the rules like a space, or filter on them,
>> which I guess is what you suggested. It just becomes somewhat awkward though, and it may
>> be clearer to use RutaBasic in the rules instead.
> 
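[Editorial sketch, not part of the original message.] The filtering observation above can be pictured with a toy model in Python: filtering a type hides the text it covers, not just the annotations of that type. This is only a sketch under assumptions, not Ruta's actual algorithm. The function names are hypothetical, visibility is tracked per character offset, and an annotation is assumed to be visible only if none of its characters are hidden. The offsets are those of the 'cat dog cat' example from this thread.

```python
from typing import List, Set, Tuple

Annotation = Tuple[str, int, int]  # (type name, begin, end), end exclusive


def invisible_positions(annotations: List[Annotation], filtered: Set[str]) -> Set[int]:
    """Assumed rule: every character covered by a filtered type becomes invisible."""
    hidden: Set[int] = set()
    for type_name, begin, end in annotations:
        if type_name in filtered:
            hidden.update(range(begin, end))
    return hidden


def visible(annotations: List[Annotation], filtered: Set[str]) -> List[Annotation]:
    """Keep unfiltered annotations that do not touch any invisible character."""
    hidden = invisible_positions(annotations, filtered)
    return [a for a in annotations
            if a[0] not in filtered and hidden.isdisjoint(range(a[1], a[2]))]


annotations = [("Cat", 0, 3), ("Dog", 4, 7), ("Cat", 8, 11), ("Covering", 0, 8)]

# Filtering Covering hides offsets 0..7, which also hides the first Cat and Dog.
print(visible(annotations, {"Covering"}))  # [('Cat', 8, 11)]
```

In this model only the last Cat stays visible once Covering is filtered, so neither "Cat Dog" nor "Dog Cat" has two visible elements left to match.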
> 
> Yes, the visibility concept in Ruta is not type-based but type
> coverage-based (and I think that's really cool)
> 
> It is possible to extend the functionality to additionally support
> type-based logic, but I do not know when this would be ready.
> 
> I would not recommend using RutaBasic in the rules (I actually do not
> know right now if it would work), but if you do, then you should
> probably deactivate the "empty is invisible" option.
> 
> 
> Best,
> 
> 
> Peter
> 
> 
>> 
>> Cheers,
>> Mario
>> 
>>> On 9 Oct 2019, at 09:35 , Peter Klügl <peter.kluegl@averbis.com> wrote:
>>> 
>>> Hi Mario,
>>> 
>>> 
>>> I need to take a closer look as this is not the usual scenario :-)
>>> 
>>> 
>>> However, without testing, I would assume that the second rule does not
>>> match because the space between dog and cat is not "empty".
>>> 
>>> 
>>> Normally, you have a complete partitioning provided by the seeding which
>>> causes the RutaBasic annotations. If there are only a few annotations,
>>> then there needs to be a decision if a text position is visible or not
>>> (as you have no SPACE, BREAK and MARKUP annotation). You would expect
>>> that the space between the annotations is ignored, but there is actually
>>> no reason why Ruta should do that, as there is no information at all
>>> that it should be ignored (... generic system, you might want to write
>>> rules for whitespaces...). In order to avoid this problem in such
>>> situations there is the option to define empty RutaBasics as invisible.
>>> Those are text positions where no annotation begins or ends (and that are not
>>> covered by annotations), AFAIR, and where sequential matching could not match at
>>> all anyway. Thus, the first space is ignored, but not the second,
>>> because the Covering annotation ends there.
>>> 
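[Editorial sketch, not part of the original message.] The "empty is invisible" behaviour described above can be illustrated with a small toy model in Python. It is a sketch under stated assumptions, not Ruta's implementation. The helper names (`basics`, `is_empty`, `can_skip_gap`) are hypothetical. A minimal span is assumed to be empty when no annotation begins at its start or ends at its end, and coverage by a longer annotation is ignored, which is what the reported behaviour of the first space suggests. The offsets are those of the 'cat dog cat' example from this thread.

```python
from typing import List, Tuple

Annotation = Tuple[str, int, int]  # (type name, begin, end), end exclusive
Span = Tuple[int, int]


def basics(annotations: List[Annotation]) -> List[Span]:
    """Split the text into minimal spans at every annotation boundary."""
    bounds = sorted({b for _, b, _ in annotations} | {e for _, _, e in annotations})
    return list(zip(bounds, bounds[1:]))


def is_empty(span: Span, annotations: List[Annotation]) -> bool:
    """Assumed rule: empty if nothing begins at the span start or ends at its end."""
    begin, end = span
    return not any(b == begin or e == end for _, b, e in annotations)


def can_skip_gap(left: Annotation, right: Annotation,
                 annotations: List[Annotation]) -> bool:
    """True if every minimal span between left and right is empty."""
    gap = [s for s in basics(annotations) if left[2] <= s[0] and s[1] <= right[1]]
    return all(is_empty(s, annotations) for s in gap)


cat1, dog, cat2 = ("Cat", 0, 3), ("Dog", 4, 7), ("Cat", 8, 11)
wide = [cat1, dog, cat2, ("Covering", 0, 8)]    # Covering ends after the second space
narrow = [cat1, dog, cat2, ("Covering", 0, 7)]  # Covering ends together with Dog

print(can_skip_gap(cat1, dog, wide))    # True
print(can_skip_gap(dog, cat2, wide))    # False
print(can_skip_gap(dog, cat2, narrow))  # True
```

Under these assumptions the gap after Dog stops being skippable as soon as Covering ends at offset 8, and becomes skippable again when Covering ends at 7, which is consistent with what was reported for the two rules.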
>>> 
>>> Does that make sense?
>>> 
>>> 
>>> I think there are many options for making your rules more robust, but
>>> that depends on your complete system/pipeline. Is it an option to trim
>>> annotations in order to avoid whitespaces at the beginning or ending? Is
>>> it easy to identify these positions? You could create an annotation
>>> there and filter its type.
>>> 
>>> 
>>> 
>>> Best,
>>> 
>>> 
>>> Peter
>>> 
>>> 
>>> 
>>> On 07.10.2019 at 10:21, Mario Juric wrote:
>>>> Hi Peter,
>>>> 
>>>> I have a script that is executed without any seeders for performance reasons,
>>>> and we don’t need the seeded annotations in that case. I have an issue involving annotation
>>>> elements that partially cover the rule elements of interest, and I do not have a simple
>>>> solution for it, so I have a question about the match semantics. Let me explain it using a
>>>> simple example and the text ‘cat dog cat’.
>>>> 
>>>> Assume the following 4 annotation types and 2 rule statements:
>>>> 
>>>> DECLARE Covering;
>>>> DECLARE Cat;
>>>> DECLARE Dog;
>>>> DECLARE CHASE;
>>>> Cat Dog { -> MARK(CHASE)};
>>>> Dog Cat { -> MARK(CHASE)};
>>>> 
>>>> Assume prior to script execution the following annotations with beginnings and endings:
>>>> 
>>>> Cat[0,3[
>>>> Dog[4,7[
>>>> Cat[8,11[
>>>> Covering[0,8[
>>>> 
>>>> The Covering annotation is an example of the disturbing element that I observed,
>>>> which has little or nothing to do with what I am trying to match. It just happens to be
>>>> there for a reason unrelated to these rules, but it causes the second rule not to match
>>>> when I expected it to. Only the first rule fires, but the second also fires when I change
>>>> the Covering bounds to [0,7[.
>>>> 
>>>> The order in which elements are matched seems very different from how they
>>>> are usually selected from the CAS index, where you would get 'Covering Cat Dog Cat', and
>>>> with this order you would intuitively expect both rules to match. This would probably be
>>>> overly simplified though, since I would not be able to match adjacent covering annotations
>>>> this way, so I believe matching is somehow based on edge detection. Still, I have difficulty
>>>> understanding why that extra covered space makes a difference.
>>>> 
>>>> I was hoping you could provide me with some details, and I would also like to know
>>>> what possible workaround options I have. I was considering playing around with type
>>>> filtering, but it would require a bit of adding/removing types to be filtered during the
>>>> script, so it didn’t seem like the simplest solution. Ensuring that the covering annotation
>>>> always aligns with the end of a token is another possibility in this particular case, but I
>>>> still need to add general robustness to the Ruta script against these scenarios. Any
>>>> feedback is much appreciated, thanks :)
>>>> 
>>>> Cheers,
>>>> Mario
>>>> 
>>> -- 
>>> Dr. Peter Klügl
>>> R&D Text Mining/Machine Learning
>>> 
>>> Averbis GmbH
>>> Salzstr. 15
>>> 79098 Freiburg
>>> Germany
>>> 
>>> Fon: +49 761 708 394 0
>>> Fax: +49 761 708 394 10
>>> Email: peter.kluegl@averbis.com
>>> Web: https://averbis.com
>>> 
>>> Headquarters: Freiburg im Breisgau
>>> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
>>> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
>>> 
>> 

