uima-user mailing list archives

From Peter Klügl <pklu...@uni-wuerzburg.de>
Subject Re: Ruta - MARKFAST
Date Mon, 30 Jun 2014 13:31:58 GMT
On 30.06.2014 14:58, Armin.Wegner@bka.bund.de wrote:
> Hi, Peter!
>
> I got that. I restricted MARKFAST to segments. It works nearly
> perfectly. How does MARKFAST match things? Using
>
> Document{-> MARKFAST(MyType, {"a", "b", "a b"})};

hehe... I didn't even remember that this is possible. I will open an
issue for string lists.

The normal application of MARKFAST is with word lists:

WORDLIST MyList = 'somelist.txt';
Document{-> MARKFAST(MyType, MyList)};

... where the file somelist.txt contains something like:

a
b
a b

Files with the extensions "twl" and "mtwl" are for compiled dictionaries.

Just to mention:
Using characters in the word list that are filtered when the dictionary
lookup is applied may cause unexpected behavior, because the algorithm
may choose the wrong subtree. This has happened only once in our
applications so far.
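
To illustrate the tree-based lookup mentioned above, here is a minimal Python sketch. It is a simplified model, not Ruta's actual implementation. It assumes that only the longest entry is kept at each start token, which is one reading consistent with the result reported in the quoted question below ("a b" and "b", but not "a"). The names build_trie, longest_matches and all_matches are hypothetical.

```python
# Simplified model of a tree-based dictionary lookup (NOT Ruta's actual code).
# Assumption: at each start token, only the longest matching entry is kept.

END = "<end>"  # marker key: a dictionary entry ends at this trie node


def build_trie(entries):
    """Build a nested-dict trie keyed by token."""
    root = {}
    for entry in entries:
        node = root
        for token in entry.split():
            node = node.setdefault(token, {})
        node[END] = True
    return root


def longest_matches(tokens, trie):
    """Return (start, end) token spans, one longest match per start token."""
    spans = []
    for start in range(len(tokens)):
        node, best = trie, None
        for pos in range(start, len(tokens)):
            node = node.get(tokens[pos])
            if node is None:
                break
            if END in node:
                best = pos + 1  # remember the longest entry seen so far
        if best is not None:
            spans.append((start, best))
    return spans


def all_matches(tokens, trie):
    """Return every (start, end) span that equals some dictionary entry."""
    spans = []
    for start in range(len(tokens)):
        node = trie
        for pos in range(start, len(tokens)):
            node = node.get(tokens[pos])
            if node is None:
                break
            if END in node:
                spans.append((start, pos + 1))
    return spans


tokens = "a b".split()
trie = build_trie(["a", "b", "a b"])
found = [" ".join(tokens[s:e]) for s, e in longest_matches(tokens, trie)]
every = [" ".join(tokens[s:e]) for s, e in all_matches(tokens, trie)]
print(found)  # ['a b', 'b']
print(every)  # ['a', 'a b', 'b']
```

The second function only shows what reporting shorter entries as well would look like in this model. Whether MARKFAST offers such a mode is not stated in this thread.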

Best,

Peter



>
> on
>
> a b
>
> yields
>
> "a b" and "b" but not "a".
>
> I would like to have "a" as well. Can this be done?
>
> By the way: I love Ruta.apply(). That is exactly what I needed.
>
> Thanks,
> Armin
> 
>
> -----Original Message-----
> From: Peter Klügl [mailto:pkluegl@uni-wuerzburg.de]
> Sent: Monday, 30 June 2014 12:51
> To: user@uima.apache.org
> Subject: Re: Ruta - MARKFAST
>
> Hi,
>
> On 30.06.2014 11:32, Armin.Wegner@bka.bund.de wrote:
>> Hello!
>>
>> On which annotation type does MARKFAST work?
>
> It is applied to the annotations matched by the rule element that
> contains the action.
>
> Document{-> MARKFAST(...)};
> ... causes a dictionary lookup on the complete document.
>
> Sentence{CONTAINS(...) -> MARKFAST(...)};
> ... causes a separate dictionary lookup on each of the matched
> sentences (so no annotation spans more than one sentence).
>
>
>> Can I restrict MARKFAST to a single annotation type, say my own token
>> type?
>
> No, but there is an issue that includes this functionality.
>
> UIMA-3775: Fast multi token dictionary matching on feature values
>
> The idea is to apply the dictionary lookup to sequences of feature
> values (e.g., lemmas). If the feature represents the covered text, then
> this would also support your use case. The issue is not top priority
> right now, but if you want, I can try to include it in the next
> release (August).
>
>> It would be nice to restrict a Ruta script to a set of annotations by
>> giving that set of annotations explicitly, like
>>
>> Document{-> INPUT(Token, Organization, Location)};
>
> UIMA Ruta follows a different strategy than, e.g., JAPE and its
> input specification. The availability and visibility of annotations is
> not type-based but coverage-based. This makes complex patterns easy to
> specify, but it sometimes complicates things. If one type is set to
> invisible (FILTERTYPE), then all annotations of this type and all
> annotations of other types that they cover are invisible.
>
> The MARKFAST action operates on the RutaStream, so its lookup is
> sensitive to the filter settings. For example, with the default
> settings the lookup ignores whitespace, breaks and markup. By extending
> the set of filtered types, you can also change the behavior of the
> dictionary lookup. However, mind that annotations covered by one of the
> filtered types are also not accessible to the dictionary.
>
>>
>> All other annotations should be ignored. Is there a way to do this in
>> Ruta? Can this be done with FILTERTYPE and RETAINTYPE? How?
>
> Yes, but it depends on the actual occurrences of types in your document.
> The easiest way is to filter the types of the annotations that cover
> the positions that should be skipped. It's not easy to give a generic
> solution for this.
>
> An example:
> Your tokenizer creates annotations for words and numbers, but not for
> punctuation marks, and you want to apply the dictionary lookup only to
> sequences of token annotations, skipping punctuation marks.
>
> Document{-> FILTERTYPE(PM)};
> Document{-> MARKFAST(...)};
>
>
> There are plans to extend and modify the concept of accessibility and
> visibility in UIMA Ruta sometime (>= 3.0.0). Any wishes and opinions are
> welcome :-)
>
>
>
> Best,
>
> Peter
>
>
>>
>>
>> Cheers,
>> Armin
>>
>
>
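
The effect of the filter settings on the lookup, as described in the quoted message above (FILTERTYPE(PM) followed by MARKFAST), can be sketched with a small Python model. This is only an illustration under stated assumptions, not Ruta's RutaStream: each token carries exactly one type label, and tokens of a filtered type are simply invisible to the lookup. The labels W, SPACE, BREAK and MARKUP and the helper names are assumptions made for this sketch; PM is the type used in the thread.

```python
# Simplified model of a filter-sensitive lookup (NOT Ruta's RutaStream).
# Assumption: every token has one type label; filtered types are invisible.


def visible_tokens(tokens, filtered):
    """Keep only the tokens whose type is not in the filtered set."""
    return [(text, kind) for text, kind in tokens if kind not in filtered]


def contains_entry(tokens, filtered, entry):
    """True if `entry` occurs as a contiguous run of the visible tokens."""
    words = [text for text, _ in visible_tokens(tokens, filtered)]
    target = entry.split()
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )


# The text "a, b" as (text, type) pairs; the type labels are assumed.
tokens = [("a", "W"), (",", "PM"), (" ", "SPACE"), ("b", "W")]

# Assumed default: whitespace, breaks and markup are filtered.
default_filter = {"SPACE", "BREAK", "MARKUP"}

without_pm = contains_entry(tokens, default_filter, "a b")
with_pm = contains_entry(tokens, default_filter | {"PM"}, "a b")
print(without_pm)  # False: the comma is visible and separates "a" and "b"
print(with_pm)     # True: with PM filtered, "a" and "b" are adjacent
```

In this model, adding PM to the filtered set makes the entry "a b" match across the comma, which mirrors the example of skipping punctuation marks during the dictionary lookup.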


