uima-user mailing list archives

From Peter Klügl <pklu...@uni-wuerzburg.de>
Subject Re: Ruta - MARKFAST
Date Mon, 30 Jun 2014 13:40:26 GMT
On 30.06.2014 15:31, Peter Klügl wrote:
> On 30.06.2014 14:58, Armin.Wegner@bka.bund.de wrote:
>> Hi, Peter!
>> I got that. I restricted MARKFAST to segments. It works nearly
>> perfectly. How does MARKFAST match things? Using
>> Document{-> MARKFAST(MyType, { "a", "b", "a b" })};

Well, on second thought, it is clear: the matching process considers
the longest match. I don't think that returning all matches is currently
supported, but it should not be complicated to add the functionality.
You can open a feature request if you want.
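
To illustrate the effect, here is a minimal Python sketch of a
longest-match lookup. It is only a model of the behavior described
above, not the actual Ruta implementation; the function name
longest_matches and the whitespace tokenization are assumptions made
for the example. For each start position it keeps only the longest
entry, which is why "a" is never reported on its own:

```python
def longest_matches(tokens, entries):
    """Return (start, end) token spans, keeping only the longest
    dictionary entry that begins at each start position."""
    # Represent entries as token tuples, e.g. "a b" -> ("a", "b").
    entry_set = {tuple(e.split()) for e in entries}
    max_len = max((len(e) for e in entry_set), default=0)
    spans = []
    for start in range(len(tokens)):
        # Try the longest candidate first and stop at the first hit,
        # so shorter entries with the same start are skipped.
        for length in range(min(max_len, len(tokens) - start), 0, -1):
            if tuple(tokens[start:start + length]) in entry_set:
                spans.append((start, start + length))
                break
    return spans


tokens = "a b".split()
spans = longest_matches(tokens, ["a", "b", "a b"])
matched = [" ".join(tokens[s:e]) for s, e in spans]
print(matched)
```

On the input "a b" this model reports "a b" and "b", which agrees with
the result Armin observed.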


> hehe... I didn't even remember that this is possible. I will open an
> issue for string lists.
> The normal application of MARKFAST is with word lists:
> WORDLIST MyList = 'somelist.txt';
> Document{-> MARKFAST(MyType, MyList)};
> ... where the file somelist.txt contains something like:
> a
> b
> a b
> Files with endings "twl" and "mtwl" are for compiled dictionaries.
> Just to mention:
> Using characters in the word list that are filtered when the
> dictionary lookup is applied may cause unexpected behavior, because the
> algorithm may choose the wrong subtree. It has happened only once in
> our applications so far.
> Best,
> Peter
>> on
>> a b
>> yields
>> "a b" and "b" but not "a".
>> I would like to have "a" as well. Can this be done?
>> By the way: I love Ruta.apply(). That is exactly what I needed.
>> Thanks,
>> Armin
>> -----Original Message-----
>> From: Peter Klügl [mailto:pkluegl@uni-wuerzburg.de]
>> Sent: Monday, 30 June 2014 12:51
>> To: user@uima.apache.org
>> Subject: Re: Ruta - MARKFAST
>> Hi,
>> On 30.06.2014 11:32, Armin.Wegner@bka.bund.de wrote:
>>> Hello!
>>> On which annotation type does MARKFAST work?
>> It is applied to the annotations matched by the rule element that
>> contains the action.
>> Document{-> MARKFAST(...)};
>> ... causes a dictionary lookup on the complete document.
>> Sentence{CONTAINS(...) -> MARKFAST(...)};
>> ... causes a separate dictionary lookup on each of the matched
>> sentences (i.e., no inter-sentence annotations).
>>> Can I restrict MARKFAST to a single annotation type, say my own token
>>> type?
>> No, but there is an issue that includes this functionality.
>> UIMA-3775: Fast multi token dictionary matching on feature values
>> The idea is to apply the dictionary lookup to sequences of feature
>> values (e.g., lemmas). If the feature represents the covered text, then
>> this would also support your use case. The issue is not top priority
>> right now, but if you want, I can try to include it in the next
>> release (August).
>>> It would be nice to restrict a Ruta script to a set of annotations by
>>> giving that set of annotations explicitly, like
>>> Document{-> INPUT(Token, Organization, Location)};
>> UIMA Ruta follows a different strategy than, e.g., JAPE and its
>> input specification. The availability and visibility of annotations is
>> not type-based but coverage-based. This enables the easy specification
>> of complex patterns, but sometimes also complicates things. If one
>> type is set to invisible (FILTERTYPE), then all annotations of this type
>> and all covered annotations of other types are invisible.
>> The MARKFAST action operates on the RutaStream, so its lookup is
>> sensitive to the filtering settings. For example, the lookup ignores
>> whitespace, breaks and markup under the default settings. By extending
>> the set of filtered types, you can also change the behavior of the
>> dictionary lookup. However, mind that annotations covered by one of the
>> filtered types are also not accessible to the dictionary.
>>> All other annotations should be ignored. Is there a way to do this in
>>> Ruta? Can this be done with FILTERTYPE and RETAINTYPE? How?
>> Yes, but it depends on the actual occurrences of types in your document.
>> The easiest way is to filter the types of the annotations that cover
>> the positions that should be skipped. It's not easy to give a generic
>> solution for this.
>> An example:
>> Your tokenizer creates annotations for words and numbers, but not for
>> punctuation marks, and you want to apply the dictionary lookup only to
>> sequences of token annotations, skipping punctuation marks.
>> Document{-> FILTERTYPE(PM)};
>> Document{-> MARKFAST(...)};
>> There are plans to extend and modify the concept of accessibility and
>> visibility in UIMA Ruta at some point (>= 3.0.0). Any wishes and
>> opinions are welcome :-)
>> Best,
>> Peter
>>> Cheers,
>>> Armin
