uima-user mailing list archives

From <Armin.Weg...@bka.bund.de>
Subject AW: Ruta - MARKFAST
Date Mon, 30 Jun 2014 12:58:38 GMT
Hi, Peter!

Got it. I restricted MARKFAST to segments, and it works almost perfectly. How does MARKFAST
match things? Using

Document{-> MARKFAST(MyType, {"a", "b", "a b"})};


on the text

a b

I get matches for "a b" and "b", but not for "a".

I would like to have "a" as well. Can this be done?

By the way: I love Ruta.apply(). That is exactly what I needed.


-----Original Message-----
From: Peter Klügl [mailto:pkluegl@uni-wuerzburg.de]
Sent: Monday, 30 June 2014 12:51
To: user@uima.apache.org
Subject: Re: Ruta - MARKFAST


On 30.06.2014 11:32, Armin.Wegner@bka.bund.de wrote:
> Hello!
> On which annotation type does MARKFAST work?

It is applied to the annotations matched by the rule element that contains the action.

Document{-> MARKFAST(...)};
... causes a dictionary lookup on the complete document.

Sentence{CONTAINS(...) -> MARKFAST(...)}; ... causes a separate dictionary lookup on each
of the matched sentences (i.e., no annotations that span sentence boundaries).
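
To make the difference concrete, here is a minimal sketch of both variants. The names City, Cities and cities.txt are made up for illustration, and Sentence is assumed to be a type that is already available in the script (for example, from an imported type system); none of these come from your setup.

```ruta
// Hypothetical names: City, Cities, cities.txt.
// Sentence is assumed to be declared or imported elsewhere.
WORDLIST Cities = 'cities.txt';
DECLARE City;

// Variant 1: a single lookup over the complete document.
Document{-> MARKFAST(City, Cities)};

// Variant 2: a separate lookup within each Sentence annotation,
// so no City annotation can cross a sentence boundary.
Sentence{-> MARKFAST(City, Cities)};
```

The two rules are alternatives; running both in the same script would annotate most entries twice.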

> Can I restrict MARKFAST to a single annotation Type, say my own token type?

No, but there is an issue that includes this functionality.

UIMA-3775: Fast multi token dictionary matching on feature values

The idea is to apply the dictionary lookup to sequences of feature values (e.g., lemmas). If
the feature represents the covered text, then this would also support your use case. The issue
is not top priority right now, but if you want, I can try to include it in the next release.

> It would be nice to restrict a Ruta script to a set of annotations by
> giving that set of annotations explicitly, like
> Document{-> INPUT(Token, Organization, Location)};

UIMA Ruta follows a different strategy than, e.g., JAPE with its input specification.
The availability and visibility of annotations is coverage-based rather than type-based. This
makes complex patterns easy to specify, but sometimes complicates things.
If a type is set to invisible (FILTERTYPE), then all annotations of this type and all
annotations of other types that they cover are invisible.

The MARKFAST action operates on the RutaStream, so its lookup is sensitive to the filter
settings. For example, with the default settings the lookup ignores whitespace, line breaks,
and markup. By extending the set of filtered types, you can also change the behavior of the
dictionary lookup. However, keep in mind that annotations covered by one of the filtered types
are also not accessible to the dictionary.

> All other annotations should be ignored. Is there a way to do this in
> Ruta? Can this be done with FILTERTYPE and RETAINTYPE? How?

Yes, but it depends on the actual occurrences of types in your document.
The easiest way is to filter the types of the annotations that cover the positions that should
be skipped. It's not easy to give a generic solution for this.

An example:
Your tokenizer creates annotations for words and numbers, but not for punctuation marks, and
you want to apply the dictionary lookup only to sequences of token annotations, skipping punctuation:

Document{-> FILTERTYPE(PM)};
Document{-> MARKFAST(...)};
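
A slightly fuller sketch of the same idea, filled in with the type and the inline string list from your example. It assumes that the additional filter should only be active during the lookup, so the last rule switches back to the default filter settings; whether that is what you want depends on the rest of your script.

```ruta
// MyType and the string list are taken from the example in this thread.
DECLARE MyType;

// Additionally hide punctuation marks (PM) from all following rules.
Document{-> FILTERTYPE(PM)};

// The lookup now also skips punctuation between tokens.
Document{-> MARKFAST(MyType, {"a", "b", "a b"})};

// FILTERTYPE without arguments: no additional types are filtered anymore.
Document{-> FILTERTYPE};
```

Keep in mind that anything covered by a PM annotation is invisible to the lookup while the filter is active.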

There are plans to extend and modify the concept of accessibility and visibility in UIMA Ruta
sometime (>= 3.0.0). Any wishes and opinions are welcome :-)



> Cheers,
> Armin
