uima-user mailing list archives

From Peter Klügl <pklu...@uni-wuerzburg.de>
Subject Re: [ruta] How to efficiently delete an annotation only if it appears within the N first token of a document?
Date Wed, 28 Aug 2013 16:19:19 GMT
On 28.08.2013 18:17, Alexandre Patry wrote:
> On 2013-08-28 11:25, Peter Klügl wrote:
>> On 28.08.2013 16:52, Alexandre Patry wrote:
>>> Hi,
>>>
>>> I use RUTA and I want to delete an annotation if it is within the
>>> first 50 tokens of a document. I came up with the following rules :
>>>
>>>     // Annotate the first token in the document
>>>     ANY{POSITION(Document, 1)-> Header};
>>>     // Append the 49 following tokens
>>>     Header{->SHIFT(Header, 1, 2)} ANY[0,49];
>>>     // Delete the first ToDelete if it is within the header
>>>     ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)};
>>>
>>>
>>> These rules work as expected but they are *really* slow. Is there a
>>> faster way to achieve that?
>>>
>> Oh yes, the first rule is really slow. I always miss an action MARKFIRST
>> (as there is a MARKLAST). I will add it today or tomorrow.
>>
>> There are two reasons why the first rule is slow:
>> ANY has to look at all tokens and POSITION is just the slowest condition
>> in Ruta.
>> For now, you could use a rule like:
>> ANY{STARTSWITH(Document)-> Header};
>> ... which avoids at least the POSITION condition.
>>
>> A simple test with a 200 W document:
>>
>> ...
>> ANY{POSITION(Document, 1)-> Header}; // [0.274s|93.52%]
>> Header{->SHIFT(Header, 1, 2)} ANY[0,49];  // [0.090s|3.07%]
>> ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // [0.030s|1.02%]
>>
>> ...
>> ANY{STARTSWITH(Document)-> Header};  // [0.047s|50.00%]
>> Header{->SHIFT(Header, 1, 2)} ANY[0,49];  // [0.029s|30.85%]
>> ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // [0.011s|11.7%]
>>
>> well, that's still slow (in debug mode) and I actually wonder why the
>> other rules are getting faster... but I hope that the performance will
>> soon be improved :-)
> Just tried it and it is much better, thanks!
>
> Many of my documents start with a space, so I had to update the rules to:
>
>    Document{-> ADDRETAINTYPE(SPACE, BREAK)};
>    ANY{STARTSWITH(Document) -> Header};
>    // if the first token is a space, use the first non-space following it
>    Header{IS({SPACE, BREAK}) -> UNMARK(Header)} ANY*?
>    ANY{-PARTOF({SPACE, BREAK}) -> MARK(Header)};
>    Document{-> REMOVERETAINTYPE(SPACE, BREAK)};
>
>    Header{->SHIFT(Header, 1, 2)} ANY[0,49];
>    ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)};
>
> I will be happy to test drive MARKFIRST once it is in trunk.
>

It's already in the trunk. If you want, I can also think of
something that avoids the visibility problem.
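
For reference, a MARKFIRST-based variant of the rules might look roughly
like the sketch below. It assumes that MARKFIRST takes a single type
argument, mirroring MARKLAST, and that Header and ToDelete are declared
elsewhere in the script. It has not been timed, and whether it handles
documents that start with whitespace is left open.

```
// Sketch only: assumes MARKFIRST(Type) marks the first token of the
// matched annotation, analogous to MARKLAST.
// Header and ToDelete are assumed to be declared elsewhere.

// Annotate the first token of the document without iterating over ANY
Document{-> MARKFIRST(Header)};
// Extend Header over the 49 following tokens (unchanged from the thread)
Header{->SHIFT(Header, 1, 2)} ANY[0,49];
// Delete the first ToDelete if it is within the header (unchanged)
ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)};
```

The point of the change is only to replace the first rule, which matched
every token via ANY, with a single match on Document.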

Best,

Peter


> Alexandre
>

