uima-user mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: Very long Ruta stream initialization
Date Wed, 30 Dec 2015 15:44:14 GMT
Hi,

sorry for the delayed reply.

RutaEngine::initializeStream:

The special treatment of MARKUPs that causes the increased initialization 
time is just a workaround because I was too lazy to write a working jflex 
rule. Well, I tried but failed. It shouldn't be hard to improve this 
code... I will create an issue for it. When I did the last performance 
optimization, UIMA did not check the indexes yet and my test set did not 
contain markups.

Deactivate creation of RutaBasic:
The short answer is no. I was already thinking about making RutaBasic 
optional in the future so that the user can configure whether they are 
used. However, right now, they are required for rule inference and are 
what makes the rule inference "fast" in the first place. RutaBasic is 
just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) 
and RutaFrame, and rules should not match on them at all.

Some background information:

RutaBasics are used for three things:
- store additional information in order to avoid index operations. Some 
useful conditions would otherwise require many index operations, e.g., 
PARTOF or ENDSWITH. RutaBasic serves as a cache of which annotations 
start and end at which position, and which positions are covered by 
which types.
- provide a container to make this information available across analysis 
engines. Information shared between analysis engines is normally stored 
in the CAS, e.g., in annotations (or in external resources). This is the 
role of RutaBasic. It is not really implemented right now as it should 
be, but I will improve it soon. Then there will be no performance 
decrease when a pipeline contains many small Ruta engines.
- a basic minimal disjoint partitioning of the document for the 
coverage-based visibility concept.
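
To illustrate the first and third point, here is a minimal sketch of the 
idea in plain Java. It is not the actual Ruta implementation: the classes 
Span and Basic and the methods partition and isPartOf are hypothetical 
names made up for this example, and the real RutaBasic lives in the CAS 
and stores more than this.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class BasicPartitionSketch {

    /** Hypothetical stand-in for an annotation: type name plus [begin, end) offsets. */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for a RutaBasic-like segment with cached type information. */
    static final class Basic {
        final int begin;
        final int end;
        final Set<String> beginning = new HashSet<>(); // types that start at this segment
        final Set<String> ending = new HashSet<>();    // types that end at this segment
        final Set<String> partOf = new HashSet<>();    // types that cover this segment
        Basic(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Splits [0, docLength) at every annotation boundary and fills the caches once. */
    static List<Basic> partition(int docLength, List<Span> spans) {
        TreeSet<Integer> bounds = new TreeSet<>();
        bounds.add(0);
        bounds.add(docLength);
        for (Span s : spans) {
            bounds.add(s.begin);
            bounds.add(s.end);
        }
        List<Basic> basics = new ArrayList<>();
        Integer prev = null;
        for (Integer b : bounds) {
            if (prev != null) {
                basics.add(new Basic(prev, b));
            }
            prev = b;
        }
        // Naive quadratic fill, kept simple for readability.
        for (Basic basic : basics) {
            for (Span s : spans) {
                if (s.begin <= basic.begin && basic.end <= s.end) {
                    basic.partOf.add(s.type);
                    if (s.begin == basic.begin) basic.beginning.add(s.type);
                    if (s.end == basic.end) basic.ending.add(s.type);
                }
            }
        }
        return basics;
    }

    /** PARTOF-style check answered from the cache instead of an index lookup. */
    static boolean isPartOf(Basic basic, String type) {
        return basic.partOf.contains(type);
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("Sentence", 0, 10));
        spans.add(new Span("Token", 0, 4));
        spans.add(new Span("Token", 5, 10));
        for (Basic b : partition(12, spans)) {
            System.out.println(b.begin + "-" + b.end + " partOf=" + b.partOf);
        }
    }
}
```

Once the segments are built, a condition like PARTOF becomes a set lookup 
on a segment. The price is the up-front initialization work and the 
memory for the segments.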

Making RutaBasic optional is possible. If there is a real need for it, 
e.g., in order to reduce the memory footprint or when processing large 
documents where parts are simply not interesting, then I will put it on 
my TODO list. I am also open to other/new ideas on how to solve the 
challenges (and to incremental usage of internal caches).

What is your experience with the processing overhead concerning 
RutaBasic? Is it the rule matching or rather the initialization? I 
have already had some performance problems myself with the 
initialization and memory consumption in large CASes (PDFs with 500+ 
pages). However, other components, serialization and the CAS editor 
were the actual bottlenecks.

Best,

Peter


On 22.12.2015 at 17:26, Mario Gazzo wrote:
> I got around it by removing the default seeders by specifying an empty
> seeders list since we don’t need the MARKUP annotations anymore.
>
> I still don’t know why it created so much overhead but it sometimes
> seemed to rival the POS tagger in processing time.
>
> Anyway, this leads me to the next question. Can I disable the creation
> of Ruta basic annotations entirely to save processing overhead and only
> apply Ruta rules to other annotation types created by other AEs such as
> our own?
>
> Cheers
> Mario
>
>> On 21 Dec 2015, at 16:09 , Mario Juric <mario.juric.dk@gmail.com> wrote:
>>
>> Hi Peter,
>>
>> I noticed that occasionally the initialisation in
>> RutaEngine::initializeStream can take a very long time. I can’t really
>> explain it and it seems independent of document length since I have
>> seen this with even very small XML documents.
>>
>> The method seems to spend much time in the DefaultSeeder when creating
>> MARKUP annotations during subiterator.moveToNext calls (line 89) and
>> inside Subiterator it seems to be the while loop inside
>> adjustForStrictForward (line 232), which is inside UIMA core classes.
>> I haven’t gone into any deeper analysis yet but I would first like to
>> hear whether you have an idea what could be the main cause(s) for this?
>>
>> We use Ruta 2.3.1 with UIMA 2.8.1
>>
>>
>> Cheers
>> Mario

