uima-user mailing list archives

From Mario Gazzo <mario.ga...@gmail.com>
Subject Re: Very long Ruta stream initialization
Date Mon, 04 Jan 2016 15:13:04 GMT
Hi Peter,

No problem, I was pretty much offline myself over the Christmas holidays anyway.

The term “overhead” is probably an exaggeration in this context, especially now that I have
disabled the MARKUP initialisation. We had earlier implemented our own XML markup annotator,
tailored to our needs with additional annotation types and properties, so the Ruta MARKUP is
currently not used. It just happens that we don’t directly use RutaBasic in any of our rules
in this particular case, so I was curious to know whether we could avoid creating them in the
first place, since there seem to be quite a few. However, the overall processing required by our
Ruta scripts is now small compared to other processing steps, and optimising this further
by making RutaBasic optional would currently be of very low priority to us. We would prioritise
other features higher, e.g. being able to assign annotations to variables, as we discussed previously
in another thread.

We haven’t processed documents as large as those you mention, since books have so far been
divided into chapters and processing could therefore be parallelised accordingly. We also
set aside extreme outliers above a certain size if we encounter them and batch process
them later in smaller chunks, but this has so far not been necessary with our current data
sets. As in your case, our processing bottlenecks are now in other components.


> On 30 Dec 2015, at 16:44 , Peter Klügl <peter.kluegl@averbis.com> wrote:
> Hi,
> sorry for the delayed reply.
> RutaEngine::initializeStream:
> The special treatment of MARKUPs that causes the increased time required for initialization
is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but
failed. It shouldn't be hard to improve this code... I will create an issue for it. When
I did the last performance optimization, uima did not check the indexes yet and my test set
did not contain markups.
> Deactivate creation of RutaBasic:
> The short answer is no. I was already thinking about making RutaBasic optional in the future
so that the user can configure if they are used. However, right now, they are required for
rule inference and make the rule inference "fast" in the first place. RutaBasic is just an
internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should
not match on them at all.
> Some background information:
> RutaBasics are used for three things:
> - store additional information in order to avoid index operations. Some useful conditions
would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a
cache of which annotations start and end at which position, and which positions are covered by
which types.
> - provide a container to make this information available across analysis engines. Information
shared between analysis engines is normally stored in the CAS, e.g. in annotations, (or in external
resources). This is the role of RutaBasic. It is not really implemented right now as it should
be but I will improve it soon. Then, there will be no performance decrease when a pipeline is spammed
with small ruta engines.
> - a basic minimal disjoint partitioning of the document for the coverage-based visibility
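To make the caching idea in the list above concrete, here is a minimal sketch in plain Java. It is not Ruta's actual implementation: the class and method names (BasicPartitionSketch, Anno, Basic, build, isPartOf) are hypothetical and chosen for illustration only. The sketch cuts the document at every annotation boundary, which yields a minimal disjoint partitioning, and stores on each segment which types begin there, end there, or cover it. A PARTOF-style check then becomes a lookup on the segments instead of a series of index operations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class BasicPartitionSketch {

    /** Hypothetical annotation: a type name plus [begin, end) offsets. */
    public static final class Anno {
        public final String type;
        public final int begin;
        public final int end;
        public Anno(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for one RutaBasic-like segment with cached type info. */
    public static final class Basic {
        public final int begin;
        public final int end;
        public final Set<String> beginning = new HashSet<>(); // types starting at this segment
        public final Set<String> ending = new HashSet<>();    // types ending at this segment
        public final Set<String> partOf = new HashSet<>();    // types covering this segment
        Basic(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Cuts the document at every annotation boundary and fills the per-segment caches. */
    public static List<Basic> build(int docLength, List<Anno> annos) {
        TreeSet<Integer> bounds = new TreeSet<>(Arrays.asList(0, docLength));
        for (Anno a : annos) {
            bounds.add(a.begin);
            bounds.add(a.end);
        }
        List<Basic> basics = new ArrayList<>();
        Integer prev = null;
        for (int b : bounds) {
            if (prev != null) {
                basics.add(new Basic(prev, b));
            }
            prev = b;
        }
        for (Anno a : annos) {
            for (Basic s : basics) {
                if (s.begin >= a.begin && s.end <= a.end) {
                    s.partOf.add(a.type);
                    if (s.begin == a.begin) s.beginning.add(a.type);
                    if (s.end == a.end) s.ending.add(a.type);
                }
            }
        }
        return basics;
    }

    /** True if every segment inside [begin, end) is covered by the given type. */
    public static boolean isPartOf(List<Basic> basics, int begin, int end, String type) {
        boolean any = false;
        for (Basic s : basics) {
            if (s.begin >= begin && s.end <= end) {
                if (!s.partOf.contains(type)) return false;
                any = true;
            }
        }
        return any;
    }

    /** Small made-up example: one sentence, two tokens, one bold span. */
    public static List<Anno> demo() {
        return Arrays.asList(
            new Anno("Sentence", 0, 20),
            new Anno("Token", 0, 5),
            new Anno("Token", 6, 11),
            new Anno("Bold", 6, 20));
    }

    public static void main(String[] args) {
        List<Basic> basics = build(20, demo());
        System.out.println("segments: " + basics.size());
        System.out.println("second token part of Bold: " + isPartOf(basics, 6, 11, "Bold"));
        System.out.println("first token part of Bold: " + isPartOf(basics, 0, 5, "Bold"));
    }
}
```

The trade-off discussed in this thread is visible here: the segments and their sets are built up front for the whole document, which costs time and memory at initialization, in exchange for cheap lookups during rule matching.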
> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order
to reduce the memory footprint or when processing large documents where parts are simply not
interesting, then I will put it on my TODO list. I am also open to other/new ideas on how to
solve the challenges (and for incremental usage of internal caches).
> What is your experience with the processing overhead concerning RutaBasic? Is it the
rule matching or rather the initialization? I have already had some performance problems myself
with the initialization and memory consumption in large CASes (PDFs of 500+ pages). However, other
components, serialization and the CAS editor were the actual bottlenecks.
> Best,
> Peter
> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>> I got around it by removing the default seeders by specifying an empty seeders list
since we don’t need the MARKUP annotations anymore.
>> I still don’t know why it created so much overhead but it sometimes seemed to rival
the POS tagger in processing time.
>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic
annotations entirely to save processing overhead and only apply Ruta rules to other annotation
types created by other AEs such as our own?
>> Cheers
>> Mario
>>> On 21 Dec 2015, at 16:09 , Mario Juric <mario.juric.dk@gmail.com> wrote:
>>> Hi Peter,
>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream
can take a very long time. I can’t really explain it, and it seems independent of document
length since I have seen this even with very small XML documents.
>>> The method seems to spend much time in the DefaultSeeder when creating MARKUP
annotations during subiterator.moveToNext calls (line 89) and inside Subiterator it seems
to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes.
I haven’t gone into any deeper analysis yet, but I would first like to hear whether you have an
idea what could be the main cause(s) for this?
>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>> Cheers
>>> Mario
