uima-user mailing list archives

From <Armin.Weg...@bka.bund.de>
Subject Re: Ruta 2.4.0 - High memory needs
Date Wed, 10 Aug 2016 13:35:17 GMT
Hi Peter,

I will give it a try and report back in a few days.

Thanks a lot,

-----Original Message-----
From: Peter Klügl [mailto:peter.kluegl@averbis.com]
Sent: Wednesday, 10 August 2016 14:50
To: user@uima.apache.org
Subject: Re: Ruta 2.4.0 - High memory needs


18 MB of text in a CAS, well, that's quite a big sofa.

Yes, there are some tricks and best practices.

First of all, there is the configuration parameter "lowMemoryProfile",
which reduces the information stored in RutaBasic. It should reduce the
memory usage considerably, but the processing will take longer,
especially if the type hierarchy is rather deep. The unit tests for it
do not cover all functionality of Ruta. I only run all unit tests with
this option once in a while, and I haven't done this for some time.
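
A minimal sketch of switching the parameter on through uimaFIT, assuming
the descriptor is loaded by name as in the original question. The
parameter name "lowMemoryProfile" is the one mentioned above; the
descriptor name is the placeholder from the question, the class name
LowMemoryRutaEngine is made up for illustration, and uimaFIT plus
ruta-core are assumed to be on the classpath.

```java
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;

public class LowMemoryRutaEngine {
    public static void main(String[] args) throws Exception {
        // Load the generated descriptor by name and override one
        // configuration parameter (name/value pairs after the name).
        AnalysisEngineDescription desc =
                AnalysisEngineFactory.createEngineDescription(
                        "fullyQualifiedDescriptorNameWithoutXmlExtension",
                        "lowMemoryProfile", true);

        // Instantiate the engine from the modified description.
        AnalysisEngine engine = AnalysisEngineFactory.createEngine(desc);
        System.out.println(engine.getConfigParameterValue("lowMemoryProfile"));
    }
}
```

The override only takes effect if the descriptor declares the
parameter, so it is worth checking the printed value.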


The second thing to do in order to reduce the memory usage is to
minimize the annotations and especially the RutaBasic annotations. These
are automatically created and build up a minimal, atomic partitioning of
the document. This means that you should create only annotations as
small as you need them, and only annotations where you need them. The
first option here is to remove/replace the seeder if you do not rely on
these annotations (ANY, CW, NUM, PERIOD, ...), or replace it with a
tokenizer if you did not include one anyway. This will rid you of
the annotations for whitespace and so on, and of the corresponding
RutaBasic annotations. Maybe you also do not need annotations in
every section (e.g., restrict the matching window). Optimization
strongly depends on the use case and the actual rules.
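
To illustrate why fewer and coarser annotations mean fewer RutaBasic
annotations, here is a toy model in plain Java. It is not Ruta's actual
implementation: it simply assumes that one atomic segment exists for
every stretch between two neighbouring annotation boundaries that is
covered by at least one annotation. The class AtomicPartition and its
method atoms are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class AtomicPartition {

    /** Returns the atomic [begin, end] segments induced by the given spans. */
    public static List<int[]> atoms(int[][] spans) {
        // Every annotation begin and end is a potential cut point.
        TreeSet<Integer> boundaries = new TreeSet<>();
        for (int[] s : spans) {
            boundaries.add(s[0]);
            boundaries.add(s[1]);
        }
        List<int[]> result = new ArrayList<>();
        Integer prev = null;
        for (int b : boundaries) {
            // Keep a segment only if some annotation covers it (toy assumption).
            if (prev != null && isCovered(prev, b, spans)) {
                result.add(new int[] {prev, b});
            }
            prev = b;
        }
        return result;
    }

    private static boolean isCovered(int begin, int end, int[][] spans) {
        for (int[] s : spans) {
            if (s[0] <= begin && end <= s[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Text "Peter Kluegl wrote": words at 0-5, 6-12, 13-18, spaces between.
        int[][] withWhitespace = {{0, 5}, {5, 6}, {6, 12}, {12, 13}, {13, 18}};
        int[][] tokensOnly = {{0, 5}, {6, 12}, {13, 18}};
        System.out.println(atoms(withWhitespace).size()); // prints 5
        System.out.println(atoms(tokensOnly).size());     // prints 3
    }
}
```

In this model, dropping the whitespace annotations removes two of the
five atomic segments for a three-word text; on an 18 MB document the
same effect applies to every gap between tokens.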

Please keep in mind that text spans without any annotations are treated
as invisible for sequential matching.

Btw, the speed of your rules can be improved, especially with the
upcoming 2.5.0 release. Besides that, PARTOFNEQ is one of the slowest
conditions in Ruta. I'd rather recommend something like:

Full->{ANY @Full{-> UNMARK(Full)}; Full{-> UNMARK(Full)} ANY;};



On 09.08.2016 at 12:37, Armin.Wegner@bka.bund.de wrote:
> Hello again!
> One down, one to go. Are there best practices or tricks to reduce Ruta's memory needs?
> I tried to use the following script to merge names.
> Document{->GREEDYANCHORING(true)};
> First+ Full {->MARK(Full)};
> Full Last+ {->MARK(Full)};
> First+ Last+ {->MARK(Full)};
> Document{->GREEDYANCHORING(false)};
> Full{PARTOFNEQ(Full) -> UNMARK(Full)};
> First{PARTOF(Full) -> UNMARK(First)};
> Last{PARTOF(Full) -> UNMARK(Last)};
> The engine description is created by ruta-maven-plugin:2.4.0 and used with uimaFIT's AnalysisEngineFactory.createEngineDescription("fullyQualifiedDescriptorNameWithoutXmlExtension").
> For an 18 MB text, it needs gigabytes of RAM.
> Cheers,
> Armin
