uima-user mailing list archives

From <Armin.Weg...@bka.bund.de>
Subject AW: Ruta 2.4.0 - High memory needs
Date Thu, 18 Aug 2016 12:17:11 GMT
Hello Peter!

Please correct me if I'm wrong. My understanding of how Ruta works is as follows. 

1. The RutaBasic annotations are always created. RETAINTYPE and FILTERTYPE have no influence
on annotation creation; they only influence how those types are used in rules.

2. The configuration parameter "seeders" only adds additional seeders. It cannot be used to
remove the default seeder.

So how do I tell Ruta not to use the default seeder? How do I tell Ruta to use my own seeder?
Do I have to replace org.apache.uima.ruta.seed.DefaultSeeder.java? Won't this break Ruta?


-----Original Message-----
From: Peter Klügl [mailto:peter.kluegl@averbis.com]
Sent: Wednesday, 10 August 2016 14:50
To: user@uima.apache.org
Subject: Re: Ruta 2.4.0 - High memory needs


18 MB of text in a CAS, well, that's quite a big sofa.

Yes, there are some tricks and best practices.

First of all, there is the configuration parameter "lowMemoryProfile",
which reduces the information stored in RutaBasic. It should reduce the
memory usage considerably, but the processing will take longer,
especially if the type hierarchy is rather deep. The unit tests for it
do not cover all functionality of Ruta. I only run all unit tests with
this option once in a while, and I haven't done this for some time.


The second thing to do in order to reduce the memory usage is to
minimize the annotations and especially the RutaBasic annotations. These
are automatically created and build up a minimal, atomic partitioning of
the document. This means that you should create only annotations as
small as you need them, and only annotations where you need them. The
first option here is to remove/replace the seeder if you do not rely on
these annotations (ANY, CW, NUM, PERIOD, ...), or replace it with a
tokenizer if you did not include one anyway. This will rid you of
the annotations for whitespace and so on, and of the corresponding
RutaBasic annotations. Maybe you also do not need any kind of annotation
for each section (e.g., restrict the matching window). Optimization
strongly depends on the use case and the actual rules.

Please mind that text spans without any annotations will be considered
invisible concerning sequential matching.
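
To illustrate why fewer and coarser annotations mean fewer RutaBasic annotations, here is a minimal, self-contained sketch of an atomic partitioning. It is not Ruta's actual implementation; the names AtomicPartitionSketch, Span and partition are hypothetical and only serve the illustration. It cuts the text at every annotation boundary and keeps only the pieces covered by at least one annotation, so uncovered spans get no atom at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a simplified model of an atomic partitioning,
// NOT the code Ruta uses to create RutaBasic annotations.
class AtomicPartitionSketch {

    /** Half-open character span [begin, end). Hypothetical helper type. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    /** Cuts at every annotation boundary; keeps only pieces covered by an annotation. */
    static List<Span> partition(List<Span> annotations) {
        TreeSet<Integer> boundaries = new TreeSet<>();
        for (Span a : annotations) {
            boundaries.add(a.begin);
            boundaries.add(a.end);
        }
        List<Span> atoms = new ArrayList<>();
        Integer previous = null;
        for (int boundary : boundaries) {
            if (previous != null && isCovered(previous, boundary, annotations)) {
                atoms.add(new Span(previous, boundary));
            }
            previous = boundary;
        }
        return atoms;
    }

    /** True if some annotation fully contains [begin, end). */
    static boolean isCovered(int begin, int end, List<Span> annotations) {
        for (Span a : annotations) {
            if (a.begin <= begin && end <= a.end) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two tokens with an unannotated gap at [4,5) between them.
        List<Span> tokensOnly = Arrays.asList(new Span(0, 4), new Span(5, 10));
        // The same tokens plus one annotation spanning both of them.
        List<Span> withCover = Arrays.asList(new Span(0, 4), new Span(5, 10), new Span(0, 10));

        System.out.println(partition(tokensOnly)); // [[0,4), [5,10)]
        System.out.println(partition(withCover));  // [[0,4), [4,5), [5,10)]
    }
}
```

In this model the gap at [4,5) produces an atom only once some annotation covers it, which mirrors both points above: dropping whitespace annotations saves atoms, and text without any annotation has nothing to match against.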

btw, the speed of your rules can be improved, especially with the
upcoming 2.5.0 release. Besides that, PARTOFNEQ is one of the slowest
conditions in Ruta. I'd rather recommend something like:

Full->{ANY @Full{-> UNMARK(Full)}; Full{-> UNMARK(Full)} ANY;};



On 09.08.2016 at 12:37, Armin.Wegner@bka.bund.de wrote:
> Hello again!
> One down, one to go. Are there best practices or tricks to reduce Ruta's memory needs?
> I tried to use the following script to merge names.
> Document{->GREEDYANCHORING(true)};
> First+ Full {->MARK(Full)};
> Full Last+ {->MARK(Full)};
> First+ Last+ {->MARK(Full)};
> Document{->GREEDYANCHORING(false)};
> Full{PARTOFNEQ(Full) -> UNMARK(Full)};
> First{PARTOF(Full) -> UNMARK(First)};
> Last{PARTOF(Full) -> UNMARK(Last)};
> The engine description is created by ruta-maven-plugin:2.4.0 and used with uimaFIT's AnalysisEngineFactory.createEngineDescription("fullyQualifiedDescriptorNameWithoutXmlExtension").
> For an 18 MB text, it needs gigabytes of RAM.
> Cheers,
> Armin
