uima-user mailing list archives

From <Armin.Weg...@bka.bund.de>
Subject Re: Ruta 2.4.0 - High memory needs
Date Thu, 18 Aug 2016 14:18:42 GMT
Hello Peter,

I found it thanks to your help. There was another Ruta script maliciously hiding in the pipeline
that set up test annotations and therefore used all of Ruta's defaults. I discovered it when
I used your code from the unit test, which, of course, works perfectly fine. I will create
Ruta annotators programmatically from now on, so that I have full control over all options.

I'm sorry and thanks a lot,

-----Original Message-----
From: Peter Klügl [mailto:peter.kluegl@averbis.com]
Sent: Thursday, 18 August 2016 15:11
To: user@uima.apache.org
Subject: Re: Ruta 2.4.0 - High memory needs

I found a bug (and fixed it), but it was not related to your problem.

I added a unit test where the seeder is removed:


Seems to work just fine. The problem must be located somewhere else.

Are you sure that the configuration parameter value is correct?

I'll write another unit test...



On 18.08.2016 at 14:38, Peter Klügl wrote:
> I'll check that (writing some unit tests right now)
> On 18.08.2016 at 14:36, Armin.Wegner@bka.bund.de wrote:
>> Hi Peter,
>> It doesn't work like that for me. I've removed DefaultSeeder and added my own seeder
>> implementing RutaAnnotationSeeder. Now I have all of Ruta's standard tokens plus my own
>> tokenization at the same time.
>> Cheers,
>> Armin
>> -----Original Message-----
>> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
>> Sent: Thursday, 18 August 2016 14:23
>> To: user@uima.apache.org
>> Subject: Re: Ruta 2.4.0 - High memory needs
>> Hi,
>> On 18.08.2016 at 14:17, Armin.Wegner@bka.bund.de wrote:
>>> Hello Peter!
>>> Please correct me if I'm wrong. My understanding of how Ruta works is as follows.
>>> 1. The RutaBasic annotations are always created. RETAINTYPE and FILTERTYPE have
>>> no influence on annotation creation. They only influence the use of those types in rules.
>> yes
>>> 2. The configuration parameter "seeders" only adds additional seeders. It cannot
>>> be used to remove the default seeder.
>> No, the parameter specifies all seeders. The default value is set to
>> the default seeder. If you set it to an empty list, no seeders should be
>> applied. If you want to use your own seeder, you simply set the
>> parameter to your implementation.
>> (I am really sure of that, but I will check it again...)
>> Best,
>> Peter
>>> So how do I tell Ruta not to use the default seeder? How do I tell Ruta to use
>>> my own seeder? Do I have to replace org.apache.uima.ruta.seed.DefaultSeeder.java? Won't this
>>> break Ruta?
>>> Best,
>>> Armin
>>> -----Original Message-----
>>> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
>>> Sent: Wednesday, 10 August 2016 14:50
>>> To: user@uima.apache.org
>>> Subject: Re: Ruta 2.4.0 - High memory needs
>>> Hi,
>>> 18 MB of text in a CAS, well, that's quite a big sofa.
>>> Yes, there are some tricks and best practices.
>>> First of all, there is the configuration parameter "lowMemoryProfile",
>>> which reduces the information stored in RutaBasic. It should reduce the
>>> memory usage considerably, but the processing will take longer,
>>> especially if the type hierarchy is rather deep. The unit tests for it
>>> do not cover all functionality of Ruta. I only run all unit tests with
>>> this option once in a while, and I haven't done this for some time.
>>> The second thing to do in order to reduce the memory usage is to
>>> minimize the annotations and especially the RutaBasic annotations. These
>>> are automatically created and build up a minimal, atomic partitioning of
>>> the document. This means that you should create only annotations as
>>> small as you need them, and only annotations where you need them. The
>>> first option here is to remove/replace the seeder if you do not rely on
>>> these annotations (ANY, CW, NUM, PERIOD, ...), or replace it with a
>>> tokenizer if you did not include one anyway. This will rid you of
>>> the annotations for whitespaces and so on and the corresponding
>>> RutaBasic annotations. Maybe you also do not need any kind of annotation
>>> for each section (e.g., restrict the matching window). Optimization
>>> strongly depends on the use case and the actual rules.
>>> Please keep in mind that text spans without any annotations will be
>>> considered invisible with respect to sequential matching.
>>> btw, the speed of your rules can be improved, especially with the
>>> upcoming 2.5.0 release. Besides that, PARTOFNEQ is one of the slowest
>>> conditions in Ruta. I'd rather recommend something like:
>>> Full->{ANY @Full{-> UNMARK(Full)};Full{-> UNMARK(Full) ANY};};
>>> Best,
>>> Peter
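
The point above about RutaBasic annotations forming a minimal, atomic partitioning of the document can be illustrated outside Ruta. The following Java sketch is a conceptual model only and not Ruta's implementation: the class AtomicPartitionSketch and its Span type are hypothetical names, and the sketch assumes that one basic segment is needed between each pair of neighbouring annotation boundaries wherever at least one annotation covers the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Conceptual model only (hypothetical names); not Ruta's actual implementation. */
final class AtomicPartitionSketch {

    /** A half-open annotation span [begin, end) over the document text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Cuts the text at every annotation boundary and returns the atomic
     * segments that are covered by at least one annotation. Text that no
     * annotation covers yields no segment at all.
     */
    static List<Span> partition(List<Span> annotations) {
        TreeSet<Integer> cuts = new TreeSet<>();
        for (Span a : annotations) {
            cuts.add(a.begin);
            cuts.add(a.end);
        }
        List<Span> atoms = new ArrayList<>();
        Integer previous = null;
        for (int cut : cuts) {
            if (previous != null && isCovered(previous, cut, annotations)) {
                atoms.add(new Span(previous, cut));
            }
            previous = cut;
        }
        return atoms;
    }

    /** True if some annotation fully contains the segment [begin, end). */
    private static boolean isCovered(int begin, int end, List<Span> annotations) {
        for (Span a : annotations) {
            if (a.begin <= begin && end <= a.end) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "John Smith" annotated as word, whitespace, word.
        List<Span> tokens = List.of(new Span(0, 4), new Span(4, 5), new Span(5, 10));
        // The same text covered by a single annotation.
        List<Span> coarse = List.of(new Span(0, 10));
        // The same text with the whitespace left unannotated.
        List<Span> gap = List.of(new Span(0, 4), new Span(5, 10));

        System.out.println(partition(tokens).size()); // prints 3
        System.out.println(partition(coarse).size()); // prints 1
        System.out.println(partition(gap).size());    // prints 2
    }
}
```

In this model, a name annotated as word, whitespace and word needs three segments, the same text under a single annotation needs one, and leaving the whitespace unannotated produces a gap without any segment. That mirrors the advice to create only the annotations that are needed, and the warning that unannotated spans are invisible to sequential matching.
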
>>> On 09.08.2016 at 12:37, Armin.Wegner@bka.bund.de wrote:
>>>> Hello again!
>>>> One down, one to go. Are there best practices or tricks to reduce Ruta's
>>>> memory needs? I tried to use the following script to merge names.
>>>> Document{->GREEDYANCHORING(true)};
>>>> First+ Full {->MARK(Full)};
>>>> Full Last+ {->MARK(Full)};
>>>> First+ Last+ {->MARK(Full)};
>>>> Document{->GREEDYANCHORING(false)};
>>>> Full{PARTOFNEQ(Full) -> UNMARK(Full)};
>>>> First{PARTOF(Full) -> UNMARK(First)};
>>>> Last{PARTOF(Full) -> UNMARK(Last)};
>>>> The engine description is created by ruta-maven-plugin:2.4.0 and used with
>>>> uimaFIT's AnalysisEngineFactory.createEngineDescription("fullyQualifiedDescriptorNameWithoutXmlExtension").
>>>> For an 18 MB text, it needs gigabytes of RAM.
>>>> Cheers,
>>>> Armin
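
The intent of the clean-up rule Full{PARTOFNEQ(Full) -> UNMARK(Full)} in the script above can be modelled in plain Java. This is a rough sketch under stated assumptions, not a reproduction of Ruta's matching: it assumes PARTOFNEQ(Full) holds for a Full annotation that lies inside a different, larger Full annotation, and the class MaximalSpanSketch and its Span type are hypothetical names. It keeps only the spans that are not nested in a larger one.

```java
import java.util.ArrayList;
import java.util.List;

/** Conceptual model only (hypothetical names); not Ruta's actual implementation. */
final class MaximalSpanSketch {

    /** A half-open annotation span [begin, end) over the document text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** True if inner lies within outer and is strictly shorter than outer. */
    static boolean properlyInside(Span inner, Span outer) {
        return outer.begin <= inner.begin
                && inner.end <= outer.end
                && (inner.end - inner.begin) < (outer.end - outer.begin);
    }

    /**
     * Returns the spans that are not nested in any larger span.
     * The pairwise comparison is quadratic in the number of spans,
     * which is acceptable for a sketch but not for large inputs.
     */
    static List<Span> keepMaximal(List<Span> spans) {
        List<Span> result = new ArrayList<>();
        for (Span candidate : spans) {
            boolean nested = false;
            for (Span other : spans) {
                if (properlyInside(candidate, other)) {
                    nested = true;
                    break;
                }
            }
            if (!nested) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

For example, given the candidate spans [0, 4), [0, 10) and [5, 10), only [0, 10) survives, which corresponds to keeping the longest merged name and dropping its partial merges.
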
