uima-user mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: Ruta 2.4.0 - High memory needs
Date Thu, 18 Aug 2016 12:38:45 GMT
I'll check that (writing a unit test right now).


On 18.08.2016 at 14:36, Armin.Wegner@bka.bund.de wrote:
> Hi Peter,
>
> It doesn't work like that for me. I've removed DefaultSeeder and added my own seeder
> implementing RutaAnnotationSeeder. Now I have all of Ruta's standard tokens plus my own
> tokenization at the same time.
>
> Cheers,
> Armin
>
> -----Original Message-----
> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
> Sent: Thursday, 18 August 2016 14:23
> To: user@uima.apache.org
> Subject: Re: Ruta 2.4.0 - High memory needs
>
> Hi,
>
>
> On 18.08.2016 at 14:17, Armin.Wegner@bka.bund.de wrote:
>> Hello Peter!
>>
>> Please correct me if I'm wrong. My understanding of how Ruta works is as follows.
>>
>> 1. The RutaBasic annotations are always created. RETAINTYPE and FILTERTYPE have no
>> influence on annotation creation. They only influence the use of those types in rules.
>>
>
> yes
>
>
>> 2. The configuration parameter seeders only adds additional seeders. It cannot be
>> used to remove the default seeder.
> No, the parameter specifies all seeders. The default value is set to
> the default seeder. If you set it to an empty list, no seeders should be
> applied. If you want to use your own seeder, you simply set the
> parameter to your implementation.
>
> (I am really sure of that, but I will check it again...)
>
>
> Best,
>
> Peter
>
>> So how do I tell Ruta not to use the default seeder? How do I tell Ruta to use my
>> own seeder? Do I have to replace org.apache.uima.ruta.seed.DefaultSeeder.java? Won't this
>> break Ruta?
>>
>> Best,
>> Armin
>>
>>
>> -----Original Message-----
>> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
>> Sent: Wednesday, 10 August 2016 14:50
>> To: user@uima.apache.org
>> Subject: Re: Ruta 2.4.0 - High memory needs
>>
>> Hi,
>>
>>
>> 18 MB of text in a CAS, well, that's quite a big sofa.
>>
>>
>> Yes, there are some tricks and best practices.
>>
>>
>> First of all, there is the configuration parameter "lowMemoryProfile",
>> which reduces the information stored in RutaBasic. It should reduce the
>> memory usage considerably, but the processing will take longer,
>> especially if the type hierarchy is rather deep. The unit tests for it
>> do not cover all functionality of Ruta. I only run all unit tests with
>> this option once in a while, and I haven't done this for some time.
>>
>>  
>>
>> The second thing to do in order to reduce the memory usage is to
>> minimize the annotations and especially the RutaBasic annotations. These
>> are automatically created and build up a minimal, atomic partitioning of
>> the document. This means that you should create only annotations as
>> small as you need them, and only annotations where you need them. The
>> first option here is to remove/replace the seeder if you do not rely on
>> these annotations (ANY, CW, NUM, PERIOD, ...), or replace it with a
>> tokenizer if you did not include one anyway. This will rid you of
>> the annotations for whitespaces and so on and the corresponding
>> RutaBasic annotations. Maybe you also do not need any kind of annotation
>> for each section (e.g., restrict the matching window). Optimization
>> strongly depends on the use case and the actual rules.
>>
>> Please mind that text spans without any annotations will be considered
>> invisible concerning sequential matching.
>>
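[Editorial sketch] The effect of dropping whitespace tokens on the number of atomic segments can be illustrated with a small, self-contained Java model. This is not Ruta's implementation: the class `AtomicPartitionSketch` and its methods are hypothetical, and the model simply assumes that one atomic segment is created between every pair of neighbouring annotation boundaries, and only where some annotation covers the text (following the remark above that spans without annotations are invisible).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, simplified model of an atomic partitioning; not taken from Ruta.
class AtomicPartitionSketch {

    /** Returns the atomic segments [begin, end) induced by the given annotation spans. */
    static List<int[]> partition(int[][] spans) {
        // Every annotation begin and end offset is a potential segment boundary.
        TreeSet<Integer> boundaries = new TreeSet<>();
        for (int[] s : spans) {
            boundaries.add(s[0]);
            boundaries.add(s[1]);
        }
        List<int[]> segments = new ArrayList<>();
        Integer prev = null;
        for (int b : boundaries) {
            // Assumption: uncovered text gets no segment at all.
            if (prev != null && isCovered(spans, prev, b)) {
                segments.add(new int[] { prev, b });
            }
            prev = b;
        }
        return segments;
    }

    /** True if at least one span fully contains [begin, end). */
    static boolean isCovered(int[][] spans, int begin, int end) {
        for (int[] s : spans) {
            if (s[0] <= begin && end <= s[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Offsets for the text "Peter Kluegl wrote".
        int[][] withWhitespace = { {0, 5}, {5, 6}, {6, 12}, {12, 13}, {13, 18} };
        int[][] wordsOnly = { {0, 5}, {6, 12}, {13, 18} };
        System.out.println("with whitespace tokens: " + partition(withWhitespace).size());
        System.out.println("word tokens only: " + partition(wordsOnly).size());
    }
}
```

In this model, any larger annotation that spans a whitespace gap brings the segment for that gap back, which is one reason the advice above is to keep annotations as small as the rules allow.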
>>
>> btw, the speed of your rules can be improved, especially with the
>> upcoming 2.5.0 release. Besides that, PARTOFNEQ is one of the slowest
>> conditions in Ruta. I'd rather recommend something like:
>>
>> Full->{ANY @Full{-> UNMARK(Full)};Full{-> UNMARK(Full)} ANY;};
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>> On 09.08.2016 at 12:37, Armin.Wegner@bka.bund.de wrote:
>>> Hello again!
>>>
>>> One down, one to go. Are there best practices or tricks to reduce Ruta's memory
>>> needs? I tried to use the following script to merge names.
>>>
>>> Document{->GREEDYANCHORING(true)};
>>> First+ Full {->MARK(Full)};
>>> Full Last+ {->MARK(Full)};
>>> First+ Last+ {->MARK(Full)};
>>> Document{->GREEDYANCHORING(false)};
>>> Full{PARTOFNEQ(Full) -> UNMARK(Full)};
>>> First{PARTOF(Full) -> UNMARK(First)};
>>> Last{PARTOF(Full) -> UNMARK(Last)};
>>>
>>> The engine description is created by ruta-maven-plugin:2.4.0 and used with uimaFIT's
>>> AnalysisEngineFactory.createEngineDescription("fullyQualifiedDescriptorNameWithoutXmlExtension").
>>> For an 18 MB text, it needs gigabytes of RAM.
>>>
>>> Cheers,
>>> Armin

