uima-user mailing list archives

From: Peter Klügl <peter.klu...@averbis.com>
Subject: Re: Limiting the memory used by an annotator ?
Date: Sun, 30 Apr 2017 10:57:54 GMT
Hi,


here are some Ruta-specific comments in addition to Thilo's and 
Marshall's answers.

- if you do not want to split the CAS into smaller ones, you can also 
sometimes apply the rules only to some parts of the document (-> fewer 
annotations/rule matches created); see the sketch below
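
For illustration, a rough sketch of this idea using the Ruta.apply 
utility from ruta-core (only a sketch: the Paragraph and Amount types 
are placeholders and are assumed to already exist in the type system of 
the CAS). A BLOCK enters each matched annotation, and the inner rules 
only see the text it covers:

    import org.apache.uima.cas.CAS;
    import org.apache.uima.ruta.engine.Ruta;

    public class PartialMatching {
      public static void annotate(CAS cas) throws Exception {
        // The inner rule runs once per Paragraph annotation instead of
        // over the whole document, so far fewer rule matches are created.
        String script = "BLOCK(perParagraph) Paragraph{} {\n"
            + "    NUM{-> MARK(Amount)};\n"
            + "}\n";
        Ruta.apply(cas, script);
      }
    }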

- there is a discussion related to this topic (about memory usage in 
Ruta): https://issues.apache.org/jira/browse/UIMA-5306

- I can add configuration parameters which limit the allowed number of 
rule matches and rule element matches per rule/rule element. If a rule 
or rule element exceeds the limit, a runtime exception is thrown. I'll 
open a Jira ticket for that. In my opinion this is not a solution for 
the underlying problem, but it can help to identify and fix the 
problematic rules.

- I do not want to include code that directly restricts the maximum 
memory in Ruta itself. That should rather happen in the framework or in 
the code that calls/applies the Ruta analysis engine; one possible 
approach is sketched below.
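
A sketch of such a guard in the calling code (my suggestion only, not 
an official UIMA facility): run the engine in a worker thread and give 
up on documents that take too long. Note that cancel(true) is best 
effort, since the rule inference does not check the interrupt flag; a 
hard memory cap would require running the engine in a separate JVM 
whose -Xmx bounds the heap.

    import java.util.concurrent.*;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;

    public class GuardedRunner {
      private final ExecutorService pool = Executors.newSingleThreadExecutor();

      public boolean processWithTimeout(AnalysisEngine ae, CAS cas, long seconds) {
        Future<?> job = pool.submit(() -> {
          try {
            ae.process(cas);
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        });
        try {
          job.get(seconds, TimeUnit.SECONDS);
          return true;
        } catch (TimeoutException e) {
          job.cancel(true);  // best effort; see the note above
          return false;      // discard this CAS instead of taking the service down
        } catch (Exception e) {
          return false;
        }
      }
    }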

- I think there is a problem in Ruta, and several aspects need to be 
considered here: the actual rules, the partitioning with RutaBasic, 
flaws in the implementation, and the configuration parameters of the 
analysis engine.

- Are the rules inefficient (combinatory explosion)? I see Ruta more and 
more as a programming language for building maintainable analysis 
engines faster. You can write efficient and inefficient code. If the 
code/rules are too slow or take too long, you should refactor them and 
replace them with a more efficient approach. Something like ANY+ is a 
good indicator that the rules are not optimal; you should only match on 
things if you have to (see the example below). There is also profiling 
functionality in the Ruta Workbench which shows you how long each rule 
took and how long specific conditions/actions took. This is information 
about speed rather than memory, but rules with many matches both take 
longer and require more memory, so it can serve as an indicator.
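
To make the ANY+ point concrete, a hypothetical example (the Range type 
is invented and assumed to be declared in the type system of the CAS):

    import org.apache.uima.cas.CAS;
    import org.apache.uima.ruta.engine.Ruta;

    public class AnchoringExample {
      // Unanchored: after every NUM in the document the matcher has to try
      // every possible continuation of ANY+?, which can explode on large inputs.
      static final String UNANCHORED =
          "NUM ANY+? NUM{-> MARK(Range, 1, 3)};";

      // Anchored: only number-special-number sequences are attempted at all.
      // SPECIAL is one of Ruta's default seed types (e.g. hyphens).
      static final String ANCHORED =
          "NUM SPECIAL NUM{-> MARK(Range, 1, 3)};";

      public static void annotate(CAS cas) throws Exception {
        Ruta.apply(cas, ANCHORED);
      }
    }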

- There are two specific ways in which Ruta spends its memory: RutaBasic 
and RuleMatches. RutaBasic stores additional information which speeds up 
the rule inference and enables specific functionality. The rule matches 
are needed to remember where something matched, for the conditions and 
actions. You can reduce the memory usage by reducing the number of 
RutaBasic annotations, the number of annotations indexed in the 
RutaBasic annotations, or the number of RuleMatches -> by refactoring 
the rules.

- There are plans to make the implementation of RutaBasic more 
efficient by using more efficient data structures (there are some 
prototypes mentioned in the issue linked above). And I added some new 
configuration parameters (in Ruta 2.6.0, I think) which control which 
information is stored in RutaBasic, e.g., you do not need information 
about annotations if they or their types are not used in the rules.
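
With uimaFIT this could look roughly as follows (a sketch only: the 
indexing parameter name and the script name are assumptions here, 
please check the RutaEngine constants of the release you use):

    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.fit.factory.AnalysisEngineFactory;
    import org.apache.uima.ruta.engine.RutaEngine;

    public class LeanEngine {
      public static AnalysisEngine create() throws Exception {
        return AnalysisEngineFactory.createEngine(RutaEngine.class,
            RutaEngine.PARAM_MAIN_SCRIPT, "my.package.MyScript",
            // assumed parameter name: only index annotations whose types
            // are actually mentioned in the rules
            "indexOnlyMentionedTypes", true);
      }
    }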

- I think there is a flaw in the implementation which causes your 
problem, and which can be fixed. I'll investigate it when I find the 
time. If you can provide some minimal (synthetic) example for 
reproducing it, that would be great.

- There is the configuration parameter lowMemoryProfile, which reduces 
the amount of information stored in RutaBasic; this lowers memory usage 
but makes the rules run slower.
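
Setting it follows the same uimaFIT pattern as in the sketch above (the 
script name is again a placeholder):

    AnalysisEngine ae = AnalysisEngineFactory.createEngine(RutaEngine.class,
        RutaEngine.PARAM_MAIN_SCRIPT, "my.package.MyScript",
        "lowMemoryProfile", true);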


Best,


Peter



On 29.04.2017 at 12:53, Hugues de Mazancourt wrote:
> Hello UIMA users,
>
> I’m currently putting a Ruta-based system into production and I sometimes run
> out of memory.
> This is usually caused by combinatory explosion in Ruta rules. These rules are
> not necessarily faulty: they are adapted to the documents I expect to parse.
> But as this is an open system, people can upload whatever they want, and the
> parser crashes by multiplying annotations (or at least spends 20 minutes
> garbage-collecting millions of annotations).
>
> Thus, my question is: is there a way to limit the memory used by an annotator,
> to limit the number of annotations made by an annotator, or to limit the
> number of matches made by Ruta?
> I would rather cancel the parse of a given document than have 20 minutes of
> downtime for the whole system.
>
> Several UIMA-based services run in production, so I guess that others have
> certainly hit the same problem.
>
> Any hint on that topic would be very helpful.
>
> Thanks,
>
> Hugues de Mazancourt
> http://about.me/mazancourt

