uima-user mailing list archives

From Hugues de Mazancourt <hug...@mazancourt.com>
Subject Re: Limiting the memory used by an annotator ?
Date Sun, 30 Apr 2017 20:15:21 GMT
Thanks to all for your advice.
In my specific case, this was a Ruta problem (Peter, I filed a JIRA issue with a minimal
example), which argues for the "TooManyMatchesException" feature you propose.
I vote for it.
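
If the feature lands as an unchecked exception, the calling code could guard each
document roughly like this. A minimal sketch in Java; the exception name and shape
are assumptions, since the feature does not exist yet:

    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.CAS;

    // Hypothetical stand-in for the proposed exception; not part of Ruta today.
    class TooManyMatchesException extends RuntimeException {}

    public class GuardedProcessor {
        private final AnalysisEngine rutaEngine;

        public GuardedProcessor(AnalysisEngine rutaEngine) {
            this.rutaEngine = rutaEngine;
        }

        /** Returns true if the document was processed, false if it was skipped. */
        public boolean processSafely(CAS cas) throws AnalysisEngineProcessException {
            try {
                rutaEngine.process(cas);
                return true;
            } catch (TooManyMatchesException e) {
                cas.reset(); // drop the partial results, keep the service alive
                return false;
            }
        }
    }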

Of course, I already limit the size of input texts, but this is not enough.
One of the main strengths of UIMA is the ability to integrate annotators produced by
third parties. Each annotator is built on assumptions, at the very least that its input
is a text formed of words, etc. Pipelines thus grow more and more complex, without one
having to code all the processing oneself.
But in a production environment anything can happen, and assumptions may not hold (e.g.
non-textual data can be sent to the engine(s)). Sh** always happens in production.

My case is a more specific one, but I’m sure it can be generalized.

Thus, any feature that helps limit the damage from unexpected input would be welcome.
A size-limited FsIndexRepository seems to me a simple yet powerful solution to many
such problems.
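
Until something like that exists in the framework, a crude stopgap is to check the index
size after each document and reject oversized results. A minimal sketch (the cap is
arbitrary, and this only catches a blow-up after the fact rather than stopping it
mid-rule):

    import org.apache.uima.cas.CAS;

    public final class AnnotationCountGuard {
        // Arbitrary cap, tune per pipeline. A real limit would have to live
        // inside the index repository to stop a blow-up early.
        private static final int MAX_ANNOTATIONS = 500_000;

        public static boolean withinLimit(CAS cas) {
            return cas.getAnnotationIndex().size() <= MAX_ANNOTATIONS;
        }
    }

If the check fails, the CAS can be reset and the document flagged instead of being
passed downstream.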

Best,

— Hugues


PS: apart from occasional problems, Ruta is a great platform for information extraction.
I love it!

> On 30 Apr 2017, at 12:57, Peter Klügl <peter.kluegl@averbis.com> wrote:
> 
> Hi,
> 
> 
> here are some Ruta-specific comments, in addition to Thilo's and Marshall's answers.
> 
> - If you do not want to split the CAS into smaller ones, you can also
> sometimes apply the rules to only some parts of the document (-> fewer
> annotations/rule matches created).
> 
> - There is a discussion related to this topic (about memory usage in Ruta): https://issues.apache.org/jira/browse/UIMA-5306
> 
> - I can add configuration parameters which limit the allowed number of rule
> matches and rule element matches per rule/rule element. If a rule or rule
> element exceeds the limit, a runtime exception is thrown. I'll open a JIRA
> ticket for that. In my opinion this is not a solution to the problem itself,
> but it can help to identify and fix the problematic rules.
> 
> - I do not want to include code that directly restricts the maximum memory
> in Ruta. That should rather happen in the framework, or in the code that
> calls/applies the Ruta analysis engine (a caller-side sketch appears after
> this list).
> 
> - I think there is a problem in Ruta, and there are several aspects that
> need to be considered here: the actual rules, the partitioning with
> RutaBasic, flaws in the implementation, and the configuration parameters of
> the analysis engine.
> 
> - Are the rules inefficient (combinatorial explosion)? I see Ruta more and
> more as a programming language for quickly creating maintainable analysis
> engines. You can write efficient and inefficient code. If the rules are too
> slow or take too long, you should refactor them and replace them with a more
> efficient approach. Something like ANY+ is a good indicator that the rules
> are not optimal; you should only match on things if you have to. There is
> also profiling functionality in the Ruta Workbench which shows how long each
> rule took and how long specific conditions/actions took. This is information
> about speed rather than memory, but many rule matches take longer and
> require more memory, so it can be an indicator.
> 
> - There are two specific aspects of how Ruta spends its memory: RutaBasic
> and RuleMatches. RutaBasic stores additional information which speeds up the
> rule inference and enables specific functionality. The rule matches are
> needed to remember where something matched, for the conditions and actions.
> You can reduce the memory usage by reducing the number of RutaBasic
> annotations, the number of annotations indexed in the RutaBasic annotations,
> or the number of RuleMatches -> by refactoring the rules.
> 
> - There are plans to make the implementation of RutaBasic more efficient by
> using more efficient data structures (there are some prototypes mentioned in
> the issue linked above). And I added some new configuration parameters (in
> Ruta 2.6.0, I think) which control which information is stored in RutaBasic,
> e.g. you do not need information about annotations if they or their types
> are not used in the rules.
> 
> - I think there is a flaw in the implementation which causes your problem,
> and which can be fixed. I'll investigate it when I find the time. If you can
> provide a minimal (synthetic) example for reproducing it, that would be
> great.
> 
> - There is the configuration parameter lowMemoryProfile, which reduces the
> information stored in RutaBasic; it lowers memory usage but makes the rules
> run slower (setting it from Java is sketched below).
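
Setting that parameter from Java looks roughly like this. A sketch: the descriptor
path is an assumption, and only the parameter name comes from the mail above:

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.util.XMLInputSource;

    public class LowMemoryRuta {
        public static AnalysisEngine create() throws Exception {
            // The descriptor path is an assumption; point it at your own engine.
            AnalysisEngineDescription desc = UIMAFramework.getXMLParser()
                    .parseAnalysisEngineDescription(new XMLInputSource("RutaEngine.xml"));
            // Trade speed for memory, as described in the mail above.
            desc.getAnalysisEngineMetaData().getConfigurationParameterSettings()
                    .setParameterValue("lowMemoryProfile", true);
            return UIMAFramework.produceAnalysisEngine(desc);
        }
    }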
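
And the caller-side restriction mentioned earlier in the list can be approximated with
a plain thread-pool timeout. A minimal sketch; the timeout policy is my assumption, and
note that process() is not guaranteed to react to interruption:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;

    public class TimeBoundedRunner {
        private final ExecutorService executor = Executors.newSingleThreadExecutor();

        /** Runs the engine, giving up after the timeout. Cancellation is best
         *  effort: process() may not react to interruption, so a runaway
         *  document can still pin the worker thread. */
        public boolean process(AnalysisEngine engine, CAS cas, long timeoutSeconds)
                throws Exception {
            Future<?> job = executor.submit(() -> {
                engine.process(cas);
                return null;
            });
            try {
                job.get(timeoutSeconds, TimeUnit.SECONDS);
                return true;
            } catch (TimeoutException e) {
                job.cancel(true); // best effort, see note above
                // The worker may still touch this CAS: discard it, do not reuse it.
                return false;
            }
        }
    }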
> 
> 
> Best,
> 
> 
> Peter
> 
> 
> 
> On 29 Apr 2017, at 12:53, Hugues de Mazancourt wrote:
>> Hello UIMA users,
>> 
>> I'm currently putting a Ruta-based system into production, and I sometimes
>> run out of memory.
>> This is usually caused by combinatorial explosion in the Ruta rules. These
>> rules are not necessarily faulty: they are adapted to the documents I
>> expect to parse. But as this is an open system, people can upload whatever
>> they want, and the parser crashes by multiplying annotations (or at least
>> spends 20 minutes garbage-collecting millions of annotations).
>> 
>> Thus, my question is: is there a way to limit the memory used by an
>> annotator, to limit the number of annotations made by an annotator, or to
>> limit the number of matches made by Ruta?
>> I would rather cancel the parse of a single document than have 20 minutes
>> of downtime for the whole system.
>> 
>> Several UIMA-based services run in production; others have certainly hit
>> the same problem.
>> 
>> Any hint on that topic would be very helpful.
>> 
>> Thanks,
>> 
>> Hugues de Mazancourt
>> http://about.me/mazancourt
>> 

