uima-user mailing list archives

From "Greg Holmberg" <holmberg2...@comcast.net>
Subject Re: remoteAnalysisEngine services not scaling to effect
Date Mon, 26 Sep 2011 20:31:28 GMT

I don't know what the cause of your specific technical issue is, but in my  
opinion, there's a better way to slice the problem.

What you're doing is taking each step in your analysis engine and running
it on one or more machines.  This creates two problems.

One, it's a lot of network overhead.  You're moving each document across  
the network many times.  You can easily spend more time just moving the  
data around than actually processing.  It also puts a low ceiling on
scalability, since you chew up a lot of network bandwidth.

Two, in order to use your hardware efficiently, you have to get the right  
ratio of machines/CPUs for each step.  Some steps use more cycles than  
others.  For example, you might find that for a given configuration and
set of documents, the ratio of CPU usage for steps A, B, and C is
1:5:2.  Now you need to instantiate A, B, and C services to use cores in
that ratio.  Then, suppose you want to add more machines--how should you  
allocate them to A, B, and C?  It will always be lumpy, with some cores  
not being used much.  But worse, with a different configuration (different  
dictionaries, for example), or with different documents (longer vs.  
shorter, for example), the ratios will change, and you will have to  
reconfigure your machines again.  It's never-ending, and it's never  
completely right.
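
To put rough numbers on the lumpiness, here is a small Java sketch.  The
1:5:2 ratio is the example above; the core counts, the StepAllocation
class, and the rule that each service instance occupies one whole core are
assumptions made up for illustration.  It hands out cores one at a time to
whichever step is currently the bottleneck, then reports what fraction of
the cores can actually be kept busy.

```java
import java.util.Arrays;

/** Hypothetical illustration: whole cores split across pipeline steps. */
class StepAllocation {

    /**
     * Hands out cores one at a time, always to the step that is currently
     * the bottleneck (smallest cores/weight). Every step gets at least one.
     */
    static int[] allocate(int[] weights, int totalCores) {
        if (totalCores < weights.length) {
            throw new IllegalArgumentException("need at least one core per step");
        }
        int[] cores = new int[weights.length];
        Arrays.fill(cores, 1);
        for (int used = weights.length; used < totalCores; used++) {
            int bottleneck = 0;
            for (int i = 1; i < weights.length; i++) {
                // cores[i]/weights[i] < cores[b]/weights[b], cross-multiplied
                // so the comparison stays in integer arithmetic.
                if ((long) cores[i] * weights[bottleneck]
                        < (long) cores[bottleneck] * weights[i]) {
                    bottleneck = i;
                }
            }
            cores[bottleneck]++;
        }
        return cores;
    }

    /** Fraction of all cores doing useful work when the slowest step sets the pace. */
    static double utilization(int[] weights, int[] cores) {
        double pace = Double.MAX_VALUE;
        int weightSum = 0;
        int coreSum = 0;
        for (int i = 0; i < weights.length; i++) {
            pace = Math.min(pace, (double) cores[i] / weights[i]);
            weightSum += weights[i];
            coreSum += cores[i];
        }
        return pace * weightSum / coreSum;
    }

    public static void main(String[] args) {
        int[] weights = {1, 5, 2}; // relative CPU cost of steps A, B, C
        for (int n : new int[] {8, 10, 12}) {
            int[] cores = allocate(weights, n);
            System.out.printf("%d cores -> %s, utilization %.0f%%%n",
                    n, Arrays.toString(cores), 100 * utilization(weights, cores));
        }
    }
}
```

With these made-up numbers, 8 cores split cleanly as 1/5/2, but at 10 cores
the best whole-core split still leaves about a fifth of the capacity idle,
and any change in the ratio means redoing the split.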

So, it would be much easier to manage, more efficient, and more scalable
if you just run your analysis engine self-contained in a single process,
and then replicate the engine over your machines/CPUs.  You slice by
document, not by service--send each document to a different analysis
engine instance.  This makes your life easier, always runs the CPUs at
100%, and scales indefinitely.  Just add more machines and it goes faster.

This is what I'm doing.  I use JavaSpaces (producer/consumer queue), but  
I'm sure you can get the same effect with UIMA AS and ActiveMQ.
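
As a rough illustration of slicing by document, here is a minimal
single-JVM sketch in plain Java.  It is not the JavaSpaces setup mentioned
above and it uses no UIMA API: DocumentSlicer, analyze, and the worker
count are hypothetical stand-ins.  A BlockingQueue plays the role of the
shared producer/consumer queue, and each worker thread stands in for one
self-contained analysis engine instance that runs every step on the
documents it pulls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

class DocumentSlicer {
    // Sentinel object telling a worker that no more documents are coming.
    private static final String POISON = new String("<end>");

    /** Hypothetical stand-in for the whole pipeline (steps A, B, C) on one document. */
    static int analyze(String document) {
        String trimmed = document.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Runs `workers` identical engines; each pulls whole documents off one queue. */
    static Map<String, Integer> processAll(List<String> documents, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Keyed by document text for brevity, so duplicate documents collapse.
        Map<String, Integer> results = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String doc = queue.take();
                        if (doc == POISON) {
                            return; // identity check: only the sentinel stops a worker
                        }
                        results.put(doc, analyze(doc));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        queue.addAll(documents);          // producer side: one entry per document
        for (int i = 0; i < workers; i++) {
            queue.add(POISON);            // one sentinel per worker
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a b c", "hello world", "x");
        System.out.println(processAll(docs, 2));
    }
}
```

Nothing here depends on which step is expensive: a faster worker simply
takes the next document sooner, so there is no per-step ratio to tune.  In
a real deployment the queue would live outside the JVM (a JavaSpace, or
something like UIMA AS with ActiveMQ) so that workers on other machines
can pull from it.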

