uima-user mailing list archives

From "Greg Holmberg" <holmberg2...@comcast.net>
Subject Re: Scale out using multiple Collection Readers and Cas Consumers
Date Thu, 02 Dec 2010 06:09:49 GMT
Hi Eddie--



> My experiences with UIMA AS are mostly with applications deployed
> on a single cluster of multi-core machines interconnected with a high
> performance network.

By "high performance" you mean something more than gigabit ethernet, like  
Infiniband or 10 GB optical fiber?

> The largest cluster we have worked with is several
> hundred nodes. We see hundreds of MB/sec of data flowing between
> clients and services thru a single broker. The load is evenly distributed
> among all instances of a service type. Client requests are processed
> in the order they are queued.

I'm having trouble picturing this system landscape--could you describe how  
the various pieces of data (content, control messages, status messages,  
etc.) move through the system, from document source (or app) to result  
database?  I'd like to see where the network I/O is and where the disk
I/O is, and what data formats are used.

By "broker" do you mean Active MQ?

How do clients submit requests to the cluster?  Do you support non-Java  
clients?  What does a request contain?  Can the client monitor the  
progress of a request?
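
For reference, here's my current (quite possibly wrong) mental model of how a
Java client hands work to a service, pieced together from the client API docs.
The broker URL, queue name, and context-map keys below are my assumptions, and
I'm guessing the tcp:// URL is where ActiveMQ comes in:

    // Sketch only -- based on my reading of the UIMA AS client docs; the
    // context-map keys and listener method names are as I understand them.
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.uima.aae.client.UimaAsBaseCallbackListener;
    import org.apache.uima.aae.client.UimaAsynchronousEngine;
    import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.collection.EntityProcessStatus;

    public class SubmitSketch {
      public static void main(String[] args) throws Exception {
        UimaAsynchronousEngine engine = new BaseUIMAAsynchronousEngine_impl();

        // Asynchronous per-CAS completion callbacks -- the only per-request
        // progress signal I know of.
        engine.addStatusCallbackListener(new UimaAsBaseCallbackListener() {
          @Override
          public void entityProcessComplete(CAS cas, EntityProcessStatus status) {
            System.out.println("CAS done, exception? " + status.isException());
          }
        });

        Map<String, Object> ctx = new HashMap<String, Object>();
        ctx.put(UimaAsynchronousEngine.ServerUri, "tcp://broker-host:61616"); // ActiveMQ broker (placeholder)
        ctx.put(UimaAsynchronousEngine.ENDPOINT, "myServiceQueue");           // service input queue (placeholder)
        ctx.put(UimaAsynchronousEngine.CasPoolSize, 5);
        engine.initialize(ctx);

        CAS cas = engine.getCAS();
        cas.setDocumentText("One document per CAS, serialized through the broker.");
        engine.sendCAS(cas);                   // fire-and-forget; reply arrives via the listener

        engine.collectionProcessingComplete(); // block until outstanding CASes are done
        engine.stop();
      }
    }
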

Is the broker a bottleneck?  Does all content pass through it?  How many
times does each document (in one form or another) pass through the broker?

How does a web crawler fit into the system?

Does one request have to completely finish before another can start?  Are  
there priorities?  What about requests from interactive applications, where
the user is waiting?

Given that document processing time varies significantly, and different  
requests may use different aggregate engines, how do you manage to keep  
all the CPUs equally (and hopefully fully) busy?

How does a client get the annotators that it needs deployed into the  
cluster?
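
The only deployment paths I've found so far are bin/deployAsyncService.sh and
the deploy() call on the client API for co-located services -- something like
the sketch below, where the descriptor path and the dd2spring/Saxon entries are
placeholders taken from my reading of the docs.  Is that really how an
aggregate is supposed to get pushed out to every node in a cluster?

    // Sketch only: co-locating a service in the client JVM via deploy().
    // Constant names and the dd2spring/Saxon entries are my reading of the
    // docs; all paths are placeholders.
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.uima.aae.client.UimaAsynchronousEngine;
    import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;

    public class DeploySketch {
      public static void main(String[] args) throws Exception {
        UimaAsynchronousEngine engine = new BaseUIMAAsynchronousEngine_impl();
        Map<String, Object> appCtx = new HashMap<String, Object>();
        appCtx.put(UimaAsynchronousEngine.DD2SpringXsltFilePath, "/path/to/dd2spring.xsl");
        appCtx.put(UimaAsynchronousEngine.SaxonClasspath, "file:/path/to/saxon8.jar");
        // Starts the service described by the deployment descriptor inside
        // this JVM and returns an id for it.
        String id = engine.deploy("/path/to/MyAggregateDeploymentDescriptor.xml", appCtx);
        System.out.println("deployed: " + id);
        engine.stop();  // as I understand it, this also tears the co-located service down
      }
    }
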

Is every machine performing the same function, or do they specialize in a  
particular annotator?  That is, is an aggregate engine self-contained in a  
single JVM, or is it split over multiple machines?

If a machine crashes, can there be data loss?  How do you recover?

Can you increase or decrease the capacity of the system without disrupting  
service?

So many questions, I know.  But I think these are legitimate issues when  
building a system, and I don't see how AS handles them.  Someone really  
needs to write a paper...

> The strength of UIMA AS is to easily scale out pipelines that
> exceed the processing resources of individual nodes with no changes to
> annotator and flow controller code or descriptors. Achieving high
> CPU utilization may require a bit of sophistication, as always, but
> UIMA AS includes the tools to facilitate that process.

Really? To me AS seems more like a box of Legos and a picture (but no  
instructions) of a really cool airplane you can build if you've got the  
time and expertise.

Sorry about that.  I'm just having a really hard time seeing how to build  
a reliable, scalable, efficient document processing service on AS.  It
seems more theoretical than practical.


Greg Holmberg
