uima-user mailing list archives

From Eddie Epstein <eaepst...@gmail.com>
Subject Re: Scale out using multiple Collection Readers and Cas Consumers
Date Thu, 02 Dec 2010 14:23:56 GMT
Wow, that's a lot of questions ... and here we go ...

On Thu, Dec 2, 2010 at 1:09 AM, Greg Holmberg <holmberg2066@comcast.net> wrote:
> Hi Eddie--
>> My experiences with UIMA AS are mostly with applications deployed
>> on a single cluster of multi-core machines interconnected with a high
>> performance network.
> By "high performance" you mean something more than gigabit ethernet, like
> InfiniBand or 10 Gb optical fiber?

For us, just 1 GigE and 10 GigE so far.

>> The largest cluster we have worked with is several
>> hundred nodes. We see hundreds of MB/sec of data flowing between
>> clients and services through a single broker. The load is evenly distributed
>> among all instances of a service type. Client requests are processed
>> in the order they are queued.
> I'm having trouble picturing this system landscape--could you describe how
> the various pieces of data (content, control messages, status messages,
> etc.) move through the system, from document source (or app) to result
> database?  I'd like to see where the network I/O is and where the disk I/O
> is, and what data formats are used.

It varies: we have several systems, and each is different. In one case, a
multimodal speech-to-speech system, the SofaURI was used to route audio
data through a separate audio media controller, while the CAS flow carried
control and result information. In other systems the CASes contain pointers
to data on NFS. Others just put all the data in the CAS.

> By "broker" do you mean Active MQ?


> How do clients submit requests to the cluster?  Do you support non-Java
> clients?  What does a request contain?  Can the client monitor the progress
> of a request?

The UIMA AS client API is available only in Java as of now. It is possible
to create clients in other programming languages because AMQ client code
exists for several of them. We use the AMQ-C++ client code for UIMA C++
services, but as yet have had no need for C++ clients. Interactive
applications typically use HTTP servlets as clients to backend UIMA AS services.

> Is the broker a bottleneck?  Does all content pass through it?  How many
> times does each document (in one form or another) pass through the broker?

Although the broker is conceptually a bottleneck, we have not seen it become
one. There are workarounds; for example, it is easy to deploy different
services using different brokers.

> How does a web crawler fit into the system?
> Does one request have to completely finish before another can start?  Are
> there priorities?  What about requests from interactive applications, where
> the user is waiting?

The UIMA AS client API has both synchronous and asynchronous interfaces to
process(CAS). As with any interactive application, the number of service
instances must be sufficient to meet latency requirements. Client-side
process timeouts are available.
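
To illustrate the difference between the two calling styles and a client-side
timeout, here is a minimal sketch in Python using only the standard library.
It is not the UIMA AS client API: `process`, `process_sync`, `process_async`
and the dictionary standing in for a CAS are all hypothetical names chosen for
the example.

```python
import concurrent.futures
import time

def process(cas):
    """Hypothetical stand-in for a remote analysis service call."""
    time.sleep(cas["work_seconds"])
    return {"id": cas["id"], "status": "done"}

def process_sync(executor, cas, timeout):
    """Blocking style: submit the request and wait up to `timeout` seconds."""
    future = executor.submit(process, cas)
    return future.result(timeout=timeout)  # raises TimeoutError if too slow

def process_async(executor, cas, on_done):
    """Non-blocking style: return at once, call `on_done` when the reply arrives."""
    future = executor.submit(process, cas)
    future.add_done_callback(lambda f: on_done(f.result()))
    return future

replies = []
timed_out = False
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # Synchronous call that finishes well inside its timeout.
    fast = process_sync(executor, {"id": 1, "work_seconds": 0.01}, timeout=2.0)
    # Asynchronous call: the reply is delivered to the callback.
    pending = process_async(executor, {"id": 2, "work_seconds": 0.01}, replies.append)
    # Synchronous call that exceeds its client-side timeout.
    try:
        process_sync(executor, {"id": 3, "work_seconds": 0.5}, timeout=0.05)
    except concurrent.futures.TimeoutError:
        timed_out = True
    pending.result(timeout=2.0)
```

In this sketch a timeout only stops the client from waiting; the slow request
keeps running in the background, so an interactive client still has to decide
whether to resubmit or report an error.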

> Given that document processing time varies significantly, and different
> requests may use different aggregate engines, how do you manage to keep all
> the CPUs equally (and hopefully fully) busy?

Ideally this is simple: 1. configure each server node so that CPU utilization
is maximized when all of its service instances are busy; 2. make sure the CAS
pools in the clients are large enough to keep all service instances busy.
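
Point 2 can be sketched as follows. This is a toy model in Python with only
the standard library, not UIMA AS code: a semaphore plays the role of the
client's CAS pool, a thread pool plays the role of the service instances, and
`run_client`, `cas_pool_size` and `service_instances` are hypothetical names.
It assumes each request holds one CAS from submission until its reply returns.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def run_client(cas_pool_size, service_instances, documents):
    """Return the peak number of service instances busy at the same time."""
    cas_pool = threading.Semaphore(cas_pool_size)  # one permit per CAS in the pool
    lock = threading.Lock()
    busy = 0
    peak = 0

    def service(doc):
        nonlocal busy, peak
        with lock:
            busy += 1
            peak = max(peak, busy)
        time.sleep(0.05)        # pretend to analyze the document
        with lock:
            busy -= 1
        cas_pool.release()      # reply received: the CAS goes back to the pool

    with ThreadPoolExecutor(max_workers=service_instances) as services:
        for doc in documents:
            cas_pool.acquire()  # block until a CAS is free
            services.submit(service, doc)
    return peak

# A pool of 2 CASes can never keep more than 2 of the 4 instances busy.
peak_small = run_client(cas_pool_size=2, service_instances=4, documents=range(12))
# A pool of 8 CASes lets the client keep up to all 4 instances busy.
peak_large = run_client(cas_pool_size=8, service_instances=4, documents=range(12))
```

The model ignores queueing and network latency; in a real deployment those
are a reason to size the pool somewhat above the number of service instances.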

> How does a client get the annotators that it needs deployed into the
> cluster?

See other threads on service life cycle management. Short answer: this
is currently outside the scope of the UIMA AS code.

> Is every machine performing the same function, or do they specialize in a
> particular annotator?  That is, is an aggregate engine self-contained in a
> single JVM, or is it split over multiple machines?

All are true. Depends on the analytics.

> If a machine crashes, can there be data loss?  How do you recover?
> Can you increase or decrease the capacity of the system without disrupting
> service?

Interactive systems are designed with redundancy, with no single point of
failure. If a UIMA AS request times out it should be resubmitted. UIMA AS
service instances can be added or removed dynamically at runtime.
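
The resubmit-on-timeout idea can be sketched like this. It is a minimal
illustration in Python using only the standard library, not the UIMA AS
client API; `submit_with_retry`, `flaky_service` and `max_attempts` are
hypothetical names, and the hung first call merely simulates a crashed node.

```python
import concurrent.futures
import time

def submit_with_retry(executor, service, request, timeout, max_attempts=3):
    """Submit `request`; if no reply arrives within `timeout`, submit it again."""
    for attempt in range(1, max_attempts + 1):
        future = executor.submit(service, request)
        try:
            return future.result(timeout=timeout), attempt
        except concurrent.futures.TimeoutError:
            # Best effort only: a call that is already running is not interrupted.
            future.cancel()
    raise RuntimeError(f"no reply after {max_attempts} attempts")

calls = []

def flaky_service(request):
    """Hypothetical service whose first call hangs, as if its node had crashed."""
    calls.append(request)
    if len(calls) == 1:
        time.sleep(0.5)
    return request.upper()

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    reply, attempts = submit_with_retry(executor, flaky_service, "doc-1", timeout=0.1)
```

Because the original request may still complete after it has been resubmitted,
this pattern is safest when processing a document twice is harmless.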

> So many questions, I know.  But I think these are legitimate issues when
> building a system, and I don't see how AS handles them.  Someone really
> needs to write a paper...

There is a paper on the use of UIMA AS for GALE. I'm sure more will come.

>> The strength of UIMA AS is to easily scale out pipelines that
>> exceed the processing resources of individual nodes with no changes to
>> annotator and flow controller code or descriptors. Achieving high
>> CPU utilization may require a bit of sophistication, as always, but
>> UIMA AS includes the tools to facilitate that process.
> Really? To me AS seems more like a box of Legos and a picture (but no
> instructions) of a really cool airplane you can build if you've got the time
> and expertise.

Definitely some truth to this.

> Sorry about that.  I'm just having a really hard time seeing how to build a
> reliable, scalable, efficient document processing service on AS.  It seems
> more theoretical than practical.
> Greg Holmberg

It would be nice to have a turnkey application for UIMA AS. So far we have
been focusing on getting full coverage for all UIMA functionality as well as
maximizing performance at all levels of the runtime system.
