uima-user mailing list archives

From Eddie Epstein <eaepst...@gmail.com>
Subject Re: Scale out using multiple Collection Readers and Cas Consumers
Date Wed, 01 Dec 2010 13:39:57 GMT
Hi Greg,

> OK, so, bottom line, if we want to process terabytes of text contained in
> millions of files, and we want to do it in a cluster of hundreds of
> machines, and we want that cluster to scale linearly and infinitely without
> bottle-necks, and we want to use UIMA-AS to do it, then we've got a lot of
> work ahead of us?  There's no existing example configurations or code that
> shows how to do this?
>
> If we did do that work, are you confident that AS doesn't have any inherent
> bottle-necks that would prevent scaling to that level?  Was it designed to
> do that kind of thing?  The multiple Collection Reader idea wouldn't really
> be able to do that, would it?

There are no claims that UIMA AS will scale indefinitely. The design in
figure 5 simply eliminates the bottleneck of a single collection reader.
As yet, no code for distributed readers is offered.
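
For illustration only, here is a minimal sketch of how a known file set
could be split among several readers. It is not UIMA AS code and uses no
UIMA API; the names reader_for and partition are hypothetical, and it
assumes the full list of paths is available up front, which is not the
case for the crawling and on-demand scenarios raised below.

```python
import zlib


def reader_for(path: str, num_readers: int) -> int:
    """Map a file path to a reader index using a stable hash (CRC32).

    Hypothetical helper: every reader can evaluate this independently
    and agree on who owns which file, with no coordination.
    """
    return zlib.crc32(path.encode("utf-8")) % num_readers


def partition(paths, num_readers):
    """Split paths into num_readers disjoint shards, one per reader."""
    shards = [[] for _ in range(num_readers)]
    for p in paths:
        shards[reader_for(p, num_readers)].append(p)
    return shards


if __name__ == "__main__":
    files = [f"doc{i}.txt" for i in range(10)]
    for idx, shard in enumerate(partition(files, 3)):
        print(idx, shard)
```

A hash-based split keeps each reader stateless, but shard sizes are only
roughly balanced and depend on how the paths hash.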

>
> What if there's no obvious way to partition the file set?  Say, for example,
> we're crawling a web site, like amazon.com?
>
> What if the file set is not known (and so can't be partitioned), such as if
> we have an on-demand service that is receiving a steady series of random job
> submissions from different clients, each wanting to process different doc
> sets from different repositories?  How could AS be configured to ensure
> efficient use of the hardware (load balanced, all CPU cores at 100%)?  And
> fairness to the competing clients?
>
> The AS architecture has always been a bit fuzzy to me.  Any insights on how
> to achieve extreme scalability with AS would be appreciated.

My experiences with UIMA AS are mostly with applications deployed
on a single cluster of multi-core machines interconnected with a high
performance network. The largest cluster we have worked with is several
hundred nodes. We see hundreds of GB/sec of data flowing between
clients and services through a single broker. The load is evenly distributed
among all instances of a service type. Client requests are processed
in the order they are queued.
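
The queueing behavior described above can be sketched roughly as follows.
This is a generic shared-queue illustration, not UIMA AS code: an
in-process queue stands in for the broker, threads stand in for service
instances, and run_services is a hypothetical name. Each instance pulls
the next request as soon as it is free, so requests leave the queue in
FIFO order and busy instances do not accumulate a backlog of their own.

```python
import queue
import threading


def run_services(requests, num_instances):
    """Process requests with num_instances workers sharing one FIFO queue.

    Returns one list per worker holding the requests it handled.
    """
    work = queue.Queue()  # stand-in for the broker's request queue
    for r in requests:
        work.put(r)

    handled = [[] for _ in range(num_instances)]

    def service(idx):
        # Each instance pulls the next queued request when it is idle.
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # queue drained, instance stops
            handled[idx].append(item)  # stand-in for running the analysis

    threads = [threading.Thread(target=service, args=(i,))
               for i in range(num_instances)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


if __name__ == "__main__":
    for idx, items in enumerate(run_services(list(range(20)), 4)):
        print(idx, items)
```

How evenly the work ends up spread depends on how long each request
takes; the sketch only shows the pull-from-one-queue pattern.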

The strength of UIMA AS is that it makes it easy to scale out pipelines
that exceed the processing resources of individual nodes, with no changes
to annotator and flow controller code or descriptors. Achieving high
CPU utilization may require a bit of sophistication, as always, but
UIMA AS includes the tools to facilitate that process.

Eddie
