uima-user mailing list archives

From "Greg Holmberg" <holmberg2...@comcast.net>
Subject Re: Scale out using multiple Collection Readers and Cas Consumers
Date Tue, 30 Nov 2010 23:13:10 GMT
On Tue, 30 Nov 2010 13:23:27 -0800, Eddie Epstein <eaepstein@gmail.com> wrote:

> I agree with Jerry that there is no code in UIMA packages explicitly
> for this. I'd suggest looking at
> examples/src/org/apache/uima/examples/casMultiplier/SimpleTextSegmenter.java
> for an example CasMultiplier that can easily be adapted. Another
> suggestion is to assemble and test the aggregate before deploying it
> as a service. Much easier to debug.
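For anyone else following this thread: as I understand it, the heart of that example is just iterator-style segmentation, where process() takes in one large CAS and hasNext()/next() hand out one child CAS per segment. Here's a plain-Java sketch of that splitting loop as I understand it (no UIMA plumbing; the segment size and the paragraph-boundary break rule are my assumptions, not necessarily what SimpleTextSegmenter actually does):

```java
// Sketch of the hasNext()/next() iteration a CasMultiplier performs.
// In real UIMA code this class would extend JCasMultiplier_ImplBase and
// next() would fill a fresh CAS obtained from getEmptyJCas(); here next()
// just returns the segment text so the logic is visible on its own.
public class SegmenterSketch {
    private final String text;
    private final int segmentSize;
    private int pos = 0;

    public SegmenterSketch(String text, int segmentSize) {
        this.text = text;
        this.segmentSize = segmentSize;
    }

    // Tells the framework whether more child CASes are coming.
    public boolean hasNext() {
        return pos < text.length();
    }

    // Produces the next segment: at least segmentSize characters,
    // extended to the next paragraph boundary ("\n\n") if one exists.
    public String next() {
        int end = Math.min(pos + segmentSize, text.length());
        int brk = text.indexOf("\n\n", end);
        end = (brk >= 0) ? brk + 2 : text.length();
        String segment = text.substring(pos, end);
        pos = end;
        return segment;
    }
}
```

The nice property of this shape is that downstream annotators only ever see one small segment at a time, so memory use stays bounded no matter how large the original document is.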

OK, so, bottom line: if we want to process terabytes of text contained in  
millions of files, and we want to do it in a cluster of hundreds of  
machines, and we want that cluster to scale linearly and indefinitely  
without bottlenecks, and we want to use UIMA-AS to do it, then we've got  
a lot of work ahead of us?  There are no existing example configurations  
or code that show how to do this?

If we did do that work, are you confident that AS doesn't have any  
inherent bottlenecks that would prevent scaling to that level?  Was it  
designed to do that kind of thing?  The multiple Collection Reader idea  
wouldn't really be able to do that, would it?

What if there's no obvious way to partition the file set?  Say, for  
example, we're crawling a web site, like amazon.com?

What if the file set is not known (and so can't be partitioned), such as  
if we have an on-demand service that is receiving a steady series of  
random job submissions from different clients, each wanting to process  
different doc sets from different repositories?  How could AS be  
configured to ensure efficient use of the hardware (load balanced, all CPU  
cores at 100%)?  And fairness to the competing clients?
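To frame the load-balancing half of that question: my current understanding is that UIMA-AS balances work by having every deployed instance of a service pull CASes from one shared JMS input queue, so a deployment descriptor along these lines would spread work across instances automatically (the names, counts, and broker URL below are all made up for illustration):

```xml
<!-- Hypothetical UIMA-AS deployment descriptor. Every instance deployed
     with this descriptor listens on the same input queue and pulls the
     next available CAS, which is what gives load balancing across
     processes and machines. -->
<analysisEngineDeploymentDescription
    xmlns="http://uima.apache.org/resourceSpecifier">
  <name>MyAnnotatorService</name>
  <description>One of many identical deployments of the same aggregate</description>
  <deployment protocol="jms" provider="activemq">
    <casPool numberOfCASes="8"/>
    <service>
      <!-- All instances, on all machines, share this queue on one broker -->
      <inputQueue endpoint="MyAnnotatorQueue"
                  brokerURL="tcp://broker-host:61616"/>
      <topDescriptor>
        <import location="MyAggregateDescriptor.xml"/>
      </topDescriptor>
      <analysisEngine>
        <scaleout numberOfInstances="8"/>
      </analysisEngine>
    </service>
  </deployment>
</analysisEngineDeploymentDescription>
```

What I can't see is how that mechanism alone guarantees fairness between competing clients, which is really the part I'm asking about.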

The AS architecture has always been a bit fuzzy to me.  Any insights on  
how to achieve extreme scalability with AS would be appreciated.

Greg Holmberg
