uima-user mailing list archives

From holmberg2066@comcast.net (g...@holmberg.name)
Subject Re: (one other) Multithreading question
Date Mon, 12 Nov 2007 23:16:54 GMT

Of course, it's hard for me to diagnose your cluster from just this information, so maybe
I'm missing something, but I can't see how taking a system in which some threads are indirectly
blocking due to I/O (sockets with Oracle) and directly making the threads block through synchronization
is going to help anything.

Unless you think the problem is that Oracle is thrashing and would have more throughput with
fewer requests.  I would guess, given the engineering resources that Oracle has and the uses
its customers have put it to, that Oracle can deal with hundreds of simultaneous requests.

If I were you, I would first locate the bottleneck.  Is it network bandwidth?  NIC bandwidth?
 Oracle Disk I/O?  Paging on the Oracle box?  Front-side bus bandwidth (common on multi-core

Given my experience building a similar clustered system on UIMA, the first thing I'd look
at is the bandwidth usage.  With 15 nodes (you don't say how many cores--let me guess 60),
you probably don't need 224 threads to keep the CPUs busy.  I run just a few more threads
than I have cores (maybe 10% more).  The trick is to design your software so that the document
is on the network from its storage source to its pipeline exactly once.  Then all annotators
must be local, in the same JVM, so all data movement in the pipeline is in the same address
space.  Then put results on the network exactly once, from pipeline to storage destination.
 You should be able to get the pipeline to the point where it is spending less than 5% of
its elapsed time for a document blocking on I/O.
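
A minimal sketch of the thread-count rule of thumb above (a few more threads than cores, roughly 10% more). The class name PoolSizing, the method threadsFor, and the choice to round the surplus up are illustrative assumptions, not part of UIMA or of the system described here.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper, not a UIMA API: sizes a worker pool slightly above the core count.
final class PoolSizing {
    // Returns cores plus surplusPercent of cores, with the surplus rounded up.
    // Integer arithmetic avoids floating-point rounding surprises.
    static int threadsFor(int cores, int surplusPercent) {
        if (cores < 1 || surplusPercent < 0) {
            throw new IllegalArgumentException("cores must be >= 1 and surplusPercent >= 0");
        }
        return cores + (cores * surplusPercent + 99) / 100;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int threads = threadsFor(cores, 10); // assumed 10% surplus, per the rule of thumb
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        System.out.println(cores + " cores -> " + threads + " pipeline threads");
        pool.shutdown(); // no work submitted in this sketch
    }
}
```

With the guessed 60 cores, this gives 66 threads rather than 224.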

I've measured the bandwidth at the TCP/IP level of gigabit networking at 57 megabytes/sec
out of a single machine/NIC.  That would be a ceiling, and I'm sure an Oracle instance will
be well below that, due to disk I/O.  So measure your throughput (keeping in mind the data
expansion in whatever protocol you're using--hopefully not SOAP!), and compare it to 57.  This
will give you some idea of how much room for improvement you have, and whether the bottleneck
is network I/O or something else.  A tool like Ethereal may be useful here.
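
A rough sketch of the comparison described above: convert a measured byte count and elapsed time into megabytes/sec and express it as a fraction of the 57 megabytes/sec ceiling. The class and method names are hypothetical, the sample numbers in main are made up for illustration, and a megabyte is assumed to be 10^6 bytes.

```java
// Hypothetical helper for comparing measured throughput against the 57 MB/s figure.
final class BandwidthCheck {
    // Ceiling quoted in this message for one machine/NIC on gigabit Ethernet.
    static final double CEILING_MB_PER_SEC = 57.0;

    // Throughput in megabytes/sec, assuming 1 megabyte = 1,000,000 bytes.
    static double megabytesPerSecond(long bytes, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return bytes / 1_000_000.0 / seconds;
    }

    // Fraction of the ceiling in use; 1.0 means the link is saturated.
    static double fractionOfCeiling(double mbPerSec) {
        return mbPerSec / CEILING_MB_PER_SEC;
    }

    public static void main(String[] args) {
        long bytesOnWire = 570_000_000L; // made-up sample: bytes counted on the wire
        double elapsedSeconds = 20.0;    // made-up sample: capture duration
        double rate = megabytesPerSecond(bytesOnWire, elapsedSeconds);
        System.out.printf("%.1f MB/s, %.0f%% of ceiling%n",
                rate, 100 * fractionOfCeiling(rate));
    }
}
```

A result well below 1.0 suggests the network is not the limit and the bottleneck lies elsewhere; count the bytes on the wire, not the payload size, so that protocol expansion is included.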

If it's something else, start looking at the CPU and disk usage on the Oracle box.  Maybe
more RAM and a bigger SGA would help.  Maybe RAID-0 disks would help.  It all depends on exactly
what the problem is.  You probably need an Oracle expert.

Hope this helps,

Greg Holmberg

 -------------- Original message ----------------------
From: Steve Suppe <ssuppe@llnl.gov>
> A slightly different (but related question):
> I've been playing around with this type of computation.  We are loading 
> data into a DB.  We have a small Linux cluster (15 multi-core nodes at the 
> moment) that we have scaled up to run 224+ instances of our pipeline.  I've 
> noticed for most of our calculations, it's really Oracle that is holding us 
> back.
> In some instances, a 'computation' is classified as one 'medium to long' 
> data pull from Oracle, a bunch of analysis, and then 'small to large' 
> insertions of results.  I've dabbled in placing static DB connections and 
> mutexes through the code to guarantee that the instances on a machine only 
> access the DB one at a time, but are free to run analysis simultaneously 
> otherwise.
> I have also toyed with the idea of locks that allow N number of connections 
> (instead of only  the mutual exclusion one at a time) so that I can 
> increase the connections to a point, but not overload the system.
> Has anyone tried anything like this?  Or is anyone else at least running a 
> similar hardware set-up?  It would be great to compare notes.
> Thanks,
> Steve
> At 04:09 AM 11/12/2007, Marshall Schor wrote:
> >This may not be quite precise enough.  Your Annotators will be
> >instantiated multiple times,  so that a single *instance* of an
> >annotator will not be run on multiple threads at once.  So - if you have
> >non-static fields in your annotator, they do not need to be accessed
> >with threading in mind. But if you make use of "static" fields, there is
> >only one instance of these, so access to them must be thread-safe.
> >
> >If your *application*  (not your annotator) is multi-threaded, it will
> >need to be thread-safe. You can find relevant information about this in
> >the tutorial and reference docs for UIMA (search for "thread").
> >
> >-Marshall
> >
> >Michael Baessler wrote:
> > > Benjamin Sznajder wrote:
> > >> Hi all,
> > >>
> > >> I am interested in using multi-threading in UIMA.
> > >> My aim is that the flow runs several annotators in parallel.
> > >> One of my annotators is not thread-safe. My question is, then,
> > >> Does the UIMA parallelism ( setting "MultipleDeployment=true") require
> > >> that annotators on which this flag is set be thread-safe?
> > >>
> > >> Regards,
> > >> Benjamin.
> > >>
> > >>
> > > Yes, your annotators have to be thread-safe when you want to run them
> > > multi-threaded.
> > >
> > > -- Michael
> > >
> > >
