uima-user mailing list archives

From Steve Suppe <ssu...@llnl.gov>
Subject Re: (one other) Multithreading question
Date Tue, 13 Nov 2007 16:05:05 GMT
I suppose my thought was that each node was trying to pull info from the DB 
at the same time and then each taking a different amount of time to do its 
analysis.  So I was trying to let the nodes do their local processing 
simultaneously, while mitigating the thrashing (as you said) on Oracle 
that occurs when several of them happen to finish close together.

You are correct, however - I should be taking a more engineering 
approach.  Some of the things you've mentioned we've already taken steps to 
do (such as trying to keep net bandwidth down by keeping annotators local 
to a node).  However, I have no real metrics (the old standby, trying to 
find time to do it...).  Are there other open source tools you recommend 
besides Ethereal?  I used to use Ethereal back in the day for network 
security aspects (in my other life), but had no idea it could provide 
performance metrics.

Anyhow, thanks to you and Thilo again for the help thus far, even if you 
don't have any more time!
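
For what it's worth, here is a minimal sketch of the "allow N connections" idea from my earlier message (quoted below).  It uses java.util.concurrent.Semaphore.  The class name DbGate and its methods are made up for illustration and are not part of UIMA or of our code, and it assumes all pipeline instances on a node share one JVM.  Separate processes would not see the same semaphore.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

/**
 * Hypothetical helper: lets at most maxConcurrent threads in this JVM
 * run database work at the same time. Analysis code that runs outside
 * of run() is not limited in any way.
 */
final class DbGate {
    private final Semaphore permits;

    DbGate(int maxConcurrent) {
        // Fair ordering, so a waiting pipeline is served first-come, first-served.
        this.permits = new Semaphore(maxConcurrent, true);
    }

    /** Blocks until a permit is free, runs the DB work, then frees the permit. */
    <T> T run(Callable<T> dbWork) throws Exception {
        permits.acquire();
        try {
            return dbWork.call();
        } finally {
            // Released even if the DB work throws.
            permits.release();
        }
    }

    /** Number of permits currently free (useful for logging). */
    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A single static DbGate per JVM with maxConcurrent = 1 behaves like the mutex I described, and raising the number is the "increase the connections to a point" knob.  Whether this helps at all is exactly what the measurements would have to show.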


At 03:16 PM 11/12/2007, greg@holmberg.name wrote:
>Of course, it's hard for me to diagnose your cluster from just this 
>information, so maybe I'm missing something, but I can't see how taking a 
>system in which some threads are indirectly blocking due to I/O (sockets 
>with Oracle) and directly making the threads block through synchronization 
>is going to help anything.
>Unless you think the problem is that Oracle is thrashing and would have 
>more throughput with fewer requests.  I would guess, given the engineering 
>resources that Oracle has and the uses its customers have put it to, that 
>Oracle can deal with hundreds of simultaneous requests.
>If I were you, I would first locate the bottleneck.  Is it network 
>bandwidth?  NIC bandwidth?  Oracle Disk I/O?  Paging on the Oracle 
>box?  Front-side bus bandwidth (common on multi-core machines)?
>Given my experience building a similar clustered system on UIMA, the first 
>thing I'd look at is the bandwidth usage.  With 15 nodes (you don't say 
>how many cores--let me guess 60), you probably don't need 224 threads to 
>keep the CPUs busy.  I run just a few more threads than I have cores 
>(maybe 10% more).  The trick is to design your software so that the 
>document is on the network from its storage source to its pipeline exactly 
>once.  Then all annotators must be local, in the same JVM, so all data 
>movement in the pipeline is in the same address space.  Then put results 
>on the network exactly once, from pipeline to storage destination.  You 
>should be able to get the pipeline to the point where it is spending less 
>than 5% of its elapsed time for a document blocking on I/O.
>I've measured the bandwidth at the TCP/IP level of gigabit networking at 
>57 megabytes/sec out of a single machine/NIC.  That would be a ceiling, 
>and I'm sure an Oracle instance will be well below that, due to disk 
>I/O.  So measure your throughput (keeping in mind the data expansion in 
>whatever protocol you're using--hopefully not SOAP!), and compare it to 
>57.  This will give you some idea of how much room for improvement you 
>have, and whether the bottleneck is network I/O or something else.  A tool 
>like Ethereal may be useful here.
>If it's something else, start looking at the CPU and disk usage on the 
>Oracle box.  Maybe more RAM and a bigger SGA would help.  Maybe RAID-0 
>disks would help.  It all depends on exactly what the problem is.  You 
>probably need an Oracle expert.
>Hope this helps,
>Greg Holmberg
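
To make Greg's two rules of thumb concrete, here is a small sketch of the arithmetic: threads at roughly 10% above the core count, and measured throughput as a fraction of his 57 megabytes/sec figure.  The class and method names are made up for illustration, and it assumes a megabyte means 10^6 bytes, which Greg's message does not specify.

```java
/** Hypothetical helper for the sizing arithmetic in Greg's message. */
final class ClusterSizing {
    // Greg's measured TCP/IP ceiling on gigabit networking, in megabytes/sec.
    static final double GIGABIT_CEILING_MB_PER_SEC = 57.0;

    /** Cores plus about 10%, rounded up, using integer math to avoid rounding surprises. */
    static int suggestedThreads(int cores) {
        return cores + (cores + 9) / 10;
    }

    /**
     * Measured throughput as a fraction of the ceiling.
     * Assumption: 1 megabyte = 1,000,000 bytes.
     */
    static double ceilingFraction(long bytesMoved, double elapsedSeconds) {
        double mbPerSec = bytesMoved / 1_000_000.0 / elapsedSeconds;
        return mbPerSec / GIGABIT_CEILING_MB_PER_SEC;
    }
}
```

With Greg's guess of 60 cores, suggestedThreads(60) gives 66 threads across the cluster, against the 224+ instances we run now.  The bytesMoved number has to include protocol overhead, so it should come from a measurement at the network level rather than from document sizes.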
>  -------------- Original message ----------------------
>From: Steve Suppe <ssuppe@llnl.gov>
> > A slightly different (but related question):
> >
> > I've been playing around with this type of computation.  We are loading
> > data into a DB.  We have a small Linux cluster (15 multi-core nodes at the
> > moment) that we have scaled up to run 224+ instances of our pipeline.  I've
> > noticed for most of our calculations, it's really Oracle that is holding us
> > back.
> >
> > In some instances, a 'computation' is classified as one 'medium to long'
> > data pull from Oracle, a bunch of analysis, and then 'small to large'
> > insertions of results.  I've dabbled in placing static DB connections and
> > mutexes through the code to guarantee that the instances on a machine only
> > access the DB one at a time, but are free to run analysis simultaneously
> > otherwise.
> >
> > I have also toyed with the idea of locks that allow N connections
> > (instead of only the mutual exclusion one at a time) so that I can
> > increase the connections to a point, but not overload the system.
> >
> > Has anyone tried anything like this?  Or is anyone else at least running a
> > similar hardware set-up?  It would be great to compare notes.
> >
> > Thanks,
> > Steve
> >
> > At 04:09 AM 11/12/2007, Marshall Schor wrote:
> > >This may not be quite precise enough.  Your Annotators will be
> > >instantiated multiple times,  so that a single *instance* of an
> > >annotator will not be run on multiple threads at once.  So - if you have
> > >non-static fields in your annotator, they do not need to be accessed
> > >with threading in mind. But if you make use of "static" fields, there is
> > >only one instance of these, so access to them must be thread-safe.
> > >
> > >If your *application*  (not your annotator) is multi-threaded, it will
> > >need to be thread-safe. You can find relevant information about this in
> > >the tutorial and reference docs for UIMA (search for "thread").
> > >
> > >-Marshall
> > >
> > >Michael Baessler wrote:
> > > > Benjamin Sznajder wrote:
> > > >> Hi all,
> > > >>
> > > >> I am interested in using multi-threading in UIMA.
> > > >> My aim is that the flow runs several annotators in parallel.
> > > >> One of my annotators is not thread-safe. My question is, then,
> > > >> Does the UIMA parallelism (setting "MultipleDeployment=true") require
> > > >> that annotators on which this flag is set are thread-safe?
> > > >>
> > > >> Regards,
> > > >> Benjamin.
> > > >>
> > > >>
> > > > Yes, your annotators have to be thread-safe when you want to run them
> > > > multi-threaded.
> > > >
> > > > -- Michael
> > > >
> > > >
