uima-user mailing list archives

From "Dr. Raja M. Suleman" <raja.m.sulai...@gmail.com>
Subject Re: UIMA DUCC slow processing
Date Sat, 13 Jun 2020 10:24:22 GMT
Hello,

Thank you very much for your response and even more so for the detailed
explanation.

So, if I understand it correctly, DUCC is more suited for scenarios where
we have large input documents rather than many small ones?

Thank you once again.

On Fri, 12 Jun 2020, 22:18 Eddie Epstein, <eaepstein@gmail.com> wrote:

> Hi,
>
> In this simple scenario there is a CollectionReader running in a JobDriver
> process, delivering 100K workitems to multiple remote JobProcesses. The
> processing time is essentially zero: (30 * 60 seconds) / 100,000 workitems
> = 18 milliseconds per workitem. This time is roughly the expected overhead
> of a DUCC JobDriver delivering workitems to remote JobProcesses and
> recording the results. DUCC jobs are much more efficient if the overhead
> per workitem is much smaller than the processing time.
>
> Typically DUCC jobs would be processing much larger blocks of content per
> workitem. For example, if a workitem were a document, and the document were
> parsed into small CASes by the CasMultiplier, the throughput would be
> much better. However, with this example, as the number of working
> JobProcess threads is scaled up, the CR (JobDriver) would become a
> bottleneck. That's why a typical DUCC Job will not send the document
> content as a workitem, but rather send a reference to the workitem content
> and have the CasMultipliers in the JobProcesses read the content directly
> from the source.
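>
> For illustration, a CasMultiplier along those lines might look roughly like
> this (a simplified sketch, not the actual DUCC example code; the file-path
> convention and the blank-line segmentation are just assumptions):
>
>     import java.io.IOException;
>     import java.nio.charset.StandardCharsets;
>     import java.nio.file.Files;
>     import java.nio.file.Paths;
>
>     import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
>     import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
>     import org.apache.uima.cas.AbstractCas;
>     import org.apache.uima.jcas.JCas;
>
>     public class WorkItemCasMultiplier extends JCasMultiplier_ImplBase {
>
>         private String[] segments;   // small pieces of the current document
>         private int nextSegment;
>
>         @Override
>         public void process(JCas workItemCas) throws AnalysisEngineProcessException {
>             // The work item CAS carries only a reference (here, a file path
>             // as the document text), never the document content itself.
>             String path = workItemCas.getDocumentText();
>             try {
>                 String content = new String(
>                         Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
>                 // Crude segmentation on blank lines, just for illustration.
>                 segments = content.split("\n\\s*\n");
>                 nextSegment = 0;
>             } catch (IOException e) {
>                 throw new AnalysisEngineProcessException(e);
>             }
>         }
>
>         @Override
>         public boolean hasNext() {
>             return segments != null && nextSegment < segments.length;
>         }
>
>         @Override
>         public AbstractCas next() {
>             // Emit each small segment as its own CAS; the JobDriver never
>             // ships document content over the wire.
>             JCas cas = getEmptyJCas();
>             cas.setDocumentText(segments[nextSegment++]);
>             return cas;
>         }
>     }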
>
> Even though having the JobProcesses read the content directly is much more
> efficient, as scaleout continues to increase for this non-computation
> scenario the bottleneck would eventually move to the underlying filesystem,
> or to whatever the document source and JobProcess output destination are.
> The main motivation for DUCC was jobs similar to those in the DUCC examples,
> which use OpenNLP to process large documents. That is, jobs where CPU
> processing is the bottleneck rather than I/O.
>
> Hopefully this helps. If not, happy to continue the discussion.
> Eddie
>
> On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <
> raja.m.sulaiman@gmail.com> wrote:
>
> > Hi,
> > Thank you for your reply and I'm sorry I couldn't get back to this
> > earlier.
> >
> > To get a better picture of the processing speed of DUCC, I made a dummy
> > pipeline where the CollectionReader runs a for loop to generate 100k
> > workitems (so no disk reads). Each workitem only has a simple string in
> > it. These are then passed on to the CasMultiplier, where for each
> > workitem I'm creating a new CAS with DocumentInfo (again holding only a
> > simple string value) and passing it as a new CAS to the CasConsumer. The
> > CasConsumer doesn't do anything except add the document received in the
> > CAS to the logger. So basically this pipeline isn't doing anything: no
> > input reads, and the only output is the information added to the logger.
> > Running this on the cluster with 2 slave nodes with 8 CPUs and 32GB RAM
> > each still takes more than 30 minutes. I don't understand how this is
> > possible, since no heavy I/O processing is happening in the code.
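> >
> > For reference, the dummy reader is essentially just this (a simplified
> > sketch; the names are illustrative):
> >
> >     import org.apache.uima.cas.CAS;
> >     import org.apache.uima.collection.CollectionReader_ImplBase;
> >     import org.apache.uima.util.Progress;
> >     import org.apache.uima.util.ProgressImpl;
> >
> >     public class DummyReader extends CollectionReader_ImplBase {
> >
> >         private static final int TOTAL = 100000;
> >         private int generated = 0;
> >
> >         @Override
> >         public boolean hasNext() {
> >             return generated < TOTAL;
> >         }
> >
> >         @Override
> >         public void getNext(CAS cas) {
> >             // Each workitem is just a short string; no disk or network I/O.
> >             cas.setDocumentText("workitem-" + generated);
> >             generated++;
> >         }
> >
> >         @Override
> >         public Progress[] getProgress() {
> >             return new Progress[] {
> >                     new ProgressImpl(generated, TOTAL, Progress.ENTITIES) };
> >         }
> >
> >         @Override
> >         public void close() {
> >         }
> >     }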
> >
> > Any ideas please?
> >
> > Thank you.
> >
> > On 2020/05/18 12:47:41, Eddie Epstein <eaepstein@gmail.com> wrote:
> > > Hi,
> > >
> > > Removing the AE from the pipeline was a good idea to help isolate the
> > > bottleneck. The other two most likely possibilities are the collection
> > > reader pulling from Elasticsearch or the CAS consumer writing the
> > > processing output.
> > >
> > > DUCC Jobs are a simple way to scale out compute bottlenecks across a
> > > cluster. Scaleout may be of limited or no value for I/O-bound jobs.
> > > Please give a more complete picture of the processing scenario on DUCC.
> > >
> > > Regards,
> > > Eddie
> > >
> > >
> > > On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <
> > > Sulemanr@edgehill.ac.uk> wrote:
> > >
> > > > Hi,
> > > > I've been trying to run a very small UIMA DUCC cluster with 2 slave
> > > > nodes having 32GB of RAM each. I wrote a custom Collection Reader to
> > > > read data from an Elasticsearch index and dump it into a new index
> > > > after certain analysis engine processing. The Analysis Engine is a
> > > > simple sentiment analysis component. The performance I'm getting is
> > > > very slow, as it is only able to process ~150 documents/minute.
> > > > To test the performance without the analysis engine, I removed the
> > > > AE from the pipeline, but I still did not get any improvement in the
> > > > processing speeds. Can you please guide me as to where I might be
> > > > going wrong or what I can do to improve the processing speeds?
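> > > >
> > > > In case it helps, the read side is essentially a standard scroll
> > > > search (a simplified sketch, not my exact code; the host, index name,
> > > > and batch size are illustrative, and it assumes the 7.x high-level
> > > > REST client):
> > > >
> > > >     import java.io.IOException;
> > > >
> > > >     import org.apache.http.HttpHost;
> > > >     import org.elasticsearch.action.search.SearchRequest;
> > > >     import org.elasticsearch.action.search.SearchResponse;
> > > >     import org.elasticsearch.action.search.SearchScrollRequest;
> > > >     import org.elasticsearch.client.RequestOptions;
> > > >     import org.elasticsearch.client.RestClient;
> > > >     import org.elasticsearch.client.RestHighLevelClient;
> > > >     import org.elasticsearch.common.unit.TimeValue;
> > > >     import org.elasticsearch.index.query.QueryBuilders;
> > > >     import org.elasticsearch.search.SearchHit;
> > > >     import org.elasticsearch.search.builder.SearchSourceBuilder;
> > > >
> > > >     public class ScrollReadSketch {
> > > >         public static void main(String[] args) throws IOException {
> > > >             RestHighLevelClient client = new RestHighLevelClient(
> > > >                     RestClient.builder(new HttpHost("localhost", 9200, "http")));
> > > >
> > > >             // Scroll through the source index in batches; each hit
> > > >             // becomes one workitem for the pipeline.
> > > >             SearchRequest request = new SearchRequest("source-index");
> > > >             request.scroll(TimeValue.timeValueMinutes(1));
> > > >             request.source(new SearchSourceBuilder()
> > > >                     .query(QueryBuilders.matchAllQuery())
> > > >                     .size(500));   // documents per round trip
> > > >
> > > >             SearchResponse response = client.search(request, RequestOptions.DEFAULT);
> > > >             while (response.getHits().getHits().length > 0) {
> > > >                 for (SearchHit hit : response.getHits().getHits()) {
> > > >                     // hand hit.getSourceAsString() to the pipeline here
> > > >                 }
> > > >                 SearchScrollRequest scroll =
> > > >                         new SearchScrollRequest(response.getScrollId());
> > > >                 scroll.scroll(TimeValue.timeValueMinutes(1));
> > > >                 response = client.scroll(scroll, RequestOptions.DEFAULT);
> > > >             }
> > > >             client.close();
> > > >         }
> > > >     }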
> > > >
> > > > Thank you.
> > > >
> > >
> >
>
