From: Eddie Epstein
Date: Sun, 14 Jun 2020 19:06:07 -0400
Subject: Re: UIMA DUCC slow processing
To: user@uima.apache.org

In this case the problem is not DUCC; rather, it is the high overhead of
opening small files and sending them to a remote computer individually. I/O
works much more efficiently with larger blocks of data. Many small files can
be merged into larger files using zip archives. DUCC sample code shows how to
do this for CASes, and very similar code could be used for input documents as
well.

Implementing efficient scale out is highly dependent on good treatment of
input and output data.

Best,
Eddie

On Sat, Jun 13, 2020 at 6:24 AM Dr. Raja M. Suleman <raja.m.sulaiman@gmail.com> wrote:

> Hello,
>
> Thank you very much for your response and even more so for the detailed
> explanation.
>
> So, if I understand it correctly, DUCC is more suited for scenarios where
> we have large input documents rather than many small ones?
>
> Thank you once again.
>
> On Fri, 12 Jun 2020, 22:18 Eddie Epstein, wrote:
>
> > Hi,
> >
> > In this simple scenario there is a CollectionReader running in a
> > JobDriver process, delivering 100K workitems to multiple remote
> > JobProcesses. The processing time is essentially zero.
> > (30 * 60 seconds) / 100,000 workitems = 18 milliseconds per workitem.
> > This time is roughly the expected overhead of a DUCC JobDriver
> > delivering workitems to remote JobProcesses and recording the results.
> > DUCC jobs are much more efficient if the overhead per workitem is much
> > smaller than the processing time.
> >
> > Typically DUCC jobs would be processing much larger blocks of content
> > per workitem. For example, if a workitem were a document that the
> > CasMultiplier parsed into small CASes, the throughput would be much
> > better. However, with this example, as the number of working JobProcess
> > threads is scaled up, the CR (JobDriver) would become a bottleneck.
> > That's why a typical DUCC Job will not send the Document content as a
> > workitem, but rather send a reference to the workitem content and have
> > the CasMultipliers in the JobProcesses read the content directly from
> > the source.
> >
> > Even though having the JobProcesses read the content is much more
> > efficient, as scaleout continued to increase for this non-computation
> > scenario the bottleneck would eventually move to the underlying
> > filesystem, or to whatever the document source and JobProcess output
> > are. The main motivation for DUCC was jobs similar to those in the DUCC
> > examples, which use OpenNLP to process large documents. That is, jobs
> > where CPU processing is the bottleneck rather than I/O.
> >
> > Hopefully this helps. If not, happy to continue the discussion.
> > Eddie
> >
> > On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <raja.m.sulaiman@gmail.com> wrote:
> >
> > > Hi,
> > > Thank you for your reply and I'm sorry I couldn't get back to this
> > > earlier.
> > >
> > > To get a better picture of the processing speed of DUCC, I made a
> > > dummy pipeline where the CollectionReader runs a for loop to generate
> > > 100k workitems (so no disk reads). Each workitem only has a simple
> > > string in it. These are then passed on to the CasMultiplier, where for
> > > each workitem I'm creating a new CAS with DocumentInfo (again only
> > > having a simple string value) and passing it as a new CAS to the
> > > CasConsumer. The CasConsumer doesn't do anything except add the
> > > Document received in the CAS to the logger. So basically this pipeline
> > > isn't doing anything: no input reads, and the only output is the
> > > information added to the logger. Running this on the cluster with 2
> > > slave nodes with 8 CPUs and 32GB RAM each is still taking more than 30
> > > minutes. I don't understand how this is possible, since no heavy I/O
> > > processing is happening in the code.
> > >
> > > Any ideas please?
> > >
> > > Thank you.
> > >
> > > On 2020/05/18 12:47:41, Eddie Epstein wrote:
> > > > Hi,
> > > >
> > > > Removing the AE from the pipeline was a good idea to help isolate
> > > > the bottleneck. The other two most likely possibilities are the
> > > > collection reader pulling from Elasticsearch or the CAS consumer
> > > > writing the processing output.
> > > >
> > > > DUCC Jobs are a simple way to scale out compute bottlenecks across a
> > > > cluster. Scaleout may be of limited or no value for I/O bound jobs.
> > > > Please give a more complete picture of the processing scenario on
> > > > DUCC.
> > > >
> > > > Regards,
> > > > Eddie
> > > >
> > > >
> > > > On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <Sulemanr@edgehill.ac.uk> wrote:
> > > >
> > > > > Hi,
> > > > > I've been trying to run a very small UIMA DUCC cluster with 2
> > > > > slave nodes having 32GB of RAM each. I wrote a custom Collection
> > > > > Reader to read data from an Elasticsearch index and dump it into a
> > > > > new index after certain analysis engine processing. The Analysis
> > > > > Engine is simple sentiment analysis code. The performance I'm
> > > > > getting is very slow, as it is only able to process ~150
> > > > > documents/minute.
> > > > > To test the performance without the analysis engine, I removed the
> > > > > AE from the pipeline, but I still did not get any improvement in
> > > > > the processing speeds. Can you please guide me as to where I might
> > > > > be going wrong or what I can do to improve the processing speeds?
> > > > >
> > > > > Thank you.
> > > > > ________________________________
> > > > > Edge Hill University
> > > > > Teaching Excellence Framework Gold Award
> > > > > <http://ehu.ac.uk/tef/emailfooter>
> > > > > ________________________________
> > > > > This message is private and confidential.
> > > > > If you have received this message in error, please notify the
> > > > > sender and remove it from your system. Any views or opinions
> > > > > presented are solely those of the author and do not necessarily
> > > > > represent those of Edge Hill or associated companies. Edge Hill
> > > > > University may monitor email traffic data and also the content of
> > > > > email for the purposes of security and business communications
> > > > > during staff absence.
> > > > > <http://ehu.ac.uk/itspolicies/emailfooter>
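
A note on the advice at the top of this thread about merging many small files into zip archives. The sketch below is a minimal illustration in plain Java and is not the DUCC sample code mentioned there. The class name SmallDocZipper and its pack/unpack methods are hypothetical, invented for this example; only the java.util.zip classes are real. It assumes the documents are small UTF-8 strings that fit in memory, so a batch can be shipped as one larger block and the receiving side can iterate over the entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not part of UIMA or DUCC.
class SmallDocZipper {

    // Packs every (entry name -> text) pair into one in-memory zip archive.
    static byte[] pack(Map<String, String> docs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> doc : docs.entrySet()) {
                zip.putNextEntry(new ZipEntry(doc.getKey()));
                zip.write(doc.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads all entries back in archive order, as a worker would after
    // receiving a single archive instead of many individual documents.
    static Map<String, String> unpack(byte[] archive) {
        Map<String, String> docs = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            byte[] buffer = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream text = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    text.write(buffer, 0, n);
                }
                docs.put(entry.getName(), new String(text.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }

    public static void main(String[] args) {
        // Made-up input: 1000 tiny documents, similar in spirit to the dummy pipeline.
        Map<String, String> docs = new LinkedHashMap<>();
        for (int i = 0; i < 1000; i++) {
            docs.put("doc-" + i + ".txt", "simple string " + i);
        }
        byte[] archive = pack(docs);
        System.out.println(docs.size() + " documents packed into " + archive.length + " bytes");
        System.out.println(unpack(archive).size() + " documents read back");
    }
}
```

In a real job the archive would more likely be written to shared storage, with the workitem carrying only a reference to it, in line with the earlier point in the thread about sending references rather than document content.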