Date: Fri, 15 Sep 2017 09:28:50 +0200
From: Nicolas Paris
To: user@uima.apache.org
Subject: Re: UIMA analysis from a database

- Spark is simpler to learn than UIMA-AS (at least, I don't know DUCC).
- Spark is more general-purpose and can be used in other projects; for example, I have used the same design to transform PDF to text with Apache PDFBox.
- Spark can run under the YARN or Mesos job managers, on clusters of 10,000+ machines.
- Spark benefits from HDFS distributed storage.
- Spark benefits from newer optimized data formats such as Avro, a very robust, distributed binary format.
- Spark processes partitioned data and writes to disk in batches (faster than one by one).
- Spark instantiates only one UIMA pipeline per partition and passes all of the partition's text through it, with good performance.
- Spark can use Python/Java/Scala/R/Julia to preprocess texts and then send the result to UIMA.
- Spark has connectors for databases and interfaces well with Apache Sqoop, to fetch data from a relational database in parallel very easily.
- Spark has native machine-learning tooling, which can be extended with Python or R libraries.
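The "one pipeline per partition" point can be sketched in plain Python, with a generator standing in for Spark's mapPartitions call; UimaPipeline here is a hypothetical stand-in for a wrapped, initialized UIMA analysis engine, not a real class:

```python
# Sketch of the "one pipeline per partition" pattern. In Spark this is
# rdd.mapPartitions(process_partition); a plain generator stands in here
# so the idea is runnable without a cluster. UimaPipeline is a hypothetical
# stand-in for a wrapped UIMA analysis engine.

class UimaPipeline:
    """Hypothetical wrapper around an initialized UIMA analysis engine."""
    instances_created = 0

    def __init__(self):
        UimaPipeline.instances_created += 1  # expensive init happens once per partition

    def annotate(self, text):
        # Real code would fill a CAS and run the engine; here we fake it.
        return {"text": text, "tokens": text.split()}

def process_partition(rows):
    pipeline = UimaPipeline()          # one pipeline per partition, not per row
    for text in rows:
        yield pipeline.annotate(text)  # stream every row through the same instance

# Two "partitions" of documents, as Spark would hand them to workers:
partitions = [["doc one", "doc two"], ["doc three"]]
results = [ann for part in partitions for ann in process_partition(part)]
```

The point of the pattern is amortization: pipeline initialization is paid once per partition rather than once per document.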
- UIMA-AS is another way to program UIMA.
- UIMA-FIT is complicated.
- UIMA-FIT only works with UIMA.
- UIMA focuses only on text annotation.
- UIMA is not good at:
  - text transformation
  - reading data from a source in parallel
  - writing data to a folder in parallel
  - machine-learning interfaces

The only difficult part has already been addressed: making it work. You can read my messy repository to get started.

On 15 Sep 2017 at 04:28, Osborne, John D wrote:
> Hi Nicolas,
>
> I'm curious, why did you decide to use Spark over UIMA-AS or DUCC? Is it because you are more familiar with Spark, or were there other reasons?
>
> I have been using UIMA-AS, am currently experimenting with DUCC, and would love to hear your thoughts on the matter.
>
> -John
>
> ________________________________________
> From: Nicolas Paris [niparisco@gmail.com]
> Sent: Thursday, September 14, 2017 5:32 PM
> To: user@uima.apache.org
> Subject: Re: UIMA analysis from a database
>
> Hi Benedict,
>
> Not sure this is helpful for you, but here is some advice.
> I recommend using UIMA for what it is primarily intended for: NLP pipelines.
>
> For multi-threaded work, I would go with dedicated technologies.
>
> I have been successfully using UIMA together with Apache Spark. While this
> design works well on a single computer, I am now able to distribute a UIMA
> pipeline over dozens of machines with no extra effort.
>
> I focus on the UIMA pipeline doing its job well and, after testing,
> industrialize it over Spark.
>
> Advantages of this design:
> - benefit from Spark's distribution expertise (node failure, memory
>   consumption, data partitioning...)
> - simplify UIMA programming (no multithreading inside, only NLP work)
> - scale when needed (add more cheap computers, get better performance)
> - build Spark expertise, and reuse it with any Java code you'd like
> - Spark has JDBC connectors and can easily fetch data in
>   parallel.
>
> You can find a working example in my repo: https://github.com/parisni/UimaOnSpark
> It was not simple to get working, but I can now say this
> method is robust and optimized.
>
> On 14 Sep 2017 at 21:24, Benedict Holland wrote:
> > Hello everyone,
> >
> > I am trying to get my project off the ground and have hit a small problem.
> >
> > I want to read text from a large database (let's say 100,000+ rows). Each
> > row holds a text article. I want to connect to the database, request a
> > single row, and process that document through an NLP
> > engine, and I want to do this in parallel. Each document will be, say, split
> > into sentences, and each sentence will be POS-tagged.
> >
> > After reading the documentation, I am more confused than when I started. I
> > think I want something like the FileSystemCollectionReader example and to
> > build a CPE. Instead of reading from the file system, it would read from the
> > database.
> >
> > There are two problems with this approach:
> >
> > 1. I am not sure it is multi-threaded: CAS initializers are deprecated, and
> > it appears that the getNext() method will only run in a single thread.
> > 2. The FileSystemCollectionReader loads references to the file locations
> > into memory, but not the text itself.
> >
> > For problem 1, the line I find very troubling is
> >
> >     File file = (File) mFiles.get(mCurrentIndex++);
> >
> > I have to assume from this line that CollectionReader_ImplBase is not
> > multi-threaded but is intended to rapidly iterate over a set of documents
> > in a single thread.
> >
> > Problem 2 is easily solved, as I can create a massive array of integers if I
> > feel like it.
> >
> > Anyway, after deciding that this is not likely the solution, I looked into
> > multi-view Sofa annotators. I don't think these do what I want either. In
> > this context, I would treat the database table as a single object with many
> > "views" being chunks of rows. I don't think this works, based on the
> > SofaExampleAnnotator code provided. It also appears to run in a single
> > thread.
> >
> > This leaves me with CAS pools. I know that these are going to be
> > multi-threaded. I believe I create however many CAS objects from the
> > annotator I want, probably an aggregate annotator. Is this correct, and am I
> > on the right track with CAS pools?
> >
> > Thank you so much,
> > ~Ben
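PS: regarding reading the 100,000+ rows in parallel: the idea boils down to range-partitioned queries, roughly what Spark's JDBC source does with its partitionColumn / lowerBound / upperBound options. A minimal sketch, with sqlite3 standing in for the real database and the "articles" table and its columns made up for illustration:

```python
# Sketch of range-partitioned parallel reads: split the table's id range into
# chunks so each worker issues its own bounded query, with no coordination.
# sqlite3 stands in for a real relational database; "articles" is hypothetical.
import sqlite3
import tempfile

def partition_bounds(lower, upper, num_partitions):
    """Yield (lo, hi] bounds splitting the id range into num_partitions chunks."""
    step = max((upper - lower) // num_partitions, 1)
    edges = list(range(lower, upper, step)) + [upper]
    for lo, hi in zip(edges, edges[1:]):
        yield lo, hi

def read_partition(db_path, lo, hi):
    """One bounded query per worker: ranges do not overlap, so no row is read twice."""
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "SELECT id, body FROM articles WHERE id > ? AND id <= ?", (lo, hi))
        return cur.fetchall()

# Demo: 100 fake articles, read back in 4 "partitions".
db = tempfile.NamedTemporaryFile(suffix=".db", delete=False).name
with sqlite3.connect(db) as conn:
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO articles VALUES (?, ?)",
                     [(i, "article %d" % i) for i in range(1, 101)])

rows = []
for lo, hi in partition_bounds(0, 100, 4):   # in Spark, each chunk is a task
    rows.extend(read_partition(db, lo, hi))
```

Each chunk is independent, so in Spark every one becomes its own task running on its own executor; this sidesteps the single-threaded CollectionReader entirely.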