Date: Fri, 15 Sep 2017 09:28:50 +0200
From: Nicolas Paris
To: user@uima.apache.org
Subject: Re: UIMA analysis from a database

- Spark is simpler to learn than UIMA-AS (at least, I don't know DUCC).
- Spark is more general-purpose and can be used in other projects; for example, I have used the same design to transform PDF to text with Apache PDFBox.
- Spark can run under the YARN or Mesos job managers, on clusters of 10,000+ machines.
- Spark benefits from HDFS distributed storage.
- Spark benefits from newer optimized data formats such as Avro, a very robust, distributed binary format.
- Spark processes partitioned data and writes to disk in batches (faster than one by one).
- Spark instantiates only one UIMA pipeline per partition and passes all of the partition's text through it, with good performance.
- Spark can use Python/Java/Scala/R/Julia to preprocess texts and then send the result to UIMA.
- Spark has connectors for databases and interfaces well with Apache Sqoop, to fetch data from a relational database in parallel very easily.
- Spark has native machine-learning tooling, which can be extended with Python or R libraries.
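The "one pipeline per partition" point can be sketched in plain Python, with a generator standing in for Spark's mapPartitions call; UimaPipeline here is a hypothetical stand-in for a wrapped, initialized UIMA analysis engine, not a real class:

```python
# Sketch of the "one pipeline per partition" pattern. In Spark this is
# rdd.mapPartitions(process_partition); a plain generator stands in here
# so the idea is runnable without a cluster. UimaPipeline is a hypothetical
# stand-in for a wrapped UIMA analysis engine.

class UimaPipeline:
    """Hypothetical wrapper around an initialized UIMA analysis engine."""
    instances_created = 0

    def __init__(self):
        UimaPipeline.instances_created += 1  # expensive init happens once per partition

    def annotate(self, text):
        # Real code would fill a CAS and run the engine; here we fake it.
        return {"text": text, "tokens": text.split()}

def process_partition(rows):
    pipeline = UimaPipeline()          # one pipeline per partition, not per row
    for text in rows:
        yield pipeline.annotate(text)  # stream every row through the same instance

# Two "partitions" of documents, as Spark would hand them to workers:
partitions = [["doc one", "doc two"], ["doc three"]]
results = [ann for part in partitions for ann in process_partition(part)]
```

The point of the pattern is amortization: pipeline initialization is paid once per partition rather than once per document.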
- UIMA-AS is another way to program UIMA.
- UIMA-FIT is complicated.
- UIMA-FIT only works with UIMA.
- UIMA focuses only on text annotation.
- UIMA is not good at:
  - text transformation
  - reading data from a source in parallel
  - writing data to a folder in parallel
  - machine-learning interfaces

The only difficult part has already been addressed: making it work. You can read my messy repository to get started.

On 15 Sep 2017 at 04:28, Osborne, John D wrote:
> Hi Nicolas,
>
> I'm curious, why did you decide to use Spark over UIMA-AS or DUCC? Is it because you are more familiar with Spark, or were there other reasons?
>
> I have been using UIMA-AS, am currently experimenting with DUCC, and would love to hear your thoughts on the matter.
>
> -John
>
> ________________________________________
> From: Nicolas Paris [niparisco@gmail.com]
> Sent: Thursday, September 14, 2017 5:32 PM
> To: user@uima.apache.org
> Subject: Re: UIMA analysis from a database
>
> Hi Benedict,
>
> Not sure this is helpful for you, but here is some advice.
> I recommend using UIMA for what it is primarily intended for: NLP pipelines.
>
> For multi-threaded work, I would go with dedicated technologies.
>
> I have been successfully using UIMA together with Apache Spark. While this
> design works well on a single computer, I am now able to distribute a UIMA
> pipeline over dozens of machines with no extra effort.
>
> I focus on the UIMA pipeline doing its job well and, after testing,
> industrialize it over Spark.
>
> Advantages of this design:
> - benefit from Spark's distribution expertise (node failure, memory
>   consumption, data partitioning...)
> - simplify UIMA programming (no multithreading inside, only NLP work)
> - scale when needed (add more cheap computers, get better performance)
> - build Spark expertise, and reuse it with any Java code you'd like
> - Spark has JDBC connectors and can easily fetch data in
>   parallel.
>
> You can find a working example in my repo: https://github.com/parisni/UimaOnSpark
> It was not simple to get working, but I can now say this
> method is robust and optimized.
>
> On 14 Sep 2017 at 21:24, Benedict Holland wrote:
> > Hello everyone,
> >
> > I am trying to get my project off the ground and have hit a small problem.
> >
> > I want to read text from a large database (let's say 100,000+ rows). Each
> > row holds a text article. I want to connect to the database, request a
> > single row, and process that document through an NLP
> > engine, and I want to do this in parallel. Each document will be, say, split
> > into sentences, and each sentence will be POS-tagged.
> >
> > After reading the documentation, I am more confused than when I started. I
> > think I want something like the FileSystemCollectionReader example and to
> > build a CPE. Instead of reading from the file system, it would read from the
> > database.
> >
> > There are two problems with this approach:
> >
> > 1. I am not sure it is multi-threaded: CAS initializers are deprecated, and
> > it appears that the getNext() method will only run in a single thread.
> > 2. The FileSystemCollectionReader loads references to the file locations
> > into memory, but not the text itself.
> >
> > For problem 1, the line I find very troubling is
> >
> >     File file = (File) mFiles.get(mCurrentIndex++);
> >
> > I have to assume from this line that CollectionReader_ImplBase is not
> > multi-threaded but is intended to rapidly iterate over a set of documents
> > in a single thread.
> >
> > Problem 2 is easily solved, as I can create a massive array of integers if I
> > feel like it.
> >
> > Anyway, after deciding that this is not likely the solution, I looked into
> > multi-view Sofa annotators. I don't think these do what I want either. In
> > this context, I would treat the database table as a single object with many
> > "views" being chunks of rows. I don't think this works, based on the
> > SofaExampleAnnotator code provided. It also appears to run in a single
> > thread.
> >
> > This leaves me with CAS pools. I know that these are going to be
> > multi-threaded. I believe I create however many CAS objects from the
> > annotator I want, probably an aggregate annotator. Is this correct, and am I
> > on the right track with CAS pools?
> >
> > Thank you so much,
> > ~Ben
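PS: regarding reading the 100,000+ rows in parallel: the idea boils down to range-partitioned queries, roughly what Spark's JDBC source does with its partitionColumn / lowerBound / upperBound options. A minimal sketch, with sqlite3 standing in for the real database and the "articles" table and its columns made up for illustration:

```python
# Sketch of range-partitioned parallel reads: split the table's id range into
# chunks so each worker issues its own bounded query, with no coordination.
# sqlite3 stands in for a real relational database; "articles" is hypothetical.
import sqlite3
import tempfile

def partition_bounds(lower, upper, num_partitions):
    """Yield (lo, hi] bounds splitting the id range into num_partitions chunks."""
    step = max((upper - lower) // num_partitions, 1)
    edges = list(range(lower, upper, step)) + [upper]
    for lo, hi in zip(edges, edges[1:]):
        yield lo, hi

def read_partition(db_path, lo, hi):
    """One bounded query per worker: ranges do not overlap, so no row is read twice."""
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "SELECT id, body FROM articles WHERE id > ? AND id <= ?", (lo, hi))
        return cur.fetchall()

# Demo: 100 fake articles, read back in 4 "partitions".
db = tempfile.NamedTemporaryFile(suffix=".db", delete=False).name
with sqlite3.connect(db) as conn:
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO articles VALUES (?, ?)",
                     [(i, "article %d" % i) for i in range(1, 101)])

rows = []
for lo, hi in partition_bounds(0, 100, 4):   # in Spark, each chunk is a task
    rows.extend(read_partition(db, lo, hi))
```

Each chunk is independent, so in Spark every one becomes its own task running on its own executor; this sidesteps the single-threaded CollectionReader entirely.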