uima-user mailing list archives

From Nicolas Paris <nipari...@gmail.com>
Subject Re: UIMA analysis from a database
Date Fri, 15 Sep 2017 07:31:04 GMT
Indeed, I sometimes confused UIMA-FIT and UIMA-AS in my previous email.


On 15 Sep 2017 at 09:28, Nicolas Paris wrote:
> - Spark is simpler to learn than UIMA-AS (at least, I don't know DUCC).
> - Spark is more general and can be used in other projects; e.g. I have
>   used the same design to transform PDF to text with Apache PDFBox.
> - Spark can run under the YARN or Mesos job managers, on clusters of
>   more than 10K computers.
> - Spark benefits from HDFS, Hadoop's distributed storage.
> - Spark benefits from newer optimized data formats such as Avro, a very
>   robust, distributed binary format.
> - Spark processes partitioned data and writes to disk in batches
>   (faster than row by row).
> - Spark instantiates only one UIMA pipeline per partition and passes
>   all of the partition's text through it, with good performance.
> - Spark can use Python/Java/Scala/R/Julia for preprocessing texts and
>   then send the result to UIMA.
> - Spark has connectors for databases, and interfaces well with Apache
>   Sqoop, to fetch data from a relational database in parallel very
>   easily.
> - Spark has native machine learning tooling, and can be extended with
>   Python or R libraries.
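The one-pipeline-per-partition point is the key trick: with Spark you would typically use rdd.mapPartitions so the expensive pipeline is built once per partition rather than once per document. A minimal plain-Java sketch of that idea (the `Pipeline` class here is a hypothetical stand-in for a UIMA AnalysisEngine, not a real UIMA API):

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.*;

public class PerPartitionDemo {
    // Counts how many "pipelines" get built, to show init cost is paid
    // once per partition, not once per document.
    static final AtomicInteger pipelinesCreated = new AtomicInteger();

    // Hypothetical stand-in for an expensive UIMA pipeline.
    static class Pipeline {
        Pipeline() { pipelinesCreated.incrementAndGet(); }      // expensive init
        String annotate(String doc) { return doc.toUpperCase(); } // placeholder analysis
    }

    // Mirrors Spark's rdd.mapPartitions(iter -> ...): one Pipeline per
    // partition, reused for every row in that partition.
    static List<String> mapPartitions(List<List<String>> partitions) {
        return partitions.stream().flatMap(part -> {
            Pipeline p = new Pipeline();            // built once per partition
            return part.stream().map(p::annotate);  // all rows reuse it
        }).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> parts = Arrays.asList(
            Arrays.asList("doc a", "doc b"),
            Arrays.asList("doc c"));
        System.out.println(mapPartitions(parts));    // [DOC A, DOC B, DOC C]
        System.out.println(pipelinesCreated.get());  // 2 (one per partition)
    }
}
```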
> 
> 
> 
> 
> - UIMA-AS is another way to program UIMA.
> - UIMA-FIT is complicated.
> - UIMA-FIT only works with UIMA.
> - UIMA only focuses on text annotation.
> - UIMA is not good at:
> 	- text transformation
> 	- reading data from a source in parallel
> 	- writing data to a folder in parallel
> 	- machine learning interfaces
> 
> 
> The only difficult part has been addressed: making it work. You can
> read my messy repository to get started.
> 
> On 15 Sep 2017 at 04:28, Osborne, John D wrote:
> > Hi Nicolas,
> > 
> > I'm curious, why did you decide to use Spark over UIMA-AS or DUCC? Is it because
you are more familiar with Spark, or were there other reasons?
> > 
> > I have been using UIMA-AS; I am currently experimenting with DUCC and would love
to hear your thoughts on the matter.
> > 
> >  -John
> > 
> > 
> > ________________________________________
> > From: Nicolas Paris [niparisco@gmail.com]
> > Sent: Thursday, September 14, 2017 5:32 PM
> > To: user@uima.apache.org
> > Subject: Re: UIMA analysis from a database
> > 
> > Hi Benedict
> > 
> > Not sure this is helpful for you; it is only advice.
> > I recommend using UIMA for what it was first intended for: NLP pipelines.
> > 
> > When dealing with multi-threaded applications, I would go for dedicated
> > technologies.
> > 
> > I have been successfully using UIMA together with Apache Spark. While this
> > design works well on a single computer, I am now able to distribute a UIMA
> > pipeline over dozens of them, with no extra work.
> > 
> > I focus on the UIMA pipeline doing its job well and, after testing,
> > industrialize it over Spark.
> > 
> > Advantages of this design are:
> > - benefit from Spark's distribution expertise (node failure, memory
> >   consumption, data partitioning...)
> > - simplify UIMA programming (no multithreading inside, only NLP stuff)
> > - scale when needed (add more cheap computers, get better performance)
> > - get expertise with Spark, and use it with any Java code you'd like
> > - Spark has JDBC connectors and can fetch data in parallel
> >   easily.
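For that last point, the parallel JDBC fetch is mostly configuration: you give Spark a numeric column and a partition count, and it opens one connection per partition. A sketch of the read options (a fragment, not a full program: it assumes an existing SparkSession `spark`, a JDBC driver on the classpath, and hypothetical connection/table names):

```java
// Sketch only: `spark` is an existing SparkSession; URL/table are hypothetical.
Dataset<Row> articles = spark.read()
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/corpus") // hypothetical database
    .option("dbtable", "articles")
    .option("partitionColumn", "id")  // numeric column Spark splits ranges on
    .option("lowerBound", "1")
    .option("upperBound", "100000")
    .option("numPartitions", "8")     // 8 parallel JDBC connections
    .load();
```

Each of the 8 partitions then gets a `WHERE id >= ... AND id < ...` range, so the 100K rows are fetched concurrently rather than one by one.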
> > 
> > You can find a working example in my repo: https://github.com/parisni/UimaOnSpark
> > This has not been simple to get working, but I can now say this
> > method is robust and optimized.
> > 
> > 
> > On 14 Sep 2017 at 21:24, Benedict Holland wrote:
> > > Hello everyone,
> > >
> > > I am trying to get my project off the ground and hit a small problem.
> > >
> > > I want to read text from a large database (let's say, 100,000+ rows). Each
> > > row will have a text article. I want to connect to the database, request a
> > > single row from the database, and process this document through an NLP
> > > engine, and I want to do this in parallel. Each document will be, say, split
> > > up into sentences, and each sentence will be POS tagged.
> > >
> > > After reading the documentation, I am more confused than when I started. I
> > > think I want something like the FileSystemCollectionReader example and
> > > build a CPE. Instead of reading from the file system, it will read from the
> > > database.
> > >
> > > There are two problems with this approach:
> > >
> > > 1. I am not sure it is multi-threaded: CAS initializers are deprecated, and
> > > it appears that the getNext() method will only run in a single thread.
> > > 2. The FileSystemCollectionReader loads references to the file location
> > > into memory but not the text itself.
> > >
> > > For problem 1, the line I find very troubling is
> > >
> > > File file = (File) mFiles.get(mCurrentIndex++);
> > >
> > > I have to assume from this line that the CollectionReader_ImplBase is not
> > > multi-threaded but is intended to rapidly iterate over a set of documents
> > > in a single thread.
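Ben's reading of that line is right: `mCurrentIndex++` is a read-modify-write on a plain field, so two threads calling getNext() concurrently could claim the same document. A runnable plain-Java sketch of the safe alternative, using an AtomicInteger to hand out document indices (names here are illustrative, not UIMA APIs):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class IndexDemo {
    public static void main(String[] args) throws InterruptedException {
        // A plain `int idx; ... idx++` (like mCurrentIndex++) can lose
        // updates under concurrency; AtomicInteger makes each claim of
        // "next document index" unique and thread-safe.
        AtomicInteger next = new AtomicInteger();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    next.getAndIncrement(); // each call claims a unique index
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        System.out.println(next.get()); // 4000: no updates lost
    }
}
```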
> > >
> > > Problem 2 is easily solved, as I can create a massive array of integers if I
> > > feel like it.
> > >
> > > Anyway, after deciding that this is not likely the solution, I looked into
> > > Multi-view Sofa annotators. I don't think these do what I want either. In
> > > this context, I would treat the database table as a single object with many
> > > "views" being chunks of rows. I don't think this works, based on the
> > > SofaExampleAnnotator code provided. It also appears to run in a single
> > > thread.
> > >
> > > This leaves me with CAS pools. I know that this is going to be
> > > multi-threaded. I believe I create however many CAS objects I want from the
> > > annotator, probably an aggregate annotator. Is this correct, and am I
> > > on the right track with CAS pools?
> > >
> > > Thank you so much,
> > > ~Ben
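On the CAS pool question: the pattern is to pre-allocate a fixed number of CASes and have worker threads check one out, process a document, and return it (in UIMA this is `org.apache.uima.util.CasPool` with `getCas()`/`releaseCas()`). A plain-Java sketch of the pooling pattern itself, with `MockCas` as a hypothetical stand-in for a real CAS:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PoolDemo {
    // Minimal stand-in for a CAS; real code would use UIMA's CAS objects.
    static class MockCas {
        final int id;
        MockCas(int id) { this.id = id; }
    }

    // Fixed-size pool: pre-allocates its objects up front, like CasPool.
    static class Pool {
        private final BlockingQueue<MockCas> free;
        Pool(int size) {
            free = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) free.add(new MockCas(i));
        }
        MockCas getCas() throws InterruptedException { return free.take(); } // blocks when exhausted
        void releaseCas(MockCas cas) { free.add(cas); } // real UIMA also resets the CAS here
    }

    public static void main(String[] args) throws Exception {
        Pool pool = new Pool(2);          // 2 reusable CASes...
        int processed = 0;
        for (int doc = 0; doc < 5; doc++) { // ...shared across 5 documents
            MockCas cas = pool.getCas();
            processed++;                    // "process" the document here
            pool.releaseCas(cas);
        }
        System.out.println(processed); // 5
    }
}
```

The blocking `take()` is what gives back-pressure: when all CASes are busy, the next reader thread waits instead of allocating more memory.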
