uima-user mailing list archives

From Lou DeGenaro <lou.degen...@gmail.com>
Subject Re: UIMA analysis from a database
Date Fri, 15 Sep 2017 14:49:55 GMT
Apache DUCC is simple to install and use.  It is designed to run scaled-out
(multi-threaded) UIMA pipelines as well as supporting services (Python,
Java, what-have-you).  It distributes the work of multiple users fairly
across large or small machine clusters, including fitting multiple jobs on
the same large machine using cgroups!  Try Apache DUCC and complain if it
does not meet your needs.
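
For a flavor of what a job looks like, here is a rough sketch of a DUCC
job specification; I am writing the property names from memory of the
DuccBook, and the descriptor paths are made up, so treat the details as
assumptions and check the current documentation:

    # job.properties -- hypothetical UIMA scale-out job
    description           = POS-tag articles from a database
    driver_descriptor_CR  = descriptors/DbCollectionReader.xml
    process_descriptor_AE = descriptors/MyAggregateAE.xml
    process_thread_count  = 4
    # memory per process, in GB
    process_memory_size   = 4
    scheduling_class      = normal
    classpath             = lib/*

    # then submit it with:
    ducc_submit --specification job.properties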

There is a live demo accessible from the Apache UIMA web site
https://uima.apache.org/ (see Scaleout Frameworks).

Lou.

On Fri, Sep 15, 2017 at 3:28 AM, Nicolas Paris <niparisco@gmail.com> wrote:

> - Spark is simpler to learn than UIMA-AS (I don't know DUCC, at least).
> - Spark is more general-purpose and can be used in other projects; e.g. I
>   have used the same design to transform pdf->text with Apache PDFBox.
> - Spark can run under the YARN or Mesos resource managers, on clusters of
>   more than 10K computers.
> - Spark benefits from Hadoop HDFS distributed storage.
> - Spark benefits from new optimized data formats such as Avro, a very
>   robust, distributed binary format.
> - Spark processes partitioned data and writes to disk in batches (faster
>   than one record at a time).
> - Spark instantiates only one UIMA pipeline per partition and passes all
>   of the partition's text through it, with good performance (see the
>   sketch after this list).
> - Spark can use Python/Java/Scala/R/Julia for preprocessing texts and
>   then send the result to UIMA.
> - Spark has connectors for databases and interfaces well with Apache
>   Sqoop, to fetch data from a relational database in parallel very
>   easily.
> - Spark has native machine-learning tooling, and can be extended with
>   Python or R libraries.
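>
> A minimal sketch of the one-pipeline-per-partition pattern, using the
> Spark Java API and uimaFIT (MyPosAnnotator is a placeholder for your
> own annotator class):
>
> import java.util.ArrayList;
> import java.util.Iterator;
> import java.util.List;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.uima.analysis_engine.AnalysisEngine;
> import org.apache.uima.fit.factory.AnalysisEngineFactory;
> import org.apache.uima.jcas.JCas;
>
> public class UimaPerPartition {
>   public static JavaRDD<String> tag(JavaRDD<String> texts) {
>     return texts.mapPartitions((Iterator<String> docs) -> {
>       // One engine per partition, reused for every document in it.
>       // MyPosAnnotator is a placeholder for your annotator class.
>       AnalysisEngine engine =
>           AnalysisEngineFactory.createEngine(MyPosAnnotator.class);
>       JCas jcas = engine.newJCas();
>       List<String> out = new ArrayList<>();
>       while (docs.hasNext()) {
>         jcas.reset();                    // reuse the same CAS
>         jcas.setDocumentText(docs.next());
>         engine.process(jcas);
>         out.add(jcas.getDocumentText()); // or extract annotations here
>       }
>       return out.iterator();
>     });
>   }
> }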
>
> - UIMA-AS is another way to program UIMA.
> - UIMA-FIT is complicated.
> - UIMA-FIT only works with UIMA.
> - UIMA only focuses on text annotation.
> - UIMA is not good at:
>         - text transformation
>         - reading data from a source in parallel
>         - writing data to a folder in parallel
>         - machine-learning interfaces
>
>
> The only difficult part has been addressed: making it work. You can
> read my messy repository to begin.
>
> On 15 Sep 2017 at 04:28, Osborne, John D wrote:
> > Hi Nicolas,
> >
> > I'm curious, why did you decide to use Spark over UIMA-AS or DUCC? Is it
> > because you are more familiar with Spark, or were there other reasons?
> >
> > I have been using UIMA-AS and am currently experimenting with DUCC, and
> > would love to hear your thoughts on the matter.
> >
> >  -John
> >
> >
> > ________________________________________
> > From: Nicolas Paris [niparisco@gmail.com]
> > Sent: Thursday, September 14, 2017 5:32 PM
> > To: user@uima.apache.org
> > Subject: Re: UIMA analysis from a database
> >
> > Hi Benedict
> >
> > Not sure this is helpful for you, but here is some advice.
> > I recommend using UIMA for what it is primarily intended for: NLP
> > pipelines.
> >
> > When dealing with a multi-threaded application, I would go for dedicated
> > technologies.
> >
> > I have been successfully using UIMA together with Apache Spark. While
> > this design works well on a single computer, I am now able to distribute
> > a UIMA pipeline over dozens of them, without extra work.
> >
> > Then I focus on the UIMA pipeline doing its job well and, after testing,
> > industrialize it over Spark.
> >
> > Advantages of this design are:
> > - benefit from Spark's distribution expertise (node failure, memory
> >   consumption, data partitioning...)
> > - simplify UIMA programming (no multithreading inside, only NLP stuff)
> > - scale when needed (add more cheap computers, get better performance)
> > - get expertise with Spark, and use it with any Java code you'd like
> > - Spark has JDBC connectors and can fetch data in parallel easily (see
> >   the sketch after this list)
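> >
> > A minimal sketch of such a partitioned JDBC read with the Spark Java
> > API (the connection URL, table, and column names are made up):
> >
> > import java.util.Properties;
> > import org.apache.spark.sql.Dataset;
> > import org.apache.spark.sql.Row;
> > import org.apache.spark.sql.SparkSession;
> >
> > public class JdbcRead {
> >   public static Dataset<Row> articles(SparkSession spark) {
> >     Properties props = new Properties();
> >     props.setProperty("user", "uima");      // placeholder credentials
> >     props.setProperty("password", "secret");
> >     // Split the table into 16 concurrent reads on the numeric id column.
> >     return spark.read().jdbc(
> >         "jdbc:postgresql://db:5432/corpus", // placeholder URL
> >         "articles",                         // placeholder table
> >         "id",                               // partition column
> >         1L, 100000L,                        // id bounds for the splits
> >         16,                                 // number of partitions
> >         props);
> >   }
> > }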
> >
> > You can find a working example in my repo:
> > https://github.com/parisni/UimaOnSpark
> > It has not been simple to make this work, but I can tell you now this
> > method is robust and optimized.
> >
> >
> > On 14 Sep 2017 at 21:24, Benedict Holland wrote:
> > > Hello everyone,
> > >
> > > I am trying to get my project off the ground and hit a small problem.
> > >
> > > I want to read text from a large database (let's say, 100,000+ rows).
> > > Each row will have a text article. I want to connect to the database,
> > > request a single row from the database, and process this document
> > > through an NLP engine, and I want to do this in parallel. Each
> > > document will be, say, split up into sentences, and each sentence
> > > will be POS tagged.
> > >
> > > After reading the documentation, I am more confused than when I
> > > started. I think I want something like the FileSystemCollectionReader
> > > example and build a CPE. Instead of reading from the file system, it
> > > will read from the database.
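> > >
> > > Concretely, I am imagining something like this minimal sketch (the
> > > connection URL and table are made up):
> > >
> > > import java.io.IOException;
> > > import java.sql.Connection;
> > > import java.sql.DriverManager;
> > > import java.sql.ResultSet;
> > > import java.sql.SQLException;
> > > import org.apache.uima.cas.CAS;
> > > import org.apache.uima.collection.CollectionException;
> > > import org.apache.uima.collection.CollectionReader_ImplBase;
> > > import org.apache.uima.resource.ResourceInitializationException;
> > > import org.apache.uima.util.Progress;
> > > import org.apache.uima.util.ProgressImpl;
> > >
> > > public class DbCollectionReader extends CollectionReader_ImplBase {
> > >   private Connection conn;
> > >   private ResultSet rows;
> > >   private String nextText; // prefetched row, null when exhausted
> > >   private int count = 0;
> > >
> > >   public void initialize() throws ResourceInitializationException {
> > >     try {
> > >       // placeholder connection URL and query
> > >       conn = DriverManager.getConnection("jdbc:postgresql://db/corpus");
> > >       rows = conn.createStatement().executeQuery(
> > >           "SELECT text FROM articles");
> > >       advance();
> > >     } catch (SQLException e) {
> > >       throw new ResourceInitializationException(e);
> > >     }
> > >   }
> > >
> > >   private void advance() throws SQLException {
> > >     nextText = rows.next() ? rows.getString("text") : null;
> > >   }
> > >
> > >   public boolean hasNext() {
> > >     return nextText != null;
> > >   }
> > >
> > >   public void getNext(CAS cas) throws IOException, CollectionException {
> > >     cas.setDocumentText(nextText);
> > >     count++;
> > >     try { advance(); } catch (SQLException e) {
> > >       throw new CollectionException(e);
> > >     }
> > >   }
> > >
> > >   public Progress[] getProgress() {
> > >     return new Progress[] {
> > >         new ProgressImpl(count, -1, Progress.ENTITIES) };
> > >   }
> > >
> > >   public void close() throws IOException {
> > >     try { rows.close(); conn.close(); } catch (SQLException e) {
> > >       throw new IOException(e);
> > >     }
> > >   }
> > > }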
> > >
> > > There are two problems with this approach:
> > >
> > > 1. I am not sure it is multi-threaded: CAS initializers are
> > > deprecated, and it appears that the getNext() method will only run in
> > > a single thread.
> > > 2. The FileSystemCollectionReader loads references to the file
> > > location into memory but not the text itself.
> > >
> > > For problem 1, the line I find very troubling is
> > >
> > > File file = (File) mFiles.get(mCurrentIndex++);
> > >
> > > I have to assume from this line that the CollectionReader_ImplBase is
> > > not multi-threaded but is intended to rapidly iterate over a set of
> > > documents in a single thread.
> > >
> > > Problem 2 is easily solved, as I can create a massive array of
> > > integers if I feel like it.
> > >
> > > Anyway, after deciding that this is not likely the solution, I looked
> > > into multi-view Sofa annotators. I don't think these do what I want
> > > either. In this context, I would treat the database table as a single
> > > object with many "views" being chunks of rows. I don't think this
> > > works, based on the SofaExampleAnnotator code provided. It also
> > > appears to run in a single thread.
> > >
> > > This leaves me with CAS pools. I know that these are going to be
> > > multi-threaded. I believe I create however many CAS objects I want
> > > from the annotator, probably an aggregate annotator. Is this correct,
> > > and am I on the right track with CAS pools?
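> > >
> > > For concreteness, a minimal sketch of the pattern I have in mind,
> > > using UIMA's CasPool and a multiprocessing-capable engine (the
> > > descriptor path and document text are made up):
> > >
> > > import org.apache.uima.UIMAFramework;
> > > import org.apache.uima.analysis_engine.AnalysisEngine;
> > > import org.apache.uima.cas.CAS;
> > > import org.apache.uima.resource.ResourceSpecifier;
> > > import org.apache.uima.util.CasPool;
> > > import org.apache.uima.util.XMLInputSource;
> > >
> > > public class CasPoolDemo {
> > >   public static void main(String[] args) throws Exception {
> > >     // placeholder descriptor path
> > >     ResourceSpecifier spec = UIMAFramework.getXMLParser()
> > >         .parseResourceSpecifier(
> > >             new XMLInputSource("descriptors/MyAggregateAE.xml"));
> > >     // One engine instance serving up to 8 concurrent process() calls.
> > >     AnalysisEngine engine =
> > >         UIMAFramework.produceAnalysisEngine(spec, 8, 0);
> > >     CasPool pool = new CasPool(8, engine);
> > >     Runnable worker = () -> {
> > >       CAS cas = pool.getCas(0); // block until a CAS is free
> > >       try {
> > >         cas.setDocumentText("One row fetched from the database.");
> > >         engine.process(cas);
> > >       } catch (Exception e) {
> > >         e.printStackTrace();
> > >       } finally {
> > >         pool.releaseCas(cas);
> > >       }
> > >     };
> > >     for (int i = 0; i < 8; i++) new Thread(worker).start();
> > >   }
> > > }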
> > >
> > > Thank you so much,
> > > ~Ben
>
