uima-user mailing list archives

From Nicolas Paris <nipari...@gmail.com>
Subject Re: UIMA analysis from a database
Date Fri, 15 Sep 2017 07:28:50 GMT
- Spark is simpler to learn than UIMA-AS (I can't compare with DUCC, which I don't know).
- Spark is more general-purpose and can be used in other projects; e.g. I
  have used the same design to convert PDF to text with Apache PDFBox.
- Spark can benefit from the YARN or Mesos job managers, on more than 10K
- Spark benefits from Hadoop HDFS distributed storage
- Spark benefits from new optimized data formats such as Avro, a very
  robust, distributed binary format
- Spark processes partitioned data and writes to disk in batches (faster
  than one by one)
- Spark only instantiates one UIMA pipeline per partition and passes all
  of the partition's text through it, with good performance
- Spark can use Python, Java, Scala, R or Julia to preprocess texts and
  then send the result to UIMA
- Spark does have connectors for databases and interfaces well with Apache
  Sqoop to fetch data from relational databases in parallel, very
- Spark has native machine learning tooling, and can be extended with
  Python or R libraries.

- UIMA-AS is another way to program UIMA
- UIMA-FIT is complicated
- UIMA-FIT only works with UIMA
- UIMA only focuses on text Annotation
- UIMA is not good at:
	- text transformation
	- reading data from a source in parallel
	- writing data to a folder in parallel
	- interfacing with machine learning

The only difficult part has been addressed: making it work. You can read
my messy repository to get started.
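
A minimal sketch of the "one pipeline per partition" point from the list
above, in plain Java with no Spark or UIMA dependency. Everything named here
is an assumption for illustration: FakePipeline is a hypothetical stand-in
for an expensive-to-build UIMA pipeline, and processPartition only mimics the
shape of a function you would hand to a Spark per-partition operation. It is
not taken from the repository mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class PartitionSketch {
    // Hypothetical stand-in for an expensive-to-build UIMA pipeline.
    static final class FakePipeline {
        static final AtomicInteger CREATED = new AtomicInteger();

        FakePipeline() {
            CREATED.incrementAndGet(); // count how many pipelines get built
        }

        // Pretend "analysis": count whitespace-separated tokens.
        int process(String text) {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }
    }

    // Build the pipeline once, then push every document of the partition through it.
    static List<Integer> processPartition(List<String> partition) {
        FakePipeline pipeline = new FakePipeline();
        List<Integer> out = new ArrayList<>();
        for (String doc : partition) {
            out.add(pipeline.process(doc));
        }
        return out;
    }

    // Split documents into consecutive partitions of at most `size` elements.
    static List<List<String>> partition(List<String> docs, int size) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += size) {
            parts.add(new ArrayList<>(docs.subList(i, Math.min(i + size, docs.size()))));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("a b c", "d e", "f", "g h i j");
        for (List<String> part : partition(docs, 2)) {
            System.out.println(processPartition(part));
        }
        System.out.println("pipelines created: " + FakePipeline.CREATED.get());
    }
}
```

The point of the design is that the construction cost is paid once per
partition rather than once per document, so larger partitions amortize it
better.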

On 15 Sept. 2017 at 04:28, Osborne, John D wrote:
> Hi Nicolas,
> I'm curious, why did you decide to use Spark over UIMA-AS or DUCC? Is it
> because you are more familiar with Spark or were there other reasons?
> I have been using UIMA-AS, I am currently experimenting with DUCC and would
> love to hear your thoughts on the matter.
>  -John
> ________________________________________
> From: Nicolas Paris [niparisco@gmail.com]
> Sent: Thursday, September 14, 2017 5:32 PM
> To: user@uima.apache.org
> Subject: Re: UIMA analysis from a database
> Hi Benedict
> Not sure this is helpful for you; it is only a piece of advice.
> I recommend using UIMA for what it was first intended: NLP pipelines.
> When dealing with multi threaded application, I would go for dedicated
> technologies.
> I have been successfully using UIMA together with Apache Spark. While this
> design works well on a single computer, I am now able to distribute UIMA
> pipelines over dozens of them, with no extra work.
> So I focus on each UIMA pipeline doing its job well and, after testing,
> industrialize them over Spark.
> Advantages of this design are:
> - benefit from Spark's expertise in distribution (node failure, memory
>   consumption, data partitioning...)
> - simplify UIMA programming (no multithreading inside, only NLP stuff)
> - scale when needed (add more cheap computers, get better performance)
> - gain expertise with Spark, and use it with any Java code you'd like
> - Spark does have JDBC connectors and may be able to fetch data with
>   multiple threads easily.
> You can find a working example in my repo https://github.com/parisni/UimaOnSpark
> It has not been simple to get this working, but I can now say that this
> method is robust and optimized.
> On 14 Sept. 2017 at 21:24, Benedict Holland wrote:
> > Hello everyone,
> >
> > I am trying to get my project off the ground and hit a small problem.
> >
> > I want to read text from a large database (let's say, 100,000+ rows). Each
> > row will have a text article. I want to connect to the database, request a
> > single row from the database, and process this document through an NLP
> > engine and I want to do this in parallel. Each document will be say, split
> > up into sentences and each sentence will be POS tagged.
> >
> > After reading the documentation, I am more confused than when I started. I
> > think I want something like the FileSystemCollectionReader example and
> > build a CPE. Instead of reading from the file system, it will read from the
> > database.
> >
> > There are two problems with this approach:
> >
> > 1. I am not sure it is multi threaded: CAS initializers are deprecated and
> > it appears that the getNext() method will only run in a single thread.
> > 2. The FileSystemCollectionReader loads references to the file location
> > into memory but not the text itself.
> >
> > For problem 1, the line I find very troubling is
> >
> > File file = (File) mFiles.get(mCurrentIndex++);
> >
> > I have to assume from this line that the CollectionReader_ImplBase is not
> > multi-threaded but is intended to rapidly iterate over a set of documents
> > in a single thread.
> >
> > Problem 2 is easily solved as I can create a massive array of integers if I
> > feel like it.
> >
> > Anyway, after deciding that this is not likely the solution, I looked into
> > Multi-view Sofa annotators. I don't think these do what I want either. In
> > this context, I would treat the database table as a single object with many
> > "views" being chunks of rows. I don't think this works, based on the
> > SofaExampleAnnotator code provided. It also appears to run in a single
> > thread.
> >
> > This leaves me with CAS pools. I know that this is going to be
> > multi-threaded. I believe I create however many CAS objects from the
> > annotator I want, probably an aggregate annotator. Is this correct and am I
> > on the right track with CAS Pools?
> >
> > Thank you so much,
> > ~Ben
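
The pooling idea raised at the end of the question above can be sketched in
plain Java, assuming nothing about UIMA's actual CAS pool API. FakeCas is a
hypothetical stand-in for a reusable CAS, the "sentence splitting" is a toy
period count, and the rows are an in-memory list rather than a database
result set. A fixed number of worker threads share an equally sized pool of
reusable objects; each task borrows one, processes a row, and returns it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolSketch {
    // Hypothetical stand-in for a reusable CAS: holds one document at a time.
    static final class FakeCas {
        private String text = "";

        void setText(String t) { text = t; }

        // Toy "sentence splitting": count non-empty pieces between periods.
        int countSentences() {
            int n = 0;
            for (String s : text.split("\\.")) {
                if (!s.trim().isEmpty()) n++;
            }
            return n;
        }

        void reset() { text = ""; }
    }

    // Process all rows with `threads` workers sharing a pool of `threads` FakeCas objects.
    static List<Integer> processAll(List<String> rows, int threads) {
        BlockingQueue<FakeCas> pool = new ArrayBlockingQueue<>(threads);
        for (int i = 0; i < threads; i++) pool.add(new FakeCas());

        ExecutorService exec = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String row : rows) {
                futures.add(exec.submit(() -> {
                    FakeCas cas = pool.take();      // blocks until an instance is free
                    try {
                        cas.setText(row);
                        return cas.countSentences();
                    } finally {
                        cas.reset();                // clear state before reuse
                        pool.put(cas);              // hand the instance back
                    }
                }));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                results.add(f.get());               // collected in input order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("processing failed", e);
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> rows = List.of("One. Two.", "Only one", "A. B. C.");
        System.out.println(processAll(rows, 2));
    }
}
```

Bounding the pool to the number of workers keeps memory use predictable,
which matters when each pooled object is as heavy as a real CAS.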
