uima-user mailing list archives

From Richard Eckart de Castilho <...@apache.org>
Subject Re: Analysing archive PDFs
Date Thu, 19 Feb 2015 20:51:00 GMT
On 19.02.2015, at 21:28, Philippe de Rochambeau <phiroc@free.fr> wrote:

> Hello,
> 
> In the past few months, I have indexed tens of thousands of PDFs containing newspaper
> articles from 1887 until 1940 using SOLR for my company.
> 
> Every day, my colleagues in the Archive Department spend hours searching through the
> archives using SOLR, looking for potentially-interesting articles from a social and historical
> point of view.
> 
> Can UIMA or OpenNLP be used to automate their work and/or to analyze patterns in the
> data?

I'd say that depends quite a bit on what kind of information your colleagues search for.
UIMA itself is just a framework to support unstructured information analysis. It does not
actually analyze text - that is the job of UIMA components. There are many UIMA components
for various kinds of tasks, in particular for natural language processing tasks.

OpenNLP provides tools for basic linguistic analysis of texts, such as part-of-speech tagging,
parsing, and named entity recognition. It also provides some UIMA components. However, to use
OpenNLP effectively, you need to train models for it. Most models available for download from
the OpenNLP website give suboptimal results because they are trained only on small data sets.

If you are looking for patterns, then UIMA Ruta might help. You can implement patterns to
detect and analyze certain kinds of information, e.g. bibliographic records or information
from a CV.
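
To make the idea of rule-based pattern detection a bit more concrete, here is a minimal
sketch in plain Python. It is not Ruta syntax and does not use any UIMA API; it only mimics
what a rule does, namely mark spans of text that match a pattern. The rule itself (a day, an
English month name, and a year between 1887 and 1940), the names DATE_RULE and find_dates,
and the sample sentence are all made up for illustration.

```python
import re

# Hypothetical rule: day, English month name, year in the range 1887-1940.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
YEAR = r"18(?:8[7-9]|9\d)|19(?:[0-3]\d|40)"
DATE_RULE = re.compile(rf"\b(\d{{1,2}}) ({MONTHS}) ({YEAR})\b")


def find_dates(text):
    """Return (start, end, covered text) for every span the rule accepts."""
    return [(m.start(), m.end(), m.group(0)) for m in DATE_RULE.finditer(text)]


# Invented sample sentence; the last date is outside the archive's period.
sample = ("The strike began on 3 March 1912 and ended on 14 April 1912. "
          "A reprint appeared on 2 May 1975.")
annotations = find_dates(sample)
for start, end, covered in annotations:
    print(start, end, covered)
```

In a real UIMA pipeline, such spans would be stored as annotations in the CAS so that later
components can build on them.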

Apart from what Apache UIMA has to offer, these pointers might also be interesting to you:


Topic modelling is a trending technique for sifting through data and detecting
interesting things. There are many recent research publications on this topic.

I recently tweeted this video [1], so I might as well share it here.

A colleague of mine uses topic models to analyze historical school books [2]. As part of this,
we also built UIMA components in DKPro Core [3] to generate topic models using the Mallet
library [4].
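
To give a rough idea of what a topic model computes, here is a toy sketch of LDA with
collapsed Gibbs sampling in plain Python. It is not the Mallet or DKPro Core API, and it is
far too slow for a real archive; the function name lda_gibbs, the hyperparameter values and
the four tiny "documents" are assumptions made up for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        assignments.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Invented mini-corpus: two articles on labour, two on farming.
docs = [
    "strike workers factory wages strike".split(),
    "factory wages workers union".split(),
    "harvest wheat farmers rain harvest".split(),
    "farmers wheat rain drought".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, c in counts.most_common(3) if c > 0])
```

The per-topic word lists are what an archivist would inspect to decide whether a topic is
interesting; the per-document counts tell you which articles belong to it.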

Cheers,

-- Richard

[1] http://nycdatascience.com/news/using-machine-learning-to-aid-journalism-at-the-new-york-times/
[2] https://www.ukp.tu-darmstadt.de/research/current-projects/welt-der-kinder/
[3] https://dkpro-core-asl.googlecode.com
[4] http://mallet.cs.umass.edu


