Subject: Re: Analysing archive PDFs
From: Richard Eckart de Castilho
Date: Thu, 19 Feb 2015 21:51:00 +0100
To: user@uima.apache.org
Message-Id: <668CD46B-9782-46D4-A15C-1584799373BD@apache.org>
In-Reply-To: <61AF41D6-9964-4A41-9BD5-54FFDEE50211@free.fr>

On 19.02.2015, at 21:28, Philippe de Rochambeau wrote:

> Hello,
>
> In the past few months, I have indexed tens of thousands of PDFs
> containing newspaper articles from 1887 until 1940 using SOLR for my
> company.
>
> Every day, my colleagues in the Archive Department spend hours
> searching through the archives using SOLR, looking for
> potentially-interesting articles from a social and historical point of
> view.
>
> Can UIMA or OpenNLP be used to automate their work and/or to analyze
> patterns in the data?

I'd say that depends quite a bit on what kind of information your
colleagues search for.

UIMA itself is just a framework to support unstructured information
analysis. It does not actually analyze text - that is the job of UIMA
components. There are many UIMA components for various kinds of tasks,
in particular for natural language processing tasks.

OpenNLP provides tools for basic linguistic analysis of texts, such as
part-of-speech tagging, parsing, and named entity recognition. OpenNLP
also provides some UIMA components. However, to use OpenNLP effectively,
you need to train models for it. Most models available for download
from the OpenNLP website give suboptimal results because they are
trained only on small data sets.

If you look for patterns, then UIMA Ruta might help. You can implement
patterns to detect and analyze certain kinds of information, e.g.
bibliographic records or information from a CV.

Apart from what Apache UIMA has to offer, I think these pointers might
also be interesting to you:

Topic modelling is a trending technology for sifting through data and
detecting interesting things. There are many recent research
publications on this topic. I recently tweeted this video [1], so I
might as well share it here.

A colleague of mine uses topic models to analyze historical school
books [2]. As part of this, we also built UIMA components in DKPro Core
[3] to generate topic models using the Mallet library [4].

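To give a rough feel for the pattern idea mentioned for UIMA Ruta
above, here is a minimal sketch in plain Python regular expressions. It
is not Ruta syntax and does not use UIMA at all; the rule (years inside
the archive's 1887-1940 range), the function name find_year_mentions,
and the sample sentence are assumptions made up for illustration.

```python
import re

# Hypothetical rule: a four-digit year between 1887 and 1940, the range
# of the newspaper archive described in the question.
YEAR_PATTERN = re.compile(r"\b(18(?:8[7-9]|9\d)|19(?:[0-3]\d|40))\b")


def find_year_mentions(text):
    """Return (year, start_offset) pairs for every in-range year in text."""
    return [(int(m.group(1)), m.start()) for m in YEAR_PATTERN.finditer(text)]


# Made-up sample sentence; 1968 is outside the range and is skipped.
sample = "The strike of 1906 was recalled in 1936, long before 1968."
mentions = find_year_mentions(sample)
print(mentions)
```

A real rule set would combine many such patterns with annotations from
other components (tokens, named entities), which is the kind of thing a
rule language is meant to make easier than hand-written regular
expressions.
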

Cheers,

-- Richard

[1] http://nycdatascience.com/news/using-machine-learning-to-aid-journalism-at-the-new-york-times/
[2] https://www.ukp.tu-darmstadt.de/research/current-projects/welt-der-kinder/
[3] https://dkpro-core-asl.googlecode.com
[4] http://mallet.cs.umass.edu