To: java-user@lucene.apache.org
Subject: Re: text extraction from pdf
Comments: In-reply-to "Cam Bazz" message dated "Thu, 15 May 2008 03:23:34 -0700."
Date: Thu, 15 May 2008 08:48:30 PDT
From: Bill Janssen

> The problem I am having is that some of them have multiple columns and
> multiple word boxes. Does the xpdf patch extract different columns and
> word boxes?

It tells you where each word is. Grouping the words into columns is
something you have to do yourself.

Bill

> > In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and
> > font information for each word. You can get the xpdf sources from
> > http://www.foolabs.com/xpdf/, and the patch file is at
> > http://uplib.parc.com/misc/xpdf-3.02-PATCH. To extract the byte
> > positions, use pdftotext with the "-wordboxes" switch, and see the
> > pdftotext man page for more info. This is run automatically in UpLib
> > before the indexing is done.
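
Since the patch only reports where each word is, here is a rough sketch of one way the column grouping could be done on top of it. It is not taken from UpLib or xpdf. The WordBox tuple, its field names, and the min_gutter threshold are all hypothetical; the real "-wordboxes" output format is described in the pdftotext man page and would need its own parser. The sketch assumes y grows downward, that words on the same line share the same top coordinate, and that no word (for example a full-width title) spans the gutter between columns.

```python
from typing import List, NamedTuple


class WordBox(NamedTuple):
    """Hypothetical record for one word; NOT the actual -wordboxes format."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (assumed to grow downward)
    x1: float  # right edge
    y1: float  # bottom edge


def split_into_columns(words: List[WordBox],
                       min_gutter: float = 20.0) -> List[List[WordBox]]:
    """Group words into columns, left to right.

    Words are swept in order of their left edge. A new column starts
    whenever a word begins at least `min_gutter` units to the right of
    everything seen so far in the current column.
    """
    columns = []  # each entry is [left, right, words_in_column]
    for w in sorted(words, key=lambda w: w.x0):
        if columns and w.x0 - columns[-1][1] < min_gutter:
            col = columns[-1]
            col[1] = max(col[1], w.x1)  # extend the column's right edge
            col[2].append(w)
        else:
            columns.append([w.x0, w.x1, [w]])
    return [col[2] for col in columns]


def reading_order(words: List[WordBox],
                  min_gutter: float = 20.0) -> List[str]:
    """Return word texts column by column, top to bottom, left to right."""
    ordered = []
    for col in split_into_columns(words, min_gutter):
        # Assumes words on one line have an identical top coordinate.
        col.sort(key=lambda w: (w.y0, w.x0))
        ordered.extend(w.text for w in col)
    return ordered


if __name__ == "__main__":
    page = [
        WordBox("Delta", 320, 100, 360, 110),
        WordBox("Alpha", 50, 100, 90, 110),
        WordBox("zeta", 320, 115, 350, 125),
        WordBox("beta", 95, 100, 130, 110),
        WordBox("gamma", 50, 115, 100, 125),
        WordBox("eps", 365, 100, 390, 110),
    ]
    print(" ".join(reading_order(page)))
```

Real pages will likely need more than this: a tolerance when matching words to lines, and separate handling for headings or figures that cross the gutter.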