Date: Tue, 9 Jul 2002 16:02:11 +0100 (BST)
From: Keith Gunn
To: Lucene Users List
Subject: Re: PDF Text Stripper

On Tue, 9 Jul 2002, Ben Litchfield wrote:

> Hi,
>
> I have written a PDF library that can be used to strip text from PDF
> documents. It is released under LGPL so have fun.
>
> There is one class which can be used to easily index PDF documents.
> pdfparser.searchengine.lucene.LucenePDFDocument has a getDocument
> method which will take a PDF file and return a Lucene Document which you
> can add to an index.
>
> If you would like to see the quality of the text extraction you can run
> pdfparser.Main from the command line which will take a PDF document and
> write a txt file.
>
> I am looking for any input that you might have. Please mail me if you
> have any bugs or feature requests.
>
> The library can be retrieved from
> http://www.csh.rit.edu/~ben/projects/pdfparser/
>
> -Ben Litchfield

Hi,

I downloaded the zip and quickly ran the demo on a few files. The output
shows ".notdef" between words and a space between every letter within each
word. Is there code in your distribution that removes these, so that only
the terms remain?

Keith Gunn
University Of Aberdeen
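
To make the question concrete, here is a rough sketch of the kind of clean-up step meant. It is not code from the pdfparser distribution. The class name ExtractedTextCleaner and its clean method are hypothetical, and the sketch assumes the output looks exactly as described: ".notdef" appears only where a word break belongs, and every letter of a word is followed by a single space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of pdfparser or Lucene.
final class ExtractedTextCleaner {

    // Marker seen in the demo output; assumed here to stand for a word break.
    private static final String NOTDEF = ".notdef";

    /** Treats ".notdef" as a word break and rejoins letter-spaced words. */
    static String clean(String raw) {
        List<String> words = new ArrayList<>();
        for (String segment : raw.split(Pattern.quote(NOTDEF))) {
            String trimmed = segment.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing between two markers, or leading/trailing marker
            }
            String[] tokens = trimmed.split("\\s+");
            // Only glue tokens together when the whole segment is letter-spaced;
            // otherwise keep it as normal text with whitespace normalised.
            words.add(String.join(allSingleChars(tokens) ? "" : " ", tokens));
        }
        return String.join(" ", words);
    }

    /** True when every token is exactly one character (one code point). */
    private static boolean allSingleChars(String[] tokens) {
        for (String token : tokens) {
            if (token.codePointCount(0, token.length()) != 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String raw = "P D F .notdef T e x t .notdef S t r i p p e r";
        System.out.println(clean(raw)); // prints: PDF Text Stripper
    }
}
```

A step like this would run on the extracted text before it is handed to the Lucene analyzer. It is only a heuristic: if ".notdef" can also show up inside a word, or if the real word breaks are lost entirely, the words cannot be recovered this way and the fix would have to happen in the extraction itself.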