Subject: pdfboxpreparator and Regain
From: David Picella
To: users@pdfbox.apache.org
Date: Sun, 6 Dec 2009 07:05:32 -0800

I'm using the Regain search engine, which is powered by Lucene. It integrates with PDFBox through a special indexing preparator called PdfBoxPreparator.

Does anyone know if PdfBoxPreparator will extract data from the title, author, and keyword fields of a PDF? Also, which PDF versions are compatible? Thank you!

Here is a post I submitted to the Regain forum, but I have not heard anything back:
http://forum.murfman.de/en/viewtopic.php?f=3&t=1216

I am saving PDFs on my system that are "scanned", so there is no text available in the body. I am looking for a good way to find these, and I was thinking that I could do so by editing the title, keywords, and author fields in the PDF.

--
David
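
To make the idea in the last paragraph concrete, here is a rough sketch of the fallback I have in mind. It assumes the title, author, and keywords have already been read from the PDF (PDFBox exposes them through `PDDocumentInformation`), and it says nothing about what PdfBoxPreparator actually does. The class `ScannedPdfIndexText` and its methods are hypothetical names made up for illustration; they are not part of Regain or PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Regain or PDFBox.
final class ScannedPdfIndexText {

    /** Metadata values as they might be read from a PDF's document information. */
    static final class Meta {
        final String title;
        final String author;
        final String keywords;

        Meta(String title, String author, String keywords) {
            this.title = title;
            this.author = author;
            this.keywords = keywords;
        }
    }

    /** A scanned PDF typically yields no extractable body text. */
    static boolean looksScanned(String bodyText) {
        return bodyText == null || bodyText.trim().isEmpty();
    }

    /** Builds the text to index: non-empty metadata fields first, then any body text. */
    static String indexableText(String bodyText, Meta meta) {
        List<String> parts = new ArrayList<>();
        addIfPresent(parts, meta.title);
        addIfPresent(parts, meta.author);
        addIfPresent(parts, meta.keywords);
        addIfPresent(parts, bodyText);
        return String.join("\n", parts);
    }

    // Skips null or blank values so empty fields do not add empty lines.
    private static void addIfPresent(List<String> parts, String value) {
        if (value != null && !value.trim().isEmpty()) {
            parts.add(value.trim());
        }
    }

    public static void main(String[] args) {
        // Sample values only; a real run would take them from the PDF.
        Meta meta = new Meta("Invoice 2009", "David", "scan, invoice");
        String body = ""; // a scanned page with no text layer
        System.out.println("Scanned: " + looksScanned(body));
        System.out.println(indexableText(body, meta));
    }
}
```

With something like this, a scanned PDF would still be searchable through whatever I type into its metadata fields, provided the preparator passes those fields on to Lucene.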