From: "Borkenhagen, Michael (ofd-ko zdfin)"
To: "'Lucene Users List'"
Subject: AW: PDF parser
Date: Fri, 22 Nov 2002 15:41:52 +0100

There are several PDF parsers available, and each has its own advantages and disadvantages. I use a combination of PDFBox (http://www.pdfbox.org/) and Etymon PJ (http://www.etymon.com/pjc/) because their APIs are very simple. Both parse a PDF into a format of their own and provide interfaces for getting at the document's contents.

Other developers on this list prefer JPedal (http://www.jpedal.org/), which parses PDF into XML and provides an XML tree with the document's contents. JPedal does the job best, but its documentation isn't very detailed.

Micha

-----Original Message-----
From: Thomas Chacko [mailto:thomas@CrestecDigital.com]
Sent: Friday, 22 November 2002 15:26
To: Lucene Users List
Subject: PDF parser

What's the best parser available to extract text from PDF documents?
Expecting a reply ASAP.

Thanks in advance,
Thomas Chacko