From: "Kelvin Tan"
Subject: [OT] Extracting text from PDF via Etymon Pj
Date: Thu, 21 Mar 2002 09:30:01 +0800
Organization: Relevanz Pte Ltd
List-Id: "Lucene Users List"

I've received a couple of private mails from users asking how to extract text from PDF files using the Etymon PJ library, so I thought I'd post the code for the archives in case anyone's interested. If you still need help, just holler!

The references to cat are to a Log4j Category. You can remove them without side effects if you don't use Log4j.
    import java.util.Vector;

    // Etymon PJ packages; check the names against your PJ version.
    import com.etymon.pj.*;
    import com.etymon.pj.object.*;
    import com.etymon.pj.exception.*;

    // Returns the text of one page, or null if the page could not be read.
    private String getContent(Pdf pdf, int pageNo) {
        String content = null;
        PjStream stream = null;
        StringBuffer strbf = new StringBuffer();
        try {
            PjPage page = (PjPage) pdf.getObject(pdf.getPage(pageNo));
            PjObject pobj = (PjObject) pdf.resolve(page.getContents());
            if (pobj instanceof PjArray) {
                // The page contents are split across several streams.
                PjArray array = (PjArray) pobj;
                Vector vArray = array.getVector();
                int size = vArray.size();
                for (int j = 0; j < size; j++) {
                    stream = (PjStream) pdf.resolve((PjObject) vArray.get(j));
                    strbf.append(getStringFromPjStream(stream));
                }
                content = strbf.toString();
            } else {
                // The page contents are a single stream.
                stream = (PjStream) pobj;
                content = getStringFromPjStream(stream);
            }
        } catch (InvalidPdfObjectException pdfe) {
            cat.error("Invalid PDF Object:" + pdfe, pdfe);
        } catch (Exception e) {
            cat.error("Exception in getContent() " + e, e);
        }
        return content;
    }

    // Decompresses the stream and concatenates everything found between
    // "(" and ")" pairs, which is where the page text lives.
    private String getStringFromPjStream(PjStream stream) {
        StringBuffer strbf = new StringBuffer();
        try {
            stream = stream.flateDecompress();
            String longString = stream.toString();
            int end = 0;
            while (true) {
                int start = longString.indexOf("(", end);
                if (start == -1) {
                    break; // no more strings
                }
                end = longString.indexOf(")", start);
                if (end == -1) {
                    break; // unterminated string, stop here
                }
                strbf.append(longString.substring(start + 1, end));
            }
        } catch (InvalidPdfObjectException pdfe) {
            cat.error("InvalidObjectException:" + pdfe.getMessage(), pdfe);
        }
        return strbf.toString();
    }

Good luck!

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)
Tel: 238 6229
Fax: 337 4417
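P.S. The scanning in getStringFromPjStream treats the first ")" after a "(" as the end of each string, so text containing escaped parentheses such as \( or \), or nested balanced parentheses, gets cut short. Below is a rough sketch of a more careful scanner that can be tried without the PJ library. The class name LiteralStringScan, the method extractLiterals and the sample input are all made up for illustration, and this is only one way to do it, not something PJ provides.

```java
public class LiteralStringScan {

    /**
     * Collects the contents of every (...) literal string found in an
     * already-decompressed content stream. Hypothetical helper, not part of PJ.
     * Simplification: a backslash just keeps the next character as-is, so
     * escapes like \n or octal codes are not translated.
     */
    public static String extractLiterals(String decoded) {
        StringBuilder out = new StringBuilder();
        int n = decoded.length();
        int i = 0;
        while (i < n) {
            if (decoded.charAt(i) != '(') {
                i++;                       // not inside a string yet
                continue;
            }
            i++;                           // skip the opening paren
            int depth = 1;                 // balanced parens may nest
            while (i < n && depth > 0) {
                char c = decoded.charAt(i);
                if (c == '\\' && i + 1 < n) {
                    out.append(decoded.charAt(i + 1)); // keep escaped char
                    i += 2;
                    continue;
                }
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                    if (depth == 0) {
                        i++;               // skip the closing paren
                        break;
                    }
                }
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up content stream fragment with two Tj (show text) operators.
        String sample = "BT /F1 12 Tf (Hello ) Tj (World \\(2002\\)) Tj ET";
        System.out.println(extractLiterals(sample)); // prints: Hello World (2002)
    }
}
```

To try it with the code above, you could call extractLiterals(longString) in place of the indexOf loop. It still ignores hex strings written as <...> and any font encoding, so treat the result as a best-effort dump of the page text.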