From: "Kelvin Tan"
To: "Lucene Users List"
Subject: Re: PDF parser for Lucene
Date: Sat, 24 Nov 2001 11:48:08 +0800

Here's part of my email to Otis... with some additions at the bottom.

I was rather intrigued by Websearch's abilities and wanted to compare it with Pj's, so I ran both on a couple more PDFs, of a greater variety than I had tried before. The results were pretty disappointing. Generally, any PDF file that Websearch can process, Pj can handle too: the text is extracted except for special characters (which are replaced by a \{code}).

Whilst I had previously enjoyed relative success with Pj for extracting text from PDFs, there were many PDF files on which it simply fell flat on its face. Probing further, this is what I found:

- If the PDF is encrypted, the text generally can't be extracted (Pj has a getEncryptedDictionary() method which apparently returns the encryption dictionary if the file is encrypted).

- If an encoding other than ASCII85 or Flate is used, Pj can't handle it (I've seen LZWDecode used; I suppose this is Zip).

- And then there are other cases I haven't a clue about... :)

As a rule of thumb, if the PDF is all text (impractical of course, and defeating the entire purpose of PDF files), Pj can handle it without a glitch.

The approach of going through the PDF file and extracting all the text through some kind of Reader (brought up by Paula New Cecil) probably wouldn't be effective either. Most PDFs are FlateDecoded, meaning their streams are compressed with the Flate algorithm. You can actually read such a stream with java.util.zip.InflaterInputStream and decompress it, though.

I was bored and decided to try the files that Pj failed to handle with xpdf v0.92 (specifically pdftotext, under Windows): http://www.foolabs.com/xpdf

Same results as with Pj: encrypted files are not extracted ("Error: Copying of text from this document is not allowed."), and the other files fail with some error or other.

Does anyone have a solution for this?
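For the curious, here's roughly what I mean about the Flate streams. It's just an untested sketch (the class and method names are made up), and getting hold of the raw bytes between "stream" and "endstream" is up to you, e.g. via Pj. Also bear in mind that what comes out is still PDF page-description operators (BT/Tj/ET and friends), not plain prose:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.InflaterInputStream;

public class FlateDump {

    // Decompress the raw bytes of a single FlateDecode stream.
    public static byte[] inflate(byte[] raw) throws IOException {
        InflaterInputStream in =
                new InflaterInputStream(new ByteArrayInputStream(raw));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
        return out.toByteArray();
    }
}

That only covers FlateDecode, of course. LZWDecode is a different algorithm altogether, and java.util.zip won't help with it.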
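And since pdftotext at least works on the unencrypted files, one way to tie it back into Lucene, and keep the page-by-page searching I mention in my earlier mail below, would be something like this. Again just a sketch I haven't tested: the class name is made up, it assumes pdftotext puts a form feed between pages (it does in the versions I've tried), and the Lucene calls are the Field.Text / Field.Keyword style of the current API, which may differ from what you have:

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.StringTokenizer;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PdfPageIndexer {

    // Run pdftotext, split its output on form feeds (one per page),
    // and add each page to the index as its own Document.
    public static void index(String pdfPath, String indexDir) throws IOException {
        File txt = File.createTempFile("pdf", ".txt");
        Process p = Runtime.getRuntime().exec(
                new String[] { "pdftotext", pdfPath, txt.getAbsolutePath() });
        try {
            p.waitFor();
        } catch (InterruptedException ignored) {
        }

        // Slurp the extracted text.
        StringBuffer sb = new StringBuffer();
        Reader r = new FileReader(txt);
        char[] buf = new char[4096];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        r.close();

        // One Lucene Document per page, with the page number stored alongside.
        // (create=true rebuilds the index; for a batch you'd open the writer once.)
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        StringTokenizer pages = new StringTokenizer(sb.toString(), "\f");
        int pageNo = 1;
        while (pages.hasMoreTokens()) {
            Document doc = new Document();
            doc.add(Field.Keyword("path", pdfPath));
            doc.add(Field.Keyword("page", String.valueOf(pageNo++)));
            doc.add(Field.Text("contents", pages.nextToken()));
            writer.addDocument(doc);
        }
        writer.close();
    }
}

With the page number stored on each Document, a hit can be mapped straight back to the page it came from.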
:) Kelvin

----- Original Message -----
From: "Kelvin Tan"
To: "Lucene Users List"
Sent: Friday, November 23, 2001 6:48 PM
Subject: Re: PDF parser for Lucene

> I'm not too familiar with websearch's PDF parsing.
>
> I use a nice API, Etymon Pj: http://www.etymon.com/pj/
>
> It doesn't come with the ability to extract text, but that can be coded.
> I'll leave you to do it because it's kinda fun, but I could provide it if
> anyone wants it.
>
> I've also implemented it so that searches can be performed on a
> page-by-page basis. That's pretty cool, I think.
>
> ----- Original Message -----
> From:
> To:
> Cc:
> Sent: Friday, November 23, 2001 4:39 PM
> Subject: RE: PDF parser for Lucene
>
> > Hello,
> >
> > We have been using PDFHandler - a PDF parser provided by websearch - to
> > search in PDF files. We are trying to get the contents using
> > pdfHandler.getContents() to arrive at a context-sensitive summary.
> > However, it gives some yen signs and other special symbols in the title,
> > summary and contents. If anyone is using the websearch component to
> > parse PDF files and has encountered this problem, kindly give your
> > suggestions.
> >
> > Note - Most of the PDF files use WinAnsiEncoding, and setting the
> > encoding to Win-12xx doesn't help.
> >
> > Thanks in advance,
> >
> > Sampreet
> > Programmer
> >
> > You could try this one:
> > http://www.i2a.com/websearch/
> >
> > ...and then tell me how it works for you.
> > =:o)
> >
> > Anyway, it is simple and Open Source.
> >
> > Have fun,
> > Paulo Gaspar