lucene-java-user mailing list archives

From "Kelvin Tan" <kel...@relevanz.com>
Subject Re: PDF parser for Lucene
Date Sat, 24 Nov 2001 03:48:08 GMT
Here's part of my email to Otis...with some additions at the bottom

I was rather intrigued by Websearch's abilities and wanted to compare it
with Pj's, so I ran both on a couple more PDFs and of a greater variety than
I had prior to this. The results were pretty disappointing.

Generally any PDF file that can be processed by Websearch can also be
processed by Pj. Text is extracted, except for special characters (which are
replaced by a \{code}). Whilst I had previously enjoyed relative success with
Pj for extracting text from PDFs, there were many PDF files on which it just
fell flat on its face.

Probing further, this is what I found. If the PDF is encrypted, generally
the text can't be extracted (pj has a getEncryptedDictionary() method, which
apparently returns an encrypted dictionary if the file is encrypted). If an
encoding method other than ascii85 or flate is used, pj can't handle it (I've
seen LZWDecode used; that is LZW compression, as in GIF, not Zip). And
then there are other failures I haven't a clue about...:)
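
A rough way to predict which files will fail is to scan the raw bytes for the
relevant PDF names before handing the file to a parser. The sketch below is
only an illustration: the class name PdfPreflight and the method
findObstacles are made up for this example, the list of "unsupported" filters
is an assumption drawn from the observation above that only ascii85 and flate
work, and a plain substring search will miss names hidden inside compressed
objects or written with escapes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class PdfPreflight {

    // Assumption: any stream filter other than ASCII85Decode and FlateDecode
    // is treated as a likely extraction failure.
    private static final String[] UNSUPPORTED_FILTERS = {
        "/LZWDecode", "/ASCIIHexDecode", "/RunLengthDecode"
    };

    /** Returns a list of reasons text extraction is likely to fail (empty if none found). */
    static List<String> findObstacles(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to one char, so binary data survives the conversion.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<String> found = new ArrayList<>();
        // An encrypted PDF references an /Encrypt dictionary from its trailer.
        if (raw.contains("/Encrypt")) {
            found.add("encrypted");
        }
        for (String filter : UNSUPPORTED_FILTERS) {
            if (raw.contains(filter)) {
                found.add(filter.substring(1)); // drop the leading slash
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfPreflight <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(findObstacles(bytes));
    }
}
```

An empty list does not guarantee that extraction will succeed; it only rules
out the two causes identified above.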

As a rule of thumb, if the PDF is all text (impractical of course, and
defeating the entire purpose of PDF files), pj can handle it without a
glitch.

The method of going through the PDF file and extracting all text from it
through some kind of Reader (brought up by Paula New Cecil) probably
wouldn't be effective either. Most PDFs are FlateDecoded, which means their
streams are compressed using the Flate (zlib/deflate) algorithm. You can
read a stream in through java.util.zip.InflaterInputStream to decompress it,
though.
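
As a minimal sketch of that decompression step (not Pj's code): the class name
FlateDemo and both helper methods are invented for this example, and the
"stream" is produced in memory with DeflaterOutputStream rather than cut out
of a real PDF, so locating the stream bytes in a file is left out entirely.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class FlateDemo {

    /** Inflates zlib-wrapped deflate data, the format PDF calls FlateDecode. */
    static byte[] inflate(byte[] compressed) throws IOException {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Demo helper only: builds bytes shaped like the body of a FlateDecode stream. */
    static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A tiny stand-in for a page content stream.
        byte[] content = "BT /F1 12 Tf (Hello, Lucene) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.ISO_8859_1));
    }
}
```

Note that inflating a content stream yields PDF drawing operators, not plain
text; the visible strings still have to be picked out of the text-showing
operators and mapped through the font encoding.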

<newly-added>
I was bored and decided to try out the files that pj failed to handle, using
xpdf v0.92 instead (specifically pdftotext, under Windows).
http://www.foolabs.com/xpdf

Same results as with pj. Encrypted files are not extracted (Error: Copying
of text from this document is not allowed.)
Other files fail with some error or other.

Does anyone have a solution for this?? :)
</newly-added>

Kelvin

----- Original Message -----
From: Kelvin Tan <kelvin@relevanz.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Friday, November 23, 2001 6:48 PM
Subject: Re: PDF parser for Lucene


> I'm not too familiar with websearch's PDF parsing.
>
> I use a nice API Etymon Pj http://www.etymon.com/pj/
>
> It doesn't come with the ability to extract text, but it can be coded. I'll
> leave you to do it because it's kinda fun, but I could provide it if anyone
> wants it.
>
> I've also implemented it so that the searches can be performed on a
> page-by-page basis. That's pretty cool, I think.
>
> ----- Original Message -----
> From: <sampreet@interactive1.com>
> To: <lucene-user@jakarta.apache.org>
> Cc: <bkopic@interactive1.hr>
> Sent: Friday, November 23, 2001 4:39 PM
> Subject: RE: PDF parser for Lucene
>
>
> > Hello,
> >
> > We have been using PDFHandler - a pdf parser provided by websearch, to
> > search in pdf files. We are trying to get the contents using
> > pdfHandler.getContents() to arrive at a context-sensitive summary. However,
> > it gives some yen signs and other special symbols in the title, summary and
> > contents. If anyone is using the websearch component to parse pdf files and
> > has encountered this problem, kindly give your suggestions.
> >
> > Note - Most of the pdf files are using WinAnsiEncoding, and setting the
> > encoding as Win-12xx doesn't help.
> >
> > Thanks in advance,
> >
> > Sampreet
> > Programmer
> >
> >
> > You could try this one:
> > http://www.i2a.com/websearch/
> >
> > ...and then tell me how it works for you.
> > =:o)
> >
> >
> > Anyway, it is simple and Open Source.
> >
> >
> > Have fun,
> > Paulo Gaspar
> >
> >
> > --
> > To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
> >
> >
>
>

