lucene-java-user mailing list archives

From "Cecil, Paula New" <c...@fuse.net>
Subject Re: PDF parser for Lucene
Date Sat, 24 Nov 2001 22:34:59 GMT
Regarding PDFs, Kelvin is correct.  My reader class completely failed.  I
brought up the PDF in Textpad to take a look...  nothing there in "readable"
form.

My tests with typical office documents seemed to work ok as did some other
selections of non-text files.  Turbo-tax failed - I'm sure they encrypt,
which makes perfect sense.  A search also failed on PowerPoint's "word art"
text.



----- Original Message -----
From: Kelvin Tan <kelvin@relevanz.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Friday, November 23, 2001 7:48 PM
Subject: Re: PDF parser for Lucene


> Here's part of my email to Otis...with some additions at the bottom
>
> I was rather intrigued by Websearch's abilities and wanted to compare it
> with Pj's, so I ran both on more PDFs, and on a greater variety of them
> than I had before. The results were pretty disappointing.
>
> Generally any PDF file that can be processed by Websearch can also be
> processed by Pj. Text is extracted, except for special characters (which are
> replaced by a \{code}). Whilst I had previously enjoyed relative success with
> Pj for extracting text from PDFs, there were many PDF files on which it just
> fell flat on its face.
>
> Probing further, this is what I found. If the PDF is encrypted, generally
> the text can't be extracted (pj has a method where you call
> getEncryptedDictionary() which apparently returns an encrypted dictionary
> if it is encrypted). If an encoding method other than ASCII85 or Flate is
> used, pj can't handle it (I've seen LZWDecode used; that is LZW compression,
> as found in GIF, rather than the deflate method used by Zip). And then there
> are other instances of which I haven't a clue...:)
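
[A rough way to sort PDFs into "probably extractable" and "probably not" before handing them to a parser is to scan the raw bytes for the name tokens mentioned above. The sketch below is only an illustration in plain Java: it does not use pj, the class name PdfTriage is made up, and a plain substring scan can miss tokens (for example when they sit inside compressed objects) or report ones that are never actually used.]

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of pj or websearch.
class PdfTriage {
    // PDF name tokens for the encryption dictionary and the stream filters discussed above.
    private static final String[] MARKERS = {
        "/Encrypt", "/FlateDecode", "/LZWDecode", "/ASCII85Decode"
    };

    /** Reports, for each marker, whether it occurs anywhere in the raw bytes. */
    static Map<String, Boolean> scan(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives the conversion.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Boolean> found = new LinkedHashMap<>();
        for (String marker : MARKERS) {
            found.put(marker, raw.contains(marker));
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PdfTriage some-file.pdf
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        scan(bytes).forEach((marker, present) -> System.out.println(marker + ": " + present));
    }
}
```

[A file that reports /Encrypt or /LZWDecode would, going by the findings above, be a likely candidate for failure.]
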
>
> As a rule of thumb, if the PDF is all text (impractical of course, and
> defeating the entire purpose of PDF files), pj can handle it without a
> glitch.
>
> The method of going through the PDF file and extracting all text from it
> through some kind of Reader (brought up by Paula New Cecil) probably
> wouldn't be effective either. Most PDFs use FlateDecode, which means their
> streams are compressed using the Flate algorithm. You can read such a stream
> in through java.util.zip.InflaterInputStream to decompress it, though.
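
[To make the decompression step concrete, here is a minimal sketch using the JDK class java.util.zip.InflaterInputStream. It assumes the compressed bytes of a single stream have already been cut out of the PDF, which this sketch does not do; the class name FlateSketch and the sample content are invented, and the stream is faked by compressing a short string first.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class; locating stream boundaries inside a real PDF is not shown.
class FlateSketch {
    /** Compresses bytes in zlib format, standing in for the body of a FlateDecode stream. */
    static byte[] deflate(byte[] plain) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new DeflaterOutputStream(sink)) {
            out.write(plain);
        }
        return sink.toByteArray();
    }

    /** Decompresses zlib-format bytes by reading them through InflaterInputStream. */
    static byte[] inflate(byte[] packed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(packed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                sink.write(buffer, 0, n);
            }
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Invented sample in the style of a PDF content stream.
        byte[] content = "BT /F1 12 Tf (Hello Lucene) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] packed = deflate(content);   // unreadable in a text editor, like the Textpad view
        System.out.println(new String(inflate(packed), StandardCharsets.ISO_8859_1));
    }
}
```

[Even after inflating, the result is a sequence of drawing operators with the text embedded in it, so a further parsing step is still needed before the text can be indexed.]
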
>
> <newly-added>
> I was bored and decided to try out the files that pj failed to handle, using
> xpdf v0.92 instead (specifically pdftotext, under Windows).
> http://www.foolabs.com/xpdf
>
> Same results as with pj. Encrypted files are not extracted (Error: Copying
> of text from this document is not allowed.)
> Other files fail with some error or other.
>
> Does anyone have a solution for this?? :)
> </newly-added>
>
> Kelvin
>
> ----- Original Message -----
> From: Kelvin Tan <kelvin@relevanz.com>
> To: Lucene Users List <lucene-user@jakarta.apache.org>
> Sent: Friday, November 23, 2001 6:48 PM
> Subject: Re: PDF parser for Lucene
>
>
> > I'm not too familiar with websearch's PDF parsing.
> >
> > I use a nice API, Etymon Pj: http://www.etymon.com/pj/
> >
> > It doesn't come with the ability to extract text, but it can be coded.
> > I'll leave you to do it because it's kinda fun, but I could provide it if
> > anyone wants it.
> >
> > I've also implemented it so that the searches can be performed on a
> > page-by-page basis. That's pretty cool, I think.
> >
> > ----- Original Message -----
> > From: <sampreet@interactive1.com>
> > To: <lucene-user@jakarta.apache.org>
> > Cc: <bkopic@interactive1.hr>
> > Sent: Friday, November 23, 2001 4:39 PM
> > Subject: RE: PDF parser for Lucene
> >
> >
> > > Hello,
> > >
> > > We have been using PDFHandler, a PDF parser provided by websearch, to
> > > search in PDF files. We are trying to get the contents using
> > > pdfHandler.getContents() to arrive at a context-sensitive summary. However,
> > > it gives some yen signs and other special symbols in the title, summary and
> > > contents. If anyone is using the websearch component to parse PDF files and
> > > has encountered this problem, kindly give your suggestions.
> > >
> > > Note - Most of the PDF files are using WinAnsiEncoding, and setting the
> > > encoding as Win-12xx doesn't help.
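
[For readers unsure what a charset mismatch looks like, the sketch below decodes the same bytes once as windows-1252 and once as ISO-8859-1. It is only an illustration in plain Java, with an invented class name and sample bytes. Whether a mismatch of this kind is behind the yen signs reported above is not known from this thread; PDF fonts can also define their own encodings, which no charset setting would fix.]

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, unrelated to the websearch PDFHandler API.
class CharsetMismatch {
    /** Decodes raw bytes under the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** Formats each char of a string as a U+XXXX code point label. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly quotes in windows-1252 but control characters in ISO-8859-1.
        byte[] quoted = { (byte) 0x93, (byte) 'H', (byte) 'i', (byte) 0x94 };
        System.out.println(codePoints(decode(quoted, Charset.forName("windows-1252"))));
        System.out.println(codePoints(decode(quoted, StandardCharsets.ISO_8859_1)));
    }
}
```
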
> > >
> > > Thanks in advance,
> > >
> > > Sampreet
> > > Programmer
> > >
> > >
> > > You could try this one:
> > > http://www.i2a.com/websearch/
> > >
> > > ...and then tell me how it works for you.
> > > =:o)
> > >
> > >
> > > Anyway, it is simple and Open Source.
> > >
> > >
> > > Have fun,
> > > Paulo Gaspar
> > >
> > >
> > > --
> > > To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > > For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
> > >
> > >




