lucene-java-user mailing list archives

From "Kelvin Tan" <kel...@relevanz.com>
Subject Re: PDF parser for Lucene
Date Fri, 23 Nov 2001 10:48:49 GMT
I'm not too familiar with websearch's PDF parsing.

I use a nice API, Etymon PJ: http://www.etymon.com/pj/

It doesn't come with the ability to extract text, but that can be coded on
top of it. I'll leave you to do it because it's kinda fun, but I could
provide my code if anyone wants it.

I've also implemented it so that searches can be performed on a
page-by-page basis. That's pretty cool, I think.
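
A minimal sketch of the page-by-page idea, for illustration only. It is not the code mentioned above and it uses neither the Lucene nor the Etymon PJ API. It assumes the text of each page has already been extracted, and it indexes every page as its own searchable unit so that a hit points at a specific page rather than at the whole file. The class name PageIndex and the "file#page" key format are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index that treats every PDF page as its own searchable unit. */
class PageIndex {
    // term -> sorted set of "file#page" keys (hypothetical key format)
    private final Map<String, TreeSet<String>> postings = new HashMap<>();

    /** Index the already-extracted text of one page. Page numbers are 1-based. */
    void addPage(String file, int page, String text) {
        String key = file + "#" + page;
        // Lower-case the text and split on anything that is not a letter or digit.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(key);
            }
        }
    }

    /** Return the "file#page" keys of every page that contains the term. */
    List<String> search(String term) {
        TreeSet<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        PageIndex index = new PageIndex();
        index.addPage("manual.pdf", 1, "Installing the parser");
        index.addPage("manual.pdf", 2, "Indexing PDF text with Lucene");
        index.addPage("notes.pdf", 1, "Lucene query syntax");
        System.out.println(index.search("lucene")); // prints [manual.pdf#2, notes.pdf#1]
    }
}
```

With a real Lucene index the same effect would presumably come from adding one document per page and storing the file name and page number as fields, so each hit carries its page number.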

----- Original Message -----
From: <sampreet@interactive1.com>
To: <lucene-user@jakarta.apache.org>
Cc: <bkopic@interactive1.hr>
Sent: Friday, November 23, 2001 4:39 PM
Subject: RE: PDF parser for Lucene


> Hello,
>
> We have been using PDFHandler, a PDF parser provided by websearch, to
> search in PDF files. We are trying to get the contents using
> pdfHandler.getContents() to arrive at a context-sensitive summary.
> However, it gives some yen signs and other special symbols in the title,
> summary and contents. If anyone is using the websearch component to parse
> PDF files and has encountered this problem, kindly give your suggestions.
>
> Note - Most of the pdf files are using WinAnsiEncoding, and setting the
> encoding as Win-12xx doesn't help.
>
> Thanks in advance,
>
> Sampreet
> Programmer
>
>
> You could try this one:
> http://www.i2a.com/websearch/
>
> ...and then tell me how it works for you.
> =:o)
>
>
> Anyway, it is simple and Open Source.
>
>
> Have fun,
> Paulo Gaspar
>
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
>
>



