lucene-java-user mailing list archives

From "Cecil, Paula New" <c...@fuse.net>
Subject Re: PDF parser for Lucene
Date Fri, 23 Nov 2001 16:36:58 GMT
Inspired by the Unix "strings" command, I have written a subclass of
FilterReader, which I have called BinaryReader.  The idea is simply to index
any proprietary file format by filtering out all non-printable characters.
The assumption is that text is text.  The result will contain more than the
"visible" text, but not less.  Once I have tested it and made some examples, I
will post it here.
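
The actual BinaryReader has not been posted in this thread, so the following is only a minimal sketch of what such a filter might look like. It assumes "printable" means ASCII 0x20 through 0x7E, and it assumes that each run of non-printable characters should collapse to a single space so that neighbouring text fragments do not fuse into one token. Both choices, and the isPrintable helper, are illustrative assumptions rather than the author's design.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical sketch of a "strings"-style filter; not the author's posted code.
class BinaryReader extends FilterReader {
    // True when the last character emitted was a separator (or nothing has
    // been emitted yet), so that runs of junk yield at most one space.
    private boolean lastWasSpace = true;

    BinaryReader(Reader in) {
        super(in);
    }

    // Assumption: only printable ASCII counts as text.
    static boolean isPrintable(int c) {
        return c >= 0x20 && c <= 0x7E;
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (isPrintable(c) && c != ' ') {
                lastWasSpace = false;
                return c;
            }
            // Space or non-printable: emit one separator per run, skip the rest.
            if (!lastWasSpace) {
                lastWasSpace = true;
                return ' ';
            }
        }
        return -1; // end of stream
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }
}
```

To use it, one would wrap a Reader opened on the binary file and hand the result to whichever Lucene field constructor accepts a Reader. Opening the file with a single-byte charset such as ISO-8859-1 is probably the safer choice, since it maps every byte to exactly one character.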



----- Original Message -----
From: Kelvin Tan <kelvin@relevanz.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Friday, November 23, 2001 2:48 AM
Subject: Re: PDF parser for Lucene


> I'm not too familiar with websearch's PDF parsing.
>
> I use a nice API, Etymon Pj: http://www.etymon.com/pj/
>
> It doesn't come with the ability to extract text, but it can be coded.
> I'll leave you to do it because it's kinda fun, but I could provide it
> if anyone wants it.
>
> I've also implemented it so that the searches can be performed on a
> page-by-page basis. That's pretty cool, I think.
>
> ----- Original Message -----
> From: <sampreet@interactive1.com>
> To: <lucene-user@jakarta.apache.org>
> Cc: <bkopic@interactive1.hr>
> Sent: Friday, November 23, 2001 4:39 PM
> Subject: RE: PDF parser for Lucene
>
>
> > Hello,
> >
> > We have been using PDFHandler - a pdf parser provided by websearch, to
> > search in pdf files. We are trying to get the contents using
> > pdfHandler.getContents() to arrive at a context-sensitive summary.
> > However, it gives some yen signs and other special symbols in the
> > title, summary and contents. If anyone is using the websearch
> > component to parse pdf files and has encountered this problem, kindly
> > give your suggestions.
> >
> > Note - Most of the pdf files are using WinAnsiEncoding, and setting the
> > encoding as Win-12xx doesn't help.
> >
> > Thanks in advance,
> >
> > Sampreet
> > Programmer
> >
> >
> > You could try this one:
> > http://www.i2a.com/websearch/
> >
> > ...and then tell me how it works for you.
> > =:o)
> >
> >
> > Anyway, it is simple and Open Source.
> >
> >
> > Have fun,
> > Paulo Gaspar
> >
> >
> > --
> > To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
> >
> >
>
>




