lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeff Trent" <jtr...@structsoft.com>
Subject Re: PDF parser for Lucene
Date Sun, 25 Nov 2001 20:12:59 GMT
A Strings-like interpretation will bound to fail on PDFs since often times
PDF files have positioning information between every darn letter of a word.
The developers of PDF were apparently striving for ultimate flexibility for
formatting and positioning but performance (and larger files sizes) are the
obvious down side.


----- Original Message -----
From: "Cecil, Paula New" <cnew@fuse.net>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Saturday, November 24, 2001 5:46 PM
Subject: Re: PDF parser for Lucene


> It had occurred to me also that a useful enhancement might be to filter
out
> nonsense tokens.
>
> Certainly you are way ahead of me on this.  You're welcome to put your
logic
> in the class (or send it to me and I will take a stab).  My "BinaryReader"
> also squeezes out sequences of whitespace, but will replace a binary
> character with a whitespace under certain conditions.
>
> I found a lot of single letters in the MS Office files.  Which I think the
> analyzer will get rid of (???  I'm still pretty new to Lucene).
>
> At any rate here is my BinaryReader.  Improvements welcomed!
> /* usage example
>       FileReader fr = new FileReader(args[0]);
>       BufferedReader br = new BufferedReader(fr);
>       BinaryReader binr = new BinaryReader(br);
>       org.apache.lucene.document.Document doc =
>         new org.apache.lucene.document.Document();
>       doc.add(Field.UnIndexed("filename", args[0]));
>       doc.add(Field.Text("body",binr));
>       writer.addDocument(doc);
> */
>
> import java.util.*;
> import java.io.*;
>
> public class BinaryReader
>   extends java.io.FilterReader
> {
>   // private vars
>   private char prevchar = '\r';
>   private char thischar;
>
>   public BinaryReader(Reader in) {
>     super(in);
>   }
>
>   public int read()
>     throws IOException
>   {
>
>     int c = in.read();
>     if ( c != -1 ) {
>       if ( bintest((char)c) ) {
>         return thischar;
>       }
>     }
>     return c;
>   }
>
>   private boolean bintest(char c) {
>     if ( (c >= '!') && (c <= 'z') ) {
>       thischar = c;
>       prevchar = c;
>       return true;
>     } else if ( (c == '\n') ||
>                 (c == '\t') ||
>                 (c == '\r') )
>     {
>       thischar = c;
>       prevchar = ' ';
>       return true;
>     } else if ( prevchar != ' ' ) {
>       thischar = ' ';
>       prevchar = ' ';
>       return true;
>     }
>     return false;
>   }
>
>   public int read(char[] cbuf)
>     throws IOException
>   {
>     return read(cbuf, 0, cbuf.length);
>   }
>
>   public int read(char[] cbuf, int off, int len)
>     throws IOException
>   {
>     char[] cb = new char[len];
>     int cnt = in.read(cb);
>
>     if ( cnt == -1 ) return cnt; // done
>     int num = 0;
>     int loc = off;
>     for ( int i=0; i < cnt; i++ ) {
>       if ( bintest(cb[i]) ) {
>         cbuf[loc++] = thischar;
>         num++;
>       }
>     }
>     return num;
>   }
> }
>
> ----- Original Message -----
> From: Kelvin Tan <kelvin@relevanz.com>
> To: Lucene Users List <lucene-user@jakarta.apache.org>
> Sent: Friday, November 23, 2001 4:24 PM
> Subject: Re: PDF parser for Lucene
>
>
> > That's pretty interesting because I've done something similar, but I
> > (vainly) try to build some intelligence into it.
> >
> > First I run the text through a regex which filters out most of the words
> > which contain nonsense characters (^A-Za-z0-9_@).
> >
> > Then I run it through a couple of tests, which try to guess if this is
an
> > English word, like testing for 3 consonants in a row, and if there are 2
> > numbers which are not next to each other (can you think of any?!?!).
Based
> > on these, I rank the documents. Finally, if the word is less than 5
> > characters, I run it through a wordlist.
> >
> > The results are <emphasis>acceptable</emphasis>. The difficulty is that
> when
> > faced with large documents (>5MB), you end up with alot of nonsense term
s,
> > actually exceeding Lucene's inbuilt limit of 10000 terms (to limit
memory
> > usage). The result is that these documents are not indexed completely.
> >
> > I'd be interested to see how you filter your documents...:)
> >
> > ----- Original Message -----
> > From: Cecil, Paula New <cnew@fuse.net>
> > To: Lucene Users List <lucene-user@jakarta.apache.org>; Kelvin Tan
> > <kelvin@relevanz.com>
> > Sent: Saturday, November 24, 2001 12:36 AM
> > Subject: Re: PDF parser for Lucene
> >
> >
> > > Inspired by the Unix "strings" command, I have written a subclass of
> > > FilterReader; which I have called BinaryReader.  The idea is simply to
> > index
> > > any proprietary file format by filtering out all non-printable
> characters.
> > > The assumption is that text is text.  It will end up with more than
the
> > > "visible" text, but not less.  After I have tested and made some
> examples
> > I
> > > will post it here.
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: Kelvin Tan <kelvin@relevanz.com>
> > > To: Lucene Users List <lucene-user@jakarta.apache.org>
> > > Sent: Friday, November 23, 2001 2:48 AM
> > > Subject: Re: PDF parser for Lucene
> > >
> > >
> > > > I'm not too familiar with websearch's PDF parsing.
> > > >
> > > > I use a nice API Etymon Pj http://www.etymon.com/pj/
> > > >
> > > > It doesn't come with the ability to extract text, but it can be
coded.
> > > I'll
> > > > leave you to do it because it's kinda fun, but I could provide it if
> > > anyone
> > > > wants it.
> > > >
> > > > I've also implemented it so that the searches can be performed on a
> > > > page-by-page basis. That's pretty cool, i think.
> > > >
> > > > ----- Original Message -----
> > > > From: <sampreet@interactive1.com>
> > > > To: <lucene-user@jakarta.apache.org>
> > > > Cc: <bkopic@interactive1.hr>
> > > > Sent: Friday, November 23, 2001 4:39 PM
> > > > Subject: RE: PDF parser for Lucene
> > > >
> > > >
> > > > > Hello,
> > > > >
> > > > > We have been using PDFHandler - a pdf parser provided by
websearch,
> to
> > > > > search in pdf files. We are trying to get the contents using
> > > > > pdfHandler.getContents() to arrive at a context-sensitive summary.
> > > > However,
> > > > > it gives some yen signs and other special symbols in the title,
> > summary
> > > > and
> > > > > contents. If anyone is using the websearch component to parse pdf
> > files
> > > > and
> > > > > have encountered this problem, kindly give your suggestions.
> > > > >
> > > > > Note - Most of the pdf files are using WinAnsiEncoding, and
setting
> > the
> > > > > encoding as Win-12xx doesn't help.
> > > > >
> > > > > Thanks in advance,
> > > > >
> > > > > Sampreet
> > > > > Programmer
> > > > >
> > > > >
> > > > > You could try this one:
> > > > > http://www.i2a.com/websearch/
> > > > >
> > > > > ...and then tell me how it works for you.
> > > > > =:o)
> > > > >
> > > > >
> > > > > Anyway, it is simple and Open Source.
> > > > >
> > > > >
> > > > > Have fun,
> > > > > Paulo Gaspar
> > > > >
> > > > >
> > > > > --
> > > > > To unsubscribe, e-mail:
> > > > <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > > > > For additional commands, e-mail:
> > > > <mailto:lucene-user-help@jakarta.apache.org>
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > To unsubscribe, e-mail:
> > > <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > > > For additional commands, e-mail:
> > > <mailto:lucene-user-help@jakarta.apache.org>
> > > >
> > >
> > >
> > >
> > > --
> > > To unsubscribe, e-mail:
> > <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > > For additional commands, e-mail:
> > <mailto:lucene-user-help@jakarta.apache.org>
> > >
> > >
> >
> >
> > --
> > To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
> >
> >
>
>
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>
>
>


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message