lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cecil, Paula New" <c...@fuse.net>
Subject Re: PDF parser for Lucene
Date Sat, 24 Nov 2001 22:46:39 GMT
It had occurred to me also that a useful enhancement might be to filter out
nonsense tokens.

Certainly you are way ahead of me on this.  You're welcome to put your logic
in the class (or send it to me and I will take a stab).  My "BinaryReader"
also squeezes out sequences of whitespace, but will replace a binary
character with a whitespace under certain conditions.

I found a lot of single letters in the MS Office files.  Which I think the
analyzer will get rid of (???  I'm still pretty new to Lucene).

At any rate here is my BinaryReader.  Improvements welcomed!
/* usage example
      FileReader fr = new FileReader(args[0]);
      BufferedReader br = new BufferedReader(fr);
      BinaryReader binr = new BinaryReader(br);
      org.apache.lucene.document.Document doc =
        new org.apache.lucene.document.Document();
      doc.add(Field.UnIndexed("filename", args[0]));
      doc.add(Field.Text("body",binr));
      writer.addDocument(doc);
*/

import java.util.*;
import java.io.*;

public class BinaryReader
  extends java.io.FilterReader
{
  // private vars
  private char prevchar = '\r';
  private char thischar;

  public BinaryReader(Reader in) {
    super(in);
  }

  public int read()
    throws IOException
  {

    int c = in.read();
    if ( c != -1 ) {
      if ( bintest((char)c) ) {
        return thischar;
      }
    }
    return c;
  }

  private boolean bintest(char c) {
    if ( (c >= '!') && (c <= 'z') ) {
      thischar = c;
      prevchar = c;
      return true;
    } else if ( (c == '\n') ||
                (c == '\t') ||
                (c == '\r') )
    {
      thischar = c;
      prevchar = ' ';
      return true;
    } else if ( prevchar != ' ' ) {
      thischar = ' ';
      prevchar = ' ';
      return true;
    }
    return false;
  }

  public int read(char[] cbuf)
    throws IOException
  {
    return read(cbuf, 0, cbuf.length);
  }

  public int read(char[] cbuf, int off, int len)
    throws IOException
  {
    char[] cb = new char[len];
    int cnt = in.read(cb);

    if ( cnt == -1 ) return cnt; // done
    int num = 0;
    int loc = off;
    for ( int i=0; i < cnt; i++ ) {
      if ( bintest(cb[i]) ) {
        cbuf[loc++] = thischar;
        num++;
      }
    }
    return num;
  }
}

----- Original Message -----
From: Kelvin Tan <kelvin@relevanz.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Friday, November 23, 2001 4:24 PM
Subject: Re: PDF parser for Lucene


> That's pretty interesting because I've done something similar, but I
> (vainly) try to build some intelligence into it.
>
> First I run the text through a regex which filters out most of the words
> which contain nonsense characters (^A-Za-z0-9_@).
>
> Then I run it through a couple of tests, which try to guess if this is an
> English word, like testing for 3 consonants in a row, and if there are 2
> numbers which are not next to each other (can you think of any?!?!). Based
> on these, I rank the documents. Finally, if the word is less than 5
> characters, I run it through a wordlist.
>
> The results are <emphasis>acceptable</emphasis>. The difficulty is that
when
> faced with large documents (>5MB), you end up with alot of nonsense terms,
> actually exceeding Lucene's inbuilt limit of 10000 terms (to limit memory
> usage). The result is that these documents are not indexed completely.
>
> I'd be interested to see how you filter your documents...:)
>
> ----- Original Message -----
> From: Cecil, Paula New <cnew@fuse.net>
> To: Lucene Users List <lucene-user@jakarta.apache.org>; Kelvin Tan
> <kelvin@relevanz.com>
> Sent: Saturday, November 24, 2001 12:36 AM
> Subject: Re: PDF parser for Lucene
>
>
> > Inspired by the Unix "strings" command, I have written a subclass of
> > FilterReader; which I have called BinaryReader.  The idea is simply to
> index
> > any proprietary file format by filtering out all non-printable
characters.
> > The assumption is that text is text.  It will end up with more than the
> > "visible" text, but not less.  After I have tested and made some
examples
> I
> > will post it here.
> >
> >
> >
> > ----- Original Message -----
> > From: Kelvin Tan <kelvin@relevanz.com>
> > To: Lucene Users List <lucene-user@jakarta.apache.org>
> > Sent: Friday, November 23, 2001 2:48 AM
> > Subject: Re: PDF parser for Lucene
> >
> >
> > > I'm not too familiar with websearch's PDF parsing.
> > >
> > > I use a nice API Etymon Pj http://www.etymon.com/pj/
> > >
> > > It doesn't come with the ability to extract text, but it can be coded.
> > I'll
> > > leave you to do it because it's kinda fun, but I could provide it if
> > anyone
> > > wants it.
> > >
> > > I've also implemented it so that the searches can be performed on a
> > > page-by-page basis. That's pretty cool, i think.
> > >
> > > ----- Original Message -----
> > > From: <sampreet@interactive1.com>
> > > To: <lucene-user@jakarta.apache.org>
> > > Cc: <bkopic@interactive1.hr>
> > > Sent: Friday, November 23, 2001 4:39 PM
> > > Subject: RE: PDF parser for Lucene
> > >
> > >
> > > > Hello,
> > > >
> > > > We have been using PDFHandler - a pdf parser provided by websearch,
to
> > > > search in pdf files. We are trying to get the contents using
> > > > pdfHandler.getContents() to arrive at a context-sensitive summary.
> > > However,
> > > > it gives some yen signs and other special symbols in the title,
> summary
> > > and
> > > > contents. If anyone is using the websearch component to parse pdf
> files
> > > and
> > > > have encountered this problem, kindly give your suggestions.
> > > >
> > > > Note - Most of the pdf files are using WinAnsiEncoding, and setting
> the
> > > > encoding as Win-12xx doesn't help.
> > > >
> > > > Thanks in advance,
> > > >
> > > > Sampreet
> > > > Programmer
> > > >
> > > >
> > > > You could try this one:
> > > > http://www.i2a.com/websearch/
> > > >
> > > > ...and then tell me how it works for you.
> > > > =:o)
> > > >
> > > >
> > > > Anyway, it is simple and Open Source.
> > > >
> > > >
> > > > Have fun,
> > > > Paulo Gaspar
> > > >
> > > >
> > > > --
> > > > To unsubscribe, e-mail:
> > > <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > > > For additional commands, e-mail:
> > > <mailto:lucene-user-help@jakarta.apache.org>
> > > >
> > > >
> > >
> > >
> > > --
> > > To unsubscribe, e-mail:
> > <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > > For additional commands, e-mail:
> > <mailto:lucene-user-help@jakarta.apache.org>
> > >
> >
> >
> >
> > --
> > To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
> >
> >
>
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>
>
>



--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message