lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Damien Lust <dl...@denali.be>
Subject Re: Question on Lucene when indexing big pdf files
Date Wed, 20 Aug 2003 07:28:58 GMT
I don't know if it can help you but here you are my code to extract 
code of pdf doc:

   /**
    * Extracts text from a pdf document
    *
    * @param in The InputStream representing the pdf file.
    * @return The text in the file
    */
   public String extractText(InputStream in)
   {
     String s = null;
     try
     {
       PDFTextStripper _stripper = new PDFTextStripper();
       PDFParser parser = new PDFParser(in);
       parser.parse();
       s = _stripper.getText(parser.getDocument());
     }
     catch (Throwable t)
     {
       t.printStackTrace();
     }
     return s;
   }






On Wednesday, August 20, 2003, at 08:59 AM, Yang Sun wrote:

> Hi,
>     I am a newbie on Lucene. Now I want to index all my harddisk 
> contents for searching, these includes html file, pdf file, word file 
> and etc. But I have encounter a problem when I try to index pdf files, 
> I need your help.
>     My environment is lucene-1.3-rc (lucene-1.2 has also been tried), 
> jdk1.4.02, pdfbox-0.62. I try to index all my pdfs. There seems no 
> error when executing the indexing (I use StandardAnalyzer, you can 
> refer to my sources in the attachment). But when I search using the 
> keyword, I find a lot of useless results. The pdf haven't contain the 
> content I want. Can you help me with this problem.
>     In my attachment, I put my source files and the test pdf files. 
> After I use my program to index these three pdf files, it seems all 
> right then. But when I search the result using keyword "cisco" based 
> on the indexing result, I get three Hits as the result. But two of the 
> results do not contain the keyword "cisco", they are useless. I wonder 
> if the pdfbox wrong, so I print out the indexed content, it also does 
> not contain the keyword "cisco". I use Luke and my searcher program as 
> the searching client, it seems no problem.
>     Can anyone help me? Or any comments on this problem. Everyone is 
> welcome. My email is suny@ffcs.fujitsu.co.jp.
>     suny
>
> PS: sorry, I can not attach the files, this mailing list can not hold 
> attachment? So I have to put my source codes here. My three test pdf 
> files are totally 100k, if someone would like to help me test it, I 
> will be very appreciate.
>
> IndexPDF.java
>
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import java.io.File;
> import java.util.Date;
> /**
>  * FileName:
>  * User: Administrator
>  * Date: 2003-8-19
>  * Time: 23:18:30
>  * Functions:
>  */
> public class IndexPDF {
>   File indexFiles;            //the file or directory we want to index
>   public static void indexDocs(IndexWriter writer, File file) throws 
> Exception {
>     if (file.isDirectory()) {
>       String[] files = file.list();
>       for (int i = 0; i < files.length; i++) {
>         indexDocs(writer, new File(file, files[i]));
>       }
>     } else if (file.getPath().endsWith(".pdf")) {
>       System.out.println("adding " + file);
>       writer.addDocument(PdfDocument.Document(file));
>     } else {
>       System.out.println("Ignoring " + file);
>     }
>   }
>   public static void main(String args[]) throws Exception {
>     if (args.length != 1) {
>       System.out.println("Usage: IndexPDF <file/directory>");
>       return;
>     }
>     try {
>       Date start = new Date();
>       IndexWriter writer = new IndexWriter("E:/Index", new 
> StandardAnalyzer(), true);
>       indexDocs(writer, new File(args[0]));
>       writer.optimize();
>       writer.close();
>       Date end = new Date();
>       System.out.println(end.getTime() - start.getTime());
>       System.out.println(" total milliseconds");
>     } catch (Exception e) {
>       System.out.println(" caught a " + e.getClass() +
>           "\n with message: " + e.getMessage());
>     }
>   }
> }
>
> PdfDocument.java
>
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.pdfbox.pdfparser.PDFParser;
> import org.pdfbox.pdmodel.PDDocument;
> import org.pdfbox.util.PDFTextStripper;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.ByteArrayOutputStream;
> import java.io.OutputStreamWriter;
> import net.vicp.resshare.weblucene.document.PDFDocument;
> /**
>  * FileName:
>  * User: Administrator
>  * Date: 2003-8-19
>  * Time: 23:20:54
>  * Functions:
>  */
> public class PdfDocument {
>   /* lucene object which represent a single data file */
>   static Document doc = new Document();
>   public static Document Document(File f ) {
>     //set relative path to the path field in lucene
>     doc.add(Field.UnIndexed("path", f.getPath()));
>     System.out.println("Path is " + f.getPath());
>     // use 10000 as the limit for temporary use
>     doc.add(Field.Text("content", getPDFContent(f)));
>     doc.add(Field.UnIndexed("filetype", "pdf"));
>     doc.add(Field.UnIndexed("title", f.getName()));
>     return doc;
>   }
>   /**
>    * get the text content from the specified pdf file.
>    * @param f the pdf we should extract the content
>    * @return  the string contains the pdf content
>    */
>   private static String getPDFContent(File f) {
>     byte[] contents = null;
>     try {
>       FileInputStream is = new FileInputStream(f);
>       PDFParser parser = new PDFParser( is );
>       parser.parse();
>       PDDocument nbsp = parser.getPDDocument();
>       ByteArrayOutputStream out = new ByteArrayOutputStream();
>       OutputStreamWriter writer = new OutputStreamWriter( out );
>       PDFTextStripper stripper = new PDFTextStripper();
>       stripper.writeText(nbsp.getDocument(), writer);
>       writer.close();
>       contents = out.toByteArray();
>     } catch (Exception e) {
>       e.printStackTrace();
>       return "";
>     }
>     String ts = new String(contents);
>     System.out.println("the string length is"+contents.length+"\n");
>     return ts;
>   }
> }
>
>
>
> ---------------------------------
> Do you Yahoo!?
> Yahoo! SiteBuilder - Free, easy-to-use web site design software
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message