lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas X Hoban" <tho...@caseknowledge.com>
Subject Re: Lucene - PDFBox
Date Wed, 25 May 2005 23:29:07 GMT
Thanks for your help.  I was using 0.7.0.  However, I installed 0.7.1 and 
get the same result with the ExtractText utility.  I will post an issue with 
the PDFBox sourceforge site.

Tom
----- Original Message ----- 
From: "Ben Litchfield" <ben@csh.rit.edu>
To: <java-user@lucene.apache.org>
Sent: Wednesday, May 25, 2005 5:18 PM
Subject: Re: Lucene - PDFBox


>
> There were some fixes around extra spaces in the 0.7.1 version of PDFBox,
> if you are not using that version please try it, otherwise post an issue
> on the PDFBox sourceforge site.
>
> http://sourceforge.net/tracker/?group_id=78314&atid=552832
>
> Thanks,
> Ben
>
>
>
> On Wed, 25 May 2005, Thomas X Hoban wrote:
>
>> Thanks for replying.
>>
>> When I run the command, it generates a file with a "txt" extension.  The
>> text in this file has spaces interspersed in odd spots.  Here is output 
>> from
>> a file I ran the command on...
>>
>> Marc h 29, 2005
>>
>> Hello t here m y good friend.
>>
>> HELLO
>>
>> Legal Soft w are is GOOD.
>>
>>
>>
>>
>> I would have expected this...
>>
>> GoOD
>>
>> March 29, 2005
>>
>> Hello there my good friend.
>>
>> HELLO
>>
>> Legal Software is GOOD.
>>
>> GoOD
>>
>>
>>
>> ----- Original Message -----
>> From: "Ben Litchfield" <ben@csh.rit.edu>
>> To: <java-user@lucene.apache.org>
>> Sent: Wednesday, May 25, 2005 4:38 PM
>> Subject: Re: Lucene - PDFBox
>>
>>
>> >
>> > Can you run the following command line application on the PDF to verify
>> > that the extracted text is correct
>> >
>> > java org.pdfbox.ExtractText <pdf-file>
>> >
>> > Ben
>> >
>> >
>> >
>> > On Wed, 25 May 2005, Thomas X Hoban wrote:
>> >
>> >>
>> >>
>> >> First, I am new to Lucene.
>> >>
>> >> Is there anyone out there who has had trouble getting hits when 
>> >> running
>> >> phrase queries against an index that contains content from PDF files.
>> >> For PDF documents, I create the document using
>> >> LucenePDFDocument.getDocument(file) and then add it to the index.  For
>> >> non-pdf documents, I create the document using
>> >> FileDocument.Document(file).
>> >>
>> >> For instance, I add documents with the following text:
>> >>
>> >> pdf1.pdf -- "Dave has good taste"
>> >> pdf2.pdf -- "Tom has good taste"
>> >> word1.doc -- "Liz has bad taste"
>> >> word2.doc -- "Troy has bad taste"
>> >>
>> >> When I search content for the following strings:
>> >>
>> >>     has good taste
>> >>       get expected results with hits on pdf1.doc, pdf2.doc, word1.doc 
>> >> and
>> >> word2.doc
>> >>
>> >>     "has good taste"
>> >>        get unexpected result: 0 hits
>> >>
>> >>     "has bad taste"
>> >>        get expected results with hits on word1.doc and word2.doc
>> >>
>> >> It seems that searching for individual words work fine for both PDF 
>> >> and
>> >> non-pdf files.  However, searching on a phrase (enclosed in quotes) 
>> >> works
>> >> on non-pdf files but not on files parsed with the LucenePDFDocument
>> >> class.
>> >>
>> >> Can anyone offer advise?
>> >>
>> >> Below is code for index creation.  It is the demo IndexFiles class
>> >> provided with Lucene along with some changes...
>> >>
>> >> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> >> import org.apache.lucene.index.IndexWriter;
>> >> import org.apache.lucene.document.Field;
>> >> import org.apache.lucene.document.Document;
>> >>
>> >> import java.io.File;
>> >> import java.io.FileNotFoundException;
>> >> import java.io.IOException;
>> >> import java.util.Date;
>> >>
>> >> //import javax.activation.MimetypesFileTypeMap;
>> >>
>> >> import org.pdfbox.searchengine.lucene.LucenePDFDocument;
>> >>
>> >>
>> >> class IndexFiles {
>> >>   public static void main(String[] args) throws IOException {
>> >>     String usage = "java " + IndexFiles.class + " <root_directory>";
>> >>     if (args.length == 0) {
>> >>       System.err.println("Usage: " + usage);
>> >>       System.exit(1);
>> >>     }
>> >>
>> >>     Date start = new Date();
>> >>     try {
>> >>       IndexWriter writer = new IndexWriter("index", new
>> >> StandardAnalyzer(), true);
>> >>       indexDocs(writer, new File(args[0]));
>> >>
>> >>       writer.optimize();
>> >>       writer.close();
>> >>
>> >>       Date end = new Date();
>> >>
>> >>       System.out.print(end.getTime() - start.getTime());
>> >>       System.out.println(" total milliseconds");
>> >>
>> >>     } catch (IOException e) {
>> >>       System.out.println(" caught a " + e.getClass() +
>> >>        "\n with message: " + e.getMessage());
>> >>     }
>> >>   }
>> >>
>> >>   public static void indexDocs(IndexWriter writer, File file)
>> >>     throws IOException {
>> >>     // do not try to index files that cannot be read
>> >>
>> >>     if (file.canRead()) {
>> >>       if (file.isDirectory()) {
>> >>         String[] files = file.list();
>> >>         // an IO error could occur
>> >>         if (files != null) {
>> >>           for (int i = 0; i < files.length; i++) {
>> >>             indexDocs(writer, new File(file, files[i]));
>> >>           }
>> >>         }
>> >>       } else {
>> >>         System.out.println("adding " + file);
>> >>         try {
>> >>
>> >>           Document doc = null;
>> >>           if (file.getName().indexOf(".pdf") >= 0)
>> >>               // 
>> >> writer.addDocument(LucenePDFDocument.getDocument(file));
>> >>               doc = LucenePDFDocument.getDocument(file);
>> >>           else
>> >>               doc = FileDocument.Document(file);
>> >>
>> >>           Field field = null;
>> >>           if (file.getPath().indexOf("case1") >=0)
>> >>               field = new Field("caseid", "1", false, true, false);
>> >>           else if (file.getPath().indexOf("case2") >=0)
>> >>               field = new Field("caseid", "2", false, true, false);
>> >>           else if (file.getPath().indexOf("case3") >=0)
>> >>               field = new Field("caseid", "3", false, true, false);
>> >>           else
>> >>               field = new Field("caseid", "0", false, true, false);
>> >>
>> >>           doc.add(field);
>> >>
>> >>           writer.addDocument(doc);
>> >>         }
>> >>         // at least on windows, some temporary files raise this 
>> >> exception
>> >> with an "access denied" message
>> >>         // checking if the file can be read doesn't help
>> >>         catch (FileNotFoundException fnfe) {
>> >>           ;
>> >>         }
>> >>       }
>> >>     }
>> >>   }
>> >> }
>> >>
>> >>
>> >> Here is the SearchFiles class with some minor modifications...
>> >>
>> >> import java.io.IOException;
>> >> import java.io.BufferedReader;
>> >> import java.io.InputStreamReader;
>> >> import java.util.StringTokenizer;
>> >>
>> >> import org.apache.lucene.analysis.Analyzer;
>> >> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> >> import org.apache.lucene.document.Document;
>> >> import org.apache.lucene.search.Searcher;
>> >> import org.apache.lucene.search.IndexSearcher;
>> >> import org.apache.lucene.search.Query;
>> >> import org.apache.lucene.search.BooleanQuery;
>> >> import org.apache.lucene.search.PhraseQuery;
>> >> import org.apache.lucene.search.Hits;
>> >> import org.apache.lucene.index.Term;
>> >> import org.apache.lucene.queryParser.QueryParser;
>> >> import org.apache.lucene.queryParser.ParseException;
>> >>
>> >> class SearchFiles {
>> >>
>> >>   private static Query getCaseQuery(String line, Analyzer analyzer)
>> >>   throws ParseException {
>> >>       BooleanQuery bq = new BooleanQuery();
>> >>       StringTokenizer st = new StringTokenizer(line);
>> >>       Query query = QueryParser.parse(line, "contents", analyzer);
>> >>       String caseId = null;
>> >>       while (st.hasMoreTokens()) {
>> >>           caseId = st.nextToken();
>> >>           System.out.println("build case query for " + caseId);
>> >>
>> >>           query = QueryParser.parse(caseId, "caseid", analyzer);
>> >>           bq.add(query, false, false);
>> >>       }
>> >>
>> >>       return bq;
>> >>   }
>> >>   public static void main(String[] args) {
>> >>     try {
>> >>       Searcher searcher = new IndexSearcher("index");
>> >>       Analyzer analyzer = new StandardAnalyzer();
>> >>
>> >>       BufferedReader in = new BufferedReader(new
>> >> InputStreamReader(System.in));
>> >>       while (true) {
>> >>         System.out.print("Query: ");
>> >>         String line = in.readLine();
>> >>         System.out.print("Cases: ");
>> >>         String caseLine = in.readLine();
>> >>         Query caseQuery = getCaseQuery(caseLine, analyzer);
>> >>
>> >>         if (line.length() == -1)
>> >>           break;
>> >>
>> >>
>> >>         Query query = QueryParser.parse(line, "contents", analyzer);
>> >>         // PhraseQuery query = new PhraseQuery();
>> >>         // query.add(new Term("contents",line));
>> >>         System.out.println("Searching for: " +
>> >> query.toString("contents"));
>> >>         /*
>> >>         BooleanQuery wholeQuery = new BooleanQuery();
>> >>         wholeQuery.add(caseQuery, true, false);
>> >>         wholeQuery.add(query,     true, false);
>> >>         Hits hits = searcher.search(wholeQuery);
>> >>         */
>> >>         Hits hits = searcher.search(query);
>> >>         System.out.println(hits.length() + " total matching 
>> >> documents");
>> >>
>> >>         final int HITS_PER_PAGE = 10;
>> >>         for (int start = 0; start < hits.length(); start +=
>> >> HITS_PER_PAGE) {
>> >>           int end = Math.min(hits.length(), start + HITS_PER_PAGE);
>> >>           for (int i = start; i < end; i++) {
>> >>             Document doc = hits.doc(i);
>> >>             String path = doc.get("path");
>> >>             if (path != null) {
>> >>               System.out.println(i + ". " + path);
>> >>             } else {
>> >>               String url = doc.get("url");
>> >>               if (url != null) {
>> >>                 System.out.println(i + ". " + url);
>> >>                 System.out.println("   - " + doc.get("title"));
>> >>               } else {
>> >>                 System.out.println(i + ". " + "No path nor URL for 
>> >> this
>> >> document");
>> >>               }
>> >>             }
>> >>           }
>> >>
>> >>           if (hits.length() > end) {
>> >>             System.out.print("more (y/n) ? ");
>> >>             line = in.readLine();
>> >>             if (line.length() == 0 || line.charAt(0) == 'n')
>> >>               break;
>> >>           }
>> >>         }
>> >>       }
>> >>       searcher.close();
>> >>
>> >>     } catch (Exception e) {
>> >>       System.out.println(" caught a " + e.getClass() +
>> >>                          "\n with message: " + e.getMessage());
>> >>     }
>> >>   }
>> >> }
>> >>
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message