lucene-java-user mailing list archives

From "jim shirreffs" <j...@verizon.net>
Subject Indexing PDF document
Date Wed, 06 Jun 2007 22:16:54 GMT
Well, I got nowhere trying to index OpenOffice documents, so I thought I'd try 
indexing PDF documents. PDFBox seemed like a good bet: it claims to offer 
Lucene support and is on the Lucene recommended list. But after numerous 
failed attempts I decided to try the IndexFiles.java that comes with PDFBox, 
and I get the same error my modified Lucene demo code gets.

C:\PDFBox-0.7.3\classes>java org.pdfbox.searchengine.lucene.IndexFiles -create -index c:\index c:\test 
root=c:\test
Skipping c:\test\HTMLParser.java
Skipping c:\test\SearchFiles.java
Indexing PDF document: c:\test\doc.pdf
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.lucene.document.Document.add(Lorg/apache/lucene/document/Field;)V
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.addUnindexedField(LucenePDFDocument.java:224)
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.convertDocument(LucenePDFDocument.java:265)
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:377)
        at org.pdfbox.searchengine.lucene.IndexFiles.addDocument(IndexFiles.java:295)
        at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:269)
        at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:236)
        at org.pdfbox.searchengine.lucene.IndexFiles.indexDocs(IndexFiles.java:223)
        at org.pdfbox.searchengine.lucene.IndexFiles.index(IndexFiles.java:165)
        at org.pdfbox.searchengine.lucene.IndexFiles.main(IndexFiles.java:140)


This is quite curious, since my code to index text documents does this 
successfully:

  /*
   * Add title
   */
  document.add(new Field("title", title, Field.Store.YES, Field.Index.UN_TOKENIZED));

And looking at the failing PDFBox code, it is doing the EXACT SAME THING:

  document.add( new Field( name, value, Field.Store.YES, Field.Index.NO ) );


Very strange, since the exception is a NoSuchMethodError for Document.add(Field).

My custom code doing a doc.add(Field) works, but PDFBox's code doing a 
doc.add(Field) does not.
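
A NoSuchMethodError is raised at run time rather than at compile time, so it generally means the calling class was compiled against a different version of a library than the one the JVM actually loaded. The JVM matches the full descriptor shown in the message, here add(Lorg/apache/lucene/document/Field;)V (an add method taking a Field and returning void), so a method with the same name but a different parameter or return type does not satisfy the call. The following is a minimal sketch, not part of PDFBox or Lucene, for checking which signatures are really on the classpath. The class name SignatureCheck and its two helper methods are made up for illustration; it relies only on standard java.lang.reflect and java.security calls.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper (not from PDFBox or Lucene).
public class SignatureCheck {

    /** Lists the signatures of all public methods with the given name on the named class. */
    public static List<String> signatures(String className, String methodName)
            throws ClassNotFoundException {
        Class<?> cls = Class.forName(className);
        List<String> found = new ArrayList<String>();
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(methodName)) {
                // Method.toString() includes the return type and parameter types.
                found.add(m.toString());
            }
        }
        return found;
    }

    /** Reports where the named class was loaded from (jar or directory), if known. */
    public static String loadedFrom(String className) throws ClassNotFoundException {
        CodeSource src = Class.forName(className).getProtectionDomain().getCodeSource();
        // Classes from the JDK itself typically have no code source.
        return src == null ? "(no code source, probably a JDK class)"
                           : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java SignatureCheck <class name> <method name>");
            return;
        }
        System.out.println(args[0] + " loaded from " + loadedFrom(args[0]));
        for (String signature : signatures(args[0], args[1])) {
            System.out.println("  " + signature);
        }
    }
}
```

Run with the same classpath as the failing command, for example java SignatureCheck org.apache.lucene.document.Document add. If none of the printed signatures takes a Field and returns void, the PDFBox classes were probably built against a different Lucene release than the jar being loaded, which would also explain why code compiled locally against that jar works.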

As a check for a classpath problem, I tried this:

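import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

// KcmiDocument is my own class; it builds a Lucene Document via doc.add(Field).
public class IndexMain
{
     public void indexDoc(String filename, String title, String objectId, String nodeId) throws Exception
     {
          File INDEX_DIR = new File("index");
          KcmiDocument kcmiDoc=null;
          Document pdfDocument=null;
          LucenePDFDocument lpdf = new LucenePDFDocument();

          IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer());

          File file = new File(filename);

          // PDFs go through PDFBox; everything else goes through my own code.
          if (filename.endsWith("pdf"))
               pdfDocument = lpdf.getDocument(file);
          else
               kcmiDoc = new KcmiDocument(objectId, title);
     }
}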

Here KcmiDocument does the doc.add(Field) itself, and lpdf.getDocument does 
the doc.add(Field) inside PDFBox.

When I send in a .txt file all is well; when I send in a .pdf file the 
exception is thrown.

If anyone knows what I am doing wrong, or knows of another easy way to extract 
text from a PDF file, I would certainly like to know. I can live without 
OpenOffice (for a while), but not being able to index PDF would be a Lucene 
show stopper.


thanks
jim s