lucene-java-user mailing list archives

From xx28 <x...@drexel.edu>
Subject RE: [ANN] PDFBox 0.6.0
Date Thu, 06 Mar 2003 14:30:47 GMT
Ben,

I downloaded PDFBox and installed it, and I can run:
 java org.pdfbox.Main <PDF-file> <output-text-file>
to convert a .pdf file to a text file.

Then I tried to integrate it with Lucene. I added the following code to
IndexHTML.java:

else if (file.getPath().endsWith(".pdf")) {
    Document doc = LucenePDFDocument.getDocument(file);
    System.out.println("adding " + "pdf files");
    writer.addDocument(doc);
}

It compiled without errors under ant (ant wardemo). However, when I ran:
java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..

it did not seem to pick up the new IndexHTML.java, and it still did not index
.pdf files.
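
One way to check whether the JVM is running the rebuilt class or an older copy elsewhere on the classpath (for example a previously built jar; that is only a guess, nothing above confirms it) is to print where the class was loaded from. This is a minimal sketch using only the standard library. WhichClass and locationOf are hypothetical names chosen for illustration, not part of Lucene or PDFBox.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper: reports the jar or directory a class was loaded from.
final class WhichClass {

    /** Returns the code-source location of cls, or a note if it has none. */
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader (e.g. java.lang.String)
        // have no code source.
        if (source == null) {
            return "(bootstrap class path)";
        }
        URL location = source.getLocation();
        return location == null ? "(unknown location)" : location.toString();
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java WhichClass <fully.qualified.ClassName>");
            return;
        }
        try {
            // Load without running static initializers; only the location matters.
            Class<?> cls = Class.forName(args[0], false, WhichClass.class.getClassLoader());
            System.out.println(args[0] + " loaded from " + locationOf(cls));
        } catch (ClassNotFoundException e) {
            System.out.println(args[0] + " is not on the classpath");
        }
    }
}
```

Run it with the same classpath as the indexing command, passing org.apache.lucene.demo.IndexHTML as the argument. If the reported location is not the directory or jar that ant just rebuilt, an older copy of the class is being loaded ahead of the new one.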


Did I miss something here?

Regards,

George

>===== Original Message From Lucene Users List <lucene-user@jakarta.apache.org> =====
>I would like to announce the next release of PDFBox.  PDFBox allows for
>PDF documents to be indexed using lucene through a simple interface.
>Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument,
>which will extract all text and PDF document summary properties as lucene
>fields.
>
>You can obtain the latest release from http://www.pdfbox.org
>
>Please send all bug reports to me and attach the PDF document when
>possible.
>
>RELEASE 0.6.0
>-Massive improvements to memory footprint.
>-Must call close() on the COSDocument (LucenePDFDocument does this for you)
>-Really fixed the bug where small documents were not being indexed.
>-Fixed bug where no whitespace existed between obj and start of object.
>    Exception in thread "main" java.io.IOException: expected='obj'
>    actual='obj<</Pro
>-Fixed issue with spacing where textLineMatrix was not being copied
> properly
>-Fixed 'bug' where parsing would fail with some pdfs with double endobj
> definitions
>-Added PDF document summary fields to the lucene document
>
>
>Thank you,
>Ben Litchfield
>http://www.pdfbox.org
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org



