lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Becker <pbec...@dstc.edu.au>
Subject Re: about PDF / HTML index
Date Tue, 15 Jul 2003 23:07:13 GMT
Hi Alvaro,

there are some examples in our code here -- working with a slightly 
similar interface to the Ant task in the Lucene contributions.

  
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/indexer/documenthandler/

The actual step of turning it into a Lucene Document happens here:

  
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/indexer/DocumentProcessingFactory.java?rev=1.30&content-type=text/vnd.viewcvs-markup

This code is still work in progress, but it does work -- we are running 
it on a few ten thousand documents from time to time. Both PDFBox and 
Multivalent fail to read some PDF documents in the collection, but so 
does Acrobat Reader. We still have to do a more formal test to see which 
one does a better job, at the moment we are still coding the core bits, 
then we test properly.

HTH,
    Peter



alvaro z wrote:

>im using lucene with TXT and HTML files , its working.
>
>the only problem with HTML files is that i have to index html files as txt first , before
to index them as HTML.
>
>do anyone have try to index pdf files ? 
>
>im trying the pdfbox , is there any samples for indexing pdf files ? (i dont find any
samples to do that) with any of the parsers (pdfbox, jpedal ,etc).
>
>thanks for helping,
>
>Alvaro. from Lima - Peru
>
>
>---------------------------------
>Do you Yahoo!?
>SBC Yahoo! DSL - Now only $29.95 per month!
>  
>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message