lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: PDF Indexing Issue
Date Mon, 28 Jun 2004 21:02:41 GMT
Don Vaillancourt wrote:
> I used the following code example from an article that I linked off of 
> jakarta's site to index PDF files:
> 
> doc.add(Field.Text("content", new FileReader(f)));
> 
> But I realized today that this method only indexes the PDF as is.  For 
> those wondering whether the PDFs were actually indexed, or whether they 
> only contained images: I verified this with Luke and those PDFs are 
> in there, but the only keywords that were indexed were the PDF definition 
> statements and encoded stuff.
> 
> So what is the proper way to index a PDF?

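The proper way is to first pass the PDF file through a PDF parser (e.g. 
PDFBox), extract the plain-text content (body, title, author, etc.), 
and only then add that plain text to the index. A FileReader just hands 
Lucene the raw bytes of the file, which is why only PDF syntax and 
encoded data ended up as terms.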
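
To make the order of operations concrete, here is a minimal, 
library-free sketch of "extract first, then index". Everything in it is 
a stand-in I made up for illustration: TextExtractor is a hypothetical 
interface playing the role of a real parser such as PDFBox, 
fakeCompressedFile() imitates a file whose text is stored compressed (as 
PDF content streams usually are), and terms() is a toy replacement for a 
Lucene analyzer. It is not PDFBox or Lucene code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class ExtractThenIndex {

    /** Hypothetical stand-in for a real PDF parser such as PDFBox. */
    interface TextExtractor {
        String extract(byte[] rawFile);
    }

    /** Imitates a file whose text is stored compressed (assumption for the demo). */
    static byte[] fakeCompressedFile(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Toy extractor: inflates the bytes back into the original text. */
    static final TextExtractor INFLATING_EXTRACTOR = raw -> {
        Inflater inflater = new Inflater();
        inflater.setInput(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input; stop instead of looping
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid compressed stream", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    };

    /** Toy analyzer: lowercases and splits on anything that is not a-z. */
    static Set<String> terms(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] file = fakeCompressedFile(
                "Lucene indexes plain text. Lucene does not parse PDF files.");

        // What new FileReader(f) amounts to: analyzing the raw bytes as if they were text.
        Set<String> rawTerms = terms(new String(file, StandardCharsets.ISO_8859_1));

        // What the reply recommends: extract the text first, then analyze it.
        Set<String> extractedTerms = terms(INFLATING_EXTRACTOR.extract(file));

        System.out.println("terms from raw bytes:      " + rawTerms);
        System.out.println("terms from extracted text: " + extractedTerms);
    }
}
```

In a real setup the extractor would wrap the parser's text-stripping 
call (PDFBox provides a PDFTextStripper class for this, though package 
names and loading calls differ between versions), and the resulting 
String, rather than a FileReader on the PDF, is what would go into the 
"content" field.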


-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

