lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: PDF Indexing Issue
Date Mon, 28 Jun 2004 21:02:41 GMT
Don Vaillancourt wrote:
> I used the following code example, from an article linked from the 
> Jakarta site, to index PDF files:
> doc.add(Field.Text("content", new FileReader(f)));
> But I realized today that this method only indexes the PDF as is.  For 
> those wondering whether the PDFs were actually indexed, or whether they 
> maybe only contained images: I verified this with LUKE and those PDFs are 
> in there, but the only keywords that were indexed were the PDF definition 
> statements and encoded stuff.
> So what is the proper way to index a PDF?

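A FileReader just hands Lucene the raw bytes of the PDF file, which is 
why you only see PDF definition statements and encoded streams in the index.
The proper way is to first pass the PDF file through a PDF parser (e.g. 
PDFBox), extract the plain-text content (such as body, title, author, 
etc.), and only then add that plain-text content to the index.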
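
To make the order of the steps concrete, here is a minimal sketch. It is an 
illustration under stated assumptions, not PDFBox or Lucene code: the names 
ExtractedPdf, PdfTextExtractor and PdfIndexSketch are hypothetical, the parser 
is hidden behind an interface (a real implementation would delegate to a PDF 
parser such as PDFBox), and a toy in-memory index stands in for the Lucene 
index.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical holder for the plain text that a PDF parser returns.
final class ExtractedPdf {
    final String title;
    final String author;
    final String body;

    ExtractedPdf(String title, String author, String body) {
        this.title = title;
        this.author = author;
        this.body = body;
    }
}

// Hypothetical seam for the parser step. A real implementation would call
// a PDF parser such as PDFBox here; that wiring is deliberately left out.
interface PdfTextExtractor {
    ExtractedPdf extract(File pdf) throws IOException;
}

// Toy in-memory inverted index, standing in for the real index.
class PdfIndexSketch {
    // Maps "field:term" to the ids of the documents containing that term.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Step 1: parse the PDF. Step 2: index only the extracted plain text.
    void indexPdf(File pdf, PdfTextExtractor extractor) {
        ExtractedPdf text;
        try {
            text = extractor.extract(pdf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String docId = pdf.getPath();
        addField(docId, "title", text.title);
        addField(docId, "author", text.author);
        addField(docId, "content", text.body);
    }

    // Lower-cases the text and splits it on anything that is not a letter or digit.
    private void addField(String docId, String field, String text) {
        if (text == null) {
            return; // PDFs often lack metadata such as title or author
        }
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of documents whose given field contains the term.
    Set<String> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        // Stub extractor with made-up text, used in place of a real parser.
        PdfTextExtractor stub =
            pdf -> new ExtractedPdf("Annual Report", "D. Example", "Revenue grew in 2004.");

        PdfIndexSketch index = new PdfIndexSketch();
        index.indexPdf(new File("report.pdf"), stub);

        System.out.println(index.search("content", "revenue"));
        System.out.println(index.search("content", "endobj"));
    }
}
```

The point of the interface is only that the index never sees the file 
itself, so PDF syntax such as "endobj" cannot end up as a keyword. In your 
code, the extracted strings would go into the Field.Text call in place of 
the FileReader.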

Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer
