lucene-java-user mailing list archives

From Thomas Arni
Subject Indexing PDF documents with structure information
Date Tue, 14 Aug 2007 06:28:32 GMT
Hello Luceners

I have started a new project and need to index PDF documents.
There are several projects around that can extract the content,
such as pdfbox, xpdf and pjclassic.

As far as I can tell from the FAQs and examples, all of these
tools support simple text extraction.

Which of these open source tools would you recommend most?

My PDF documents are quite long (more than 60 pages on average).
Therefore I would like to index additional structure information,
so that the user gets not only the whole document as a result
but also additional information, such as the page or the chapter
where the relevant information is.
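
To illustrate the requirement, here is a minimal sketch. It assumes the
text has already been extracted page by page by whichever tool is chosen,
and it uses a plain in-memory dictionary rather than Lucene. The names
build_page_index and search, and the sample data, are hypothetical and
only stand in for the real extraction and indexing steps.

```python
import re
from collections import defaultdict


def build_page_index(documents):
    """Map each lowercase term to the set of (doc_id, page_number) pairs containing it.

    `documents` maps a document id to a list of page texts (page 1 first).
    """
    index = defaultdict(set)
    for doc_id, pages in documents.items():
        for page_number, text in enumerate(pages, start=1):
            for term in re.findall(r"\w+", text.lower()):
                index[term].add((doc_id, page_number))
    return index


def search(index, term):
    """Return the sorted (doc_id, page_number) hits for a single term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample data standing in for text extracted page by page.
documents = {
    "manual.pdf": ["Installation and setup", "Configuring the index", "Index maintenance"],
    "report.pdf": ["Annual summary", "Index of figures"],
}

index = build_page_index(documents)
hits = search(index, "index")
print(hits)
```

Each hit identifies both the document and the page, which is the kind of
result I am after. In a real index the same idea would presumably mean
storing one entry per page (or per chapter) instead of one per document.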

Does anyone have similar requirements? Which of these tools
best fits my requirements?

Thanks for your help
