lucene-java-user mailing list archives

From Mathieu Lecarme
Subject Re: Indexing PDF documents with structure information
Date Tue, 14 Aug 2007 13:05:08 GMT
Thomas Arni wrote:
> Hello Luceners
> I have started a new project and need to index PDF documents.
> There are several projects around that can extract the content,
> such as pdfbox, xpdf and pjclassic.
> As far as I can tell from the FAQs and examples, all of these
> tools allow simple text extraction.
> Which of these open source tools would you recommend most?
pdftk or iText?
> My PDF documents are quite long (on average more than 60 pages).
> Therefore I would like to have additional structure information for
> indexing.
> That way the user gets not only the whole document as a result,
> but also additional information such as the page or chapter where
> the relevant information is.
The page is simple to extract; the chapter should be trickier, if the
document has internal links.
PDF readers accept an argument, as in an HTTP URL, to open a document
at a given page.
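
To illustrate the last point, here is a minimal sketch that builds such a link. It assumes the viewer honours the `#page=N` open parameter (Adobe's readers and most browser viewers do; other viewers may ignore it). `PageLinks` and `pageUrl` are hypothetical names, not part of Lucene or any PDF library.

```java
// Hypothetical helper, not part of Lucene or any PDF library.
final class PageLinks {
    private PageLinks() {}

    /** Builds a link asking the viewer to open the PDF at a 1-based page. */
    static String pageUrl(String pdfUrl, int pageNumber) {
        if (pageNumber < 1) {
            throw new IllegalArgumentException("page numbers start at 1");
        }
        // Drop any existing fragment so the result carries a single "#page=".
        int hash = pdfUrl.indexOf('#');
        String base = hash >= 0 ? pdfUrl.substring(0, hash) : pdfUrl;
        return base + "#page=" + pageNumber;
    }

    public static void main(String[] args) {
        // Example URL is made up for illustration.
        System.out.println(pageUrl("http://example.org/manual.pdf", 42));
    }
}
```

If the page number is stored with each indexed unit, a search hit can be turned into a link like this one.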
> Does anyone have similar requirements? Which of these tools
> best fits my requirements?

Have a look at "PDF Hacks" (ISBN: 0596006551). Once your document is
split, it will be easy to index.
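
A rough sketch of the "split, then index" idea, assuming the text of each page has already been extracted by one of the tools above. A plain Java map stands in for the real index here, and `PageIndex` is a hypothetical class, not Lucene API. With Lucene, each page would presumably become its own document carrying a page-number field.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a real index: maps each term to the pages containing it.
final class PageIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Indexes the text of one page; pageNumber is 1-based. */
    void addPage(int pageNumber, String text) {
        // Crude tokenizer: lowercases and splits on non-word characters (ASCII only).
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(pageNumber);
            }
        }
    }

    /** Returns the pages containing the term, in ascending order. */
    SortedSet<Integer> pagesFor(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        PageIndex index = new PageIndex();
        index.addPage(1, "Introduction to indexing");
        index.addPage(2, "Indexing PDF pages one by one");
        System.out.println(index.pagesFor("indexing"));
    }
}
```

Because every indexed unit is a single page, a hit directly tells the user which page holds the relevant text, not just which document.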

