lucene-java-user mailing list archives

From Mathieu Lecarme <math...@garambrogne.net>
Subject Re: Indexing PDF documents with structure information
Date Tue, 14 Aug 2007 13:05:08 GMT
Thomas Arni wrote:
> Hello Luceners
>
> I have started a new project and need to index pdf documents.
> There are several projects around that can extract the content,
> like pdfbox, xpdf and pjclassic.
>
> As far as I can tell from the FAQs and examples, all these
> tools allow simple text extraction.
>
> Which of these open source tools would you recommend the most?
What about pdftk or iText?
>
> My PDF documents are quite long (on average more than 60 pages).
> Therefore I would like to have additional structure information for
> indexing.
> That way the user gets not only the whole document as a result,
> but also additional information such as the page or the chapter where
> the relevant information is.
The page is simple to extract. The chapter is trickier, and depends on
the document having internal links.
PDF readers accept an argument, as in an HTTP URL, to open a document
at a given page.
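
As an illustration of the page argument mentioned above, here is a minimal sketch that builds such a link. It assumes the viewer honors the "#page=N" open parameter with 1-based page numbers, as Adobe Reader and many browser viewers do; other viewers may ignore it. The class and method names are hypothetical.

```java
// Sketch: build a link that asks the viewer to open a PDF at a given page.
// Assumption: the viewer understands the "#page=N" fragment (1-based).
final class PdfPageLink {
    static String pageLink(String pdfUrl, int pageNumber) {
        if (pageNumber < 1) {
            throw new IllegalArgumentException("page numbers start at 1");
        }
        // Drop any existing fragment so the result never contains two '#'.
        int hash = pdfUrl.indexOf('#');
        String base = hash >= 0 ? pdfUrl.substring(0, hash) : pdfUrl;
        return base + "#page=" + pageNumber;
    }
}
```

A search result that stores the page number alongside the file path could then be rendered as a link built with pageLink(path, page).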
>
> Does anyone have similar requirements? Which of these tools
> best fits my requirements?

Have a look at "PDF Hacks" (ISBN: 0596006551). Once your document is
split into pages, it will be easy to index.
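
To make the "split, then index" idea concrete, here is a rough sketch in plain Java. It assumes the text of each page has already been extracted by one of the tools discussed, and it uses a small in-memory map as a stand-in for Lucene; with Lucene, each page would presumably become its own document carrying the file path and page number as stored fields. The class name, method names, and tokenization rule are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch: one index entry per page, so a hit can report the page number.
// This is an illustrative stand-in, not Lucene's API.
final class PageIndexSketch {
    // term -> sorted set of 1-based page numbers that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Index the already-extracted text of a single page.
    void addPage(int pageNumber, String text) {
        // Naive tokenizer: lowercase, split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when text starts with a separator
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
        }
    }

    // Return the pages on which the term occurs (empty if it never occurs).
    SortedSet<Integer> pagesFor(String term) {
        SortedSet<Integer> pages = postings.get(term.toLowerCase(Locale.ROOT));
        return pages == null ? new TreeSet<Integer>() : pages;
    }
}
```

Calling addPage once per extracted page and then pagesFor("indexing") returns the page numbers to show next to the document in the result list.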

M.



