lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Searching pdf, getting page number
Date Mon, 16 Oct 2006 12:15:24 GMT
Well, anything's possible <G>.

There's nothing magic about Lucene and its interaction with, say, a PDF
document. What you put into the index is all you can get out. So...

You could index the PDF document by pages. That is, each page is a Lucene
"document", and the pages of one PDF are related by some ID of your own
(NOT the Lucene doc_id, since that can change).
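
As a rough illustration of the per-page approach, here is a minimal sketch in
plain Java of the data each per-page Lucene document would need to carry. The
class name PageRecord, its field names, and the example ID "manual.pdf" are
all hypothetical, and the actual Lucene indexing calls are left out on purpose.

```java
import java.util.ArrayList;
import java.util.List;

/** One entry per PDF page: a sketch of the fields a per-page Lucene document would hold. */
final class PageRecord {
    final String pdfId;    // application-assigned ID shared by all pages (NOT the Lucene doc id)
    final int pageNumber;  // 1-based page number within the PDF
    final String text;     // extracted text of this page only

    PageRecord(String pdfId, int pageNumber, String text) {
        this.pdfId = pdfId;
        this.pageNumber = pageNumber;
        this.text = text;
    }

    /** Turns the per-page text of one PDF into one record per page. */
    static List<PageRecord> fromPages(String pdfId, List<String> pageTexts) {
        List<PageRecord> records = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            records.add(new PageRecord(pdfId, i + 1, pageTexts.get(i)));
        }
        return records;
    }
}
```

A hit on such a record gives you the page number directly. The trade-off is
that a phrase spanning a page break will not match.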

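You could index the document and give the first term of each page a large
position increment (a positionIncrementGap, if each page is added as a
separate value of the same field), then reconstruct the page from the
position of each hit.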
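
The arithmetic behind the position-gap idea can be sketched in plain Java. This
only simulates how positions would be assigned. It assumes every page has at
least one term and that the gap (the made-up constant PAGE_GAP) is larger than
the total number of terms in any one document. Nothing here calls Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates term positions when the first term of each later page gets a large increment. */
final class PageGapPositions {
    // Assumed to exceed the total term count of any single document.
    static final int PAGE_GAP = 1_000_000;

    /** Assigns a position to every term, page by page. */
    static List<Integer> assignPositions(List<String[]> pages) {
        List<Integer> positions = new ArrayList<>();
        int position = -1; // the first increment of 1 lands on position 0
        for (int p = 0; p < pages.size(); p++) {
            String[] terms = pages.get(p);
            for (int t = 0; t < terms.length; t++) {
                boolean firstTermOfLaterPage = (p > 0 && t == 0);
                position += firstTermOfLaterPage ? PAGE_GAP : 1;
                positions.add(position);
            }
        }
        return positions;
    }

    /** Recovers the 1-based page number from a term position. */
    static int pageForPosition(int position) {
        return position / PAGE_GAP + 1;
    }
}
```

Because the ordinary increments within a document add up to less than
PAGE_GAP, integer division by the gap counts how many page boundaries precede
a position.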

You could index meta-data in a field of the document giving the term offsets
of each page start and reconstruct which page it came from.
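
For the stored-offsets variant, the lookup is a binary search over the page
start offsets. The sketch below is plain Java under a few assumptions: the
offsets are stored as a space-separated string in some field of your choosing,
the list is strictly increasing, and the first page starts at offset 0. The
class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

/** Maps a term offset back to a page using the stored offsets of each page's first term. */
final class PageOffsets {

    /** Encodes page start offsets for storage in a metadata field, e.g. "0 312 640". */
    static String encode(int[] pageStarts) {
        return Arrays.stream(pageStarts)
                .mapToObj(Integer::toString)
                .collect(Collectors.joining(" "));
    }

    /** Decodes a non-empty string produced by encode back into offsets. */
    static int[] decode(String stored) {
        return Arrays.stream(stored.split(" ")).mapToInt(Integer::parseInt).toArray();
    }

    /** Returns the 1-based page containing termOffset; pageStarts must be sorted ascending. */
    static int pageForOffset(int[] pageStarts, int termOffset) {
        if (termOffset < 0) {
            throw new IllegalArgumentException("termOffset must be non-negative");
        }
        int idx = Arrays.binarySearch(pageStarts, termOffset);
        // Exact hit: idx is the page index. Miss: idx == -(insertionPoint) - 1,
        // and the containing page is the one just before the insertion point.
        int pageIndex = idx >= 0 ? idx : -idx - 2;
        return pageIndex + 1;
    }
}
```

With page starts of 0, 312 and 640, an offset of 100 falls on page 1, 312 on
page 2, and 700 on page 3.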

You could insert a special token at the beginning of each page. You'd have
to count the special tokens up to a hit to get its page.
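
The counting step for the special-token variant might look like the sketch
below, in plain Java. The marker string "__PAGE_BREAK__" is a made-up
placeholder, and the token list stands in for whatever token sequence your
application can recover for the document.

```java
import java.util.List;

/** Finds a hit's page by counting page-marker tokens that occur at or before it. */
final class PageMarkerCounter {
    // Hypothetical marker inserted at the beginning of every page at index time.
    static final String PAGE_MARKER = "__PAGE_BREAK__";

    /** Returns the 1-based page of the token at hitIndex (0 if no marker precedes it). */
    static int pageForTokenIndex(List<String> tokens, int hitIndex) {
        if (hitIndex < 0 || hitIndex >= tokens.size()) {
            throw new IndexOutOfBoundsException("hitIndex out of range: " + hitIndex);
        }
        int page = 0;
        for (int i = 0; i <= hitIndex; i++) {
            if (PAGE_MARKER.equals(tokens.get(i))) {
                page++;
            }
        }
        return page;
    }
}
```

The marker should be a string that cannot occur in real page text, otherwise
the count drifts.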

And on and on. The take-away here is that Lucene is a search *engine*, not a
complete package. You have to carefully construct your application around
Lucene to get this kind of meta-data out of it...

That said, there might already be a contribution and/or package out there
that does much of this for you, but I'm unaware of any...

Hope this helps at least a little.
Erick

On 10/16/06, Christoph Pächter <Paechter@htwg-konstanz.de> wrote:
>
> Hi,
>
> I know that I can index pdf-files (using a third-party library).
>
> Is it possible to search the index for a phrase, getting not only the
> document, but also the page number in the (pdf-)document?
> Or is it even possible to get a bookmark, leading to this page?
>
> I am thankful for any information you can provide me, either how to do
> this indexing and searching, or where I can find further information or
> example code.
>
> Kind regards
> Christoph
>