lucene-java-user mailing list archives

From "Ian Lea" <>
Subject Re: Beginner: Best way to index and display original text of pdfs in search results
Date Fri, 12 Dec 2008 09:49:20 GMT

Lucene can store the original text of the document.  You set up the
Lucene fields to do what you need.  Have a look at the apidocs for
Field.Store and you'll see that you've got three choices: Yes, No or
Compress.
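
To make the stored-versus-not-stored distinction concrete, here is a toy model in plain Java. It is not Lucene code and not how Lucene is implemented; the class and method names (StoredFieldSketch, add, search, storedText) are made up for illustration. The boolean `store` flag stands in for the choice between Field.Store.YES and Field.Store.NO: either way the document is searchable, but the original text can only be read back if it was stored.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of stored vs. index-only fields; hypothetical, not Lucene's implementation. */
final class StoredFieldSketch {
    // term -> ids of the documents that contain it (the searchable part)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // doc id -> original text, kept only when the caller asked for it
    private final Map<Integer, String> stored = new HashMap<>();
    private int nextId = 0;

    /** Indexes the text; keeps the original only if store is true. Returns the doc id. */
    int add(String text, boolean store) {
        int id = nextId++;
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        if (store) {
            stored.put(id, text);
        }
        return id;
    }

    /** Returns the ids of all documents containing the term (case-insensitive). */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptySet());
    }

    /** Returns the original text, or null if the document was indexed without storing it. */
    String storedText(int id) {
        return stored.get(id);
    }
}
```

Both kinds of document come back from search(), but storedText() returns null for the one added with store=false, which is the situation where you would have to go back to the PDF.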

For your display snapshots, have a look at the Lucene highlighter package.
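
If it helps to see what such a snapshot amounts to, below is a rough plain-Java sketch of the idea: find each occurrence of the search term in the stored text and cut out a window of characters around it. This is a simplified stand-in, not the Lucene highlighter API, which works from the query and the analyzer rather than a raw substring match. The names (SnippetSketch, snippets, context, maxSnippets) and the <b> markers are assumptions of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified stand-in for a highlighter; hypothetical, not the Lucene highlighter API. */
final class SnippetSketch {
    /**
     * Returns up to maxSnippets fragments of text, each built around one
     * case-insensitive occurrence of term with up to `context` characters
     * on either side. The matched term is wrapped in <b>...</b>.
     */
    static List<String> snippets(String text, String term, int context, int maxSnippets) {
        List<String> out = new ArrayList<>();
        if (term.isEmpty()) {
            return out;
        }
        // Pattern.quote treats the term as a literal, so regex characters in it are harmless.
        Matcher m = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE).matcher(text);
        int from = 0;
        while (out.size() < maxSnippets && m.find(from)) {
            int start = Math.max(0, m.start() - context);
            int end = Math.min(text.length(), m.end() + context);
            out.add("... " + text.substring(start, m.start())
                    + "<b>" + m.group() + "</b>"
                    + text.substring(m.end(), end) + " ...");
            // Resume after this fragment so a later snippet does not repeat it.
            from = end;
        }
        return out;
    }
}
```

Calling snippets(body, "fox", 6, 3) on the stored body text gives at most three fragments per hit. The windows are cut by character count, so they can start or end mid-word; a real highlighter breaks on token boundaries instead.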

And all newcomers to Lucene could do a lot worse than getting hold of
a copy of Lucene in Action.  Somewhat out of date but the principles
are still valid.


On Fri, Dec 12, 2008 at 8:34 AM, maxmil <> wrote:
> Hi,
> This is the first time I am using Lucene.
> I need to index PDFs with very few fields: title, date and body (a long
> field), for a web-based search.
> The results I display have to show not only the documents found but also,
> for each document, a snapshot of the text where the search term was
> found. This is analogous to the way Google displays search results, that is
> to say:
>  ... some words and first instance of search term and some more words ...
> some more words second instance of search term and some more words ...
> etc.
> To do this I would need the original text of the document for each hit. As
> far as I understand, Lucene does not save the original text of the document
> in the index.
> I am not using a database and would prefer not to have to store the original
> document text elsewhere.
> One way I could do this would be to take the hits from Lucene and reopen
> each PDF to extract the original text at run time; however, I fear that with
> many results this would be very slow.
> What would you recommend I do?
> Thanks
> max
