lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From maxmil <>
Subject Beginner: Best way to index and display orginal text of pdfs in search results
Date Fri, 12 Dec 2008 08:34:28 GMT


This is the first time i am using Lucene.

I need to index pdf's with very few fields, title, date and body (long
field) for a web based search.

The results i need to display have to show not only the documents found but
for each document a snapshot of the text where the search term has been
found. This is analogous to the way google displays search results. That is
to say

 ... some words and first instance of search Term and some more words ...
some more words second instance of search term and some more words... 


To do this i would need the original text of the document for each hit. As
far as i understand Lucene does not save the original text of the document
in the index.

I am not using a database and would prefer not to have to store the original
document text elsewhere.

One way i could do this would be to take the hits from Lucene and reopen
each pdf to extract the original text at run time however i fear that with
many results this would be very slow. 

What would you recommend me to do?


View this message in context:
Sent from the Lucene - Java Users mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message