lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ian Lea" <ian....@gmail.com>
Subject Re: Beginner: Best way to index and display orginal text of pdfs in search results
Date Fri, 12 Dec 2008 09:49:20 GMT
Hi


Lucene can store the original text of the document.  You make the
lucene fields to do what you need.  Have a look at the apidocs for
Field.Store and you'll see that you've got three choices: Yes, No or
Compress.

For your display snapshots, have a look at the lucene highlighter package.

And all newcomers to Lucene could do a lot worse than getting hold of
a copy of Lucene in Action.  Somewhat out of date but the principles
are still valid.


--
Ian.

On Fri, Dec 12, 2008 at 8:34 AM, maxmil <mail@alwayssunny.com> wrote:
>
> Hi,
>
> This is the first time i am using Lucene.
>
> I need to index pdf's with very few fields, title, date and body (long
> field) for a web based search.
>
> The results i need to display have to show not only the documents found but
> for each document a snapshot of the text where the search term has been
> found. This is analogous to the way google displays search results. That is
> to say
>
>  ... some words and first instance of search Term and some more words ...
> some more words second instance of search term and some more words...
>
> etc.
>
> To do this i would need the original text of the document for each hit. As
> far as i understand Lucene does not save the original text of the document
> in the index.
>
> I am not using a database and would prefer not to have to store the original
> document text elsewhere.
>
> One way i could do this would be to take the hits from Lucene and reopen
> each pdf to extract the original text at run time however i fear that with
> many results this would be very slow.
>
> What would you recommend me to do?
>
> Thanks
>
> max
> --
> View this message in context: http://www.nabble.com/Beginner%3A-Best-way-to-index-and-display-orginal-text-of-pdfs-in-search-results-tp20971377p20971377.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message