lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Libbrecht <>
Subject Re: Beginner: Best way to index and display orginal text of pdfs in search results
Date Fri, 12 Dec 2008 11:47:23 GMT

I also encountered these options of the Field constructor but I never  
found a way to be sure that the field is really not loaded in RAM and  
only return with Field.reader(). There seems to be no contract in the  
Moreover the reader access methods went away between 1.9 and 2.2 if I  
don't mistake... so I had the impression it was not wanted to store  
"blobs" in the index.

Also, reader is not enough to do a decent job to store PDFs.
It should be a binary format (so getBinaryValue() should be used) and  
it should be an input-stream and not an in-memory array!

Echoes of a long frustrated user which implemented its own "mass- 
storage" because of that.
thanks for hints and even contradictions!


Le 12-déc.-08 à 10:49, Ian Lea a écrit :
> Lucene can store the original text of the document.  You make the
> lucene fields to do what you need.  Have a look at the apidocs for
> Field.Store and you'll see that you've got three choices: Yes, No or
> Compress.
> For your display snapshots, have a look at the lucene highlighter  
> package.
> And all newcomers to Lucene could do a lot worse than getting hold of
> a copy of Lucene in Action.  Somewhat out of date but the principles
> are still valid.

View raw message