lucene-java-user mailing list archives

From Li Li <>
Subject Re: effectiveness of compression
Date Wed, 15 Feb 2012 09:54:29 GMT
For now, Lucene doesn't provide anything like this.
Maybe you can diff each version before adding it to the index, so that it
only indexes and stores the difference for each newer version.
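To make the suggestion concrete, here is a minimal sketch of the diff-before-indexing idea using only the JDK. It is not Lucene API: the class name, the naive line-level set diff, and the idea of storing only the returned lines in the new version's Lucene document are all illustrative assumptions; a real implementation would use a proper diff library (e.g. java-diff-utils) and keep enough information to reconstruct the full text.

```java
import java.util.*;

// Hypothetical sketch: before adding version N+1 to the index, compute
// which lines are new relative to version N and store only those.
// A real system would use a proper diff (LCS-based) so that moved or
// repeated lines are handled correctly; this set-based version is only
// meant to illustrate the idea of indexing deltas instead of full copies.
public class VersionDiff {

    // Returns the lines of `current` that do not appear anywhere in
    // `previous`. These are the candidate lines to store for the new
    // version; unchanged lines are already in the index.
    static List<String> newLines(String previous, String current) {
        Set<String> seen = new HashSet<>(Arrays.asList(previous.split("\n")));
        List<String> added = new ArrayList<>();
        for (String line : current.split("\n")) {
            if (!seen.contains(line)) {
                added.add(line);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        String v1 = "alpha\nbeta\ngamma";
        String v2 = "alpha\nbeta revised\ngamma\ndelta";
        // Only the changed and added lines need storing for v2.
        System.out.println(newLines(v1, v2)); // [beta revised, delta]
    }
}
```

The trade-off is that a query can no longer retrieve the full text of a later version from a single stored document; you would need to reassemble it from the chain of deltas at retrieval time.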

On Wed, Feb 15, 2012 at 4:25 PM, Jamie <> wrote:

> Greetings All.
> I'd like to index data corresponding to different versions of the same
> file. These files consists of PDF documents, word documents, and the like.
> So as to ensure that no information is lost, I'd like to create a new
> Lucene document for every version (or change) in a file. Each version of a
> file will have text added and removed; however, there is likely to be a
> high degree of data duplication across the different versions. Assuming this
> indexed data is largely tokenized, to what extent will Lucene compress the
> data? Will it take into account that the data already exists in the index?
> I am worried about our index size growing too large when pursuing this
> strategy (i.e. one of creating a new Lucene document for every version of a
> file).
> Many thanks for your consideration.
> Jamie
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
