lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Need Lucene Compression help -- can pay nominal fee
Date Thu, 07 Jun 2007 18:40:15 GMT

: I need to store all the attributes of the document i index as part of the
: index. And I need to get the size of the files as close to 20% of the
: original size as possible. If anyone can help with this I can pay a nominal
: fee. Please contact me if anyone can help.

Let's be clear about something here: Lucene is an "indexing" technology,
not a compression technology.  Lucene indexes *can* be smaller then the
source corpus they index for a variety of reasons (most notably because
the inverted index structure allows a more susinect represetation of the
same words appearing in multiple docuemnts, but things like stop word
removal play a big part as well) but this reduced size only applies to the
*indexed* fields.  If you "store" data in Lucene there is nothing
intrinisic in the Lucene that makes it smaller.  If you have a chunk of
text, and you tell Lucene to store that text, and index that text, it
doesn't matter how space efficient the "index" is, Lucene still still
needs to store all of that text.

Here's a few simple steps to achieve what you want, and i won't charge you
anything for it...

  1) take your existing code for building an index, and change it so that
all of the options to "store" field values are commented out, and only the
"indexed" fields exist.
  2) index your corpus of size X, note the size of the index Y.  compute
the value Z such that:    Z = (0.20 * X) - Y
  3) go find a compression library A that can compress your corpus of size
X down to size Z
  4) go edit your orriginal code to use algorithm A to compress all of
your stored fields before adding them to Lucene.

Viola!



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message