lucene-java-user mailing list archives

From Adrien Grand <jpou...@gmail.com>
Subject Re: Canonicalize stored fields (small set of possible values)
Date Tue, 15 Mar 2016 16:41:20 GMT
On Tue, 15 Mar 2016 at 17:33, Andreas Sewe <andreas.sewe@codetrails.com>
wrote:

> I am afraid I don't understand. Do you suggest using IntFields as ID
> instead of StringFields, as they are presumably stored more efficiently?
>

Exactly. Integers are stored using zig-zag encoding followed by
variable-byte encoding. So numbers between -64 and 63 use 1 byte, numbers
between -8192 and 8191 use 2 bytes, etc.
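
To make the byte counts above concrete, here is a minimal JDK-only sketch of zig-zag plus variable-byte encoding. It is an illustration of the general technique, not Lucene's own code; the class name `ZigZagVarint` and its method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only: class and method names are hypothetical,
// not part of the Lucene API.
public class ZigZagVarint {

    // Zig-zag maps signed ints to unsigned ones so that values of small
    // magnitude get small codes: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    public static int zigZagEncode(int v) {
        return (v << 1) ^ (v >> 31);
    }

    // Inverse of zigZagEncode.
    public static int zigZagDecode(int z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Variable-byte encoding of the zig-zagged value: 7 payload bits per
    // byte, with the high bit set on every byte except the last.
    public static byte[] writeZInt(int v) {
        int z = zigZagEncode(v);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7F) != 0) {
            out.write((z & 0x7F) | 0x80);
            z >>>= 7;
        }
        out.write(z);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] samples = {-64, 63, -65, 64, -8192, 8191, 8192};
        for (int v : samples) {
            System.out.println(v + " -> " + writeZInt(v).length + " byte(s)");
        }
    }
}
```

One byte carries 7 payload bits, which after zig-zag covers -64 to 63; two bytes carry 14 bits, which covers -8192 to 8191.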


> > Otherwise, even without doing anything, things
> > should not be too bad thanks to stored fields compression.
>
> AFAICT, the fields are not compressed on disk right now. At least, "grep
> -c" finds my field over and over in the index files.
>
> So, how do I enable stored fields compression? Googling turned up
> Store.COMPRESS, but that doesn't exist in 5.2.1.
>

Compression is on by default, but we split the stored fields file into
blocks of 16KB and compress each block individually. So each 16KB block
still needs to store every value at least once before the compression
algorithm can make references to it. This is why grep still finds your
field value many times in the index files.
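
As a rough illustration of why per-block compression leaves one copy of a value in every block, the sketch below compresses the same 16KB of data twice: once as two independent blocks, once as a single stream. It uses the JDK's `java.util.zip.Deflater` as a stand-in (not LZ4, and not Lucene's stored fields code), and the random payload is just a placeholder for content that has not been seen earlier in a block. The class name `BlockCompressionSketch` is made up for this example.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.zip.Deflater;

// Illustrative sketch only: JDK Deflater stands in for the real codec.
public class BlockCompressionSketch {
    static final int BLOCK_SIZE = 16 * 1024;

    // Returns the number of bytes produced by deflating the input.
    public static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // A 16KB payload with no internal redundancy (seeded for repeatability).
    public static byte[] sampleBlock() {
        byte[] block = new byte[BLOCK_SIZE];
        new Random(42).nextBytes(block);
        return block;
    }

    // Two identical blocks compressed independently: each one has to
    // store the payload in full, so the cost is paid twice.
    public static int separately(byte[] block) {
        return 2 * deflatedSize(block);
    }

    // The same two blocks compressed as one stream: the second copy can
    // be encoded as back-references to the first.
    public static int together(byte[] block) {
        byte[] both = Arrays.copyOf(block, 2 * block.length);
        System.arraycopy(block, 0, both, block.length, block.length);
        return deflatedSize(both);
    }

    public static void main(String[] args) {
        byte[] block = sampleBlock();
        System.out.println("two independent blocks: " + separately(block) + " bytes");
        System.out.println("one combined stream:    " + together(block) + " bytes");
    }
}
```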

If you want to enable stronger compression, you can do
`indexWriterConfig.setCodec(new Lucene54Codec(Mode.BEST_COMPRESSION))`
which will use DEFLATE instead of LZ4 to compress blocks. In addition to
removing duplicates like LZ4, DEFLATE also applies Huffman coding, so
you should see better compression if your field values use some
symbols much more frequently than others.
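
To give a feel for the Huffman part in isolation, here is a small JDK-only sketch that runs `java.util.zip.Deflater` with the `HUFFMAN_ONLY` strategy, so no duplicate removal takes place. It compares 16KB of bytes with a skewed symbol distribution against 16KB of uniformly random bytes. This is not what Lucene does internally; the class name `HuffmanSketch` and the chosen distributions are assumptions made purely for illustration.

```java
import java.util.Random;
import java.util.zip.Deflater;

// Illustrative sketch only: isolates DEFLATE's Huffman stage using the JDK.
public class HuffmanSketch {
    static final int SIZE = 16 * 1024;

    // Deflates the input with string matching disabled (Huffman coding only)
    // and returns the compressed size in bytes.
    public static int huffmanOnlySize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setStrategy(Deflater.HUFFMAN_ONLY);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // About 90% of the bytes are 'A'; the rest are spread over 'B'..'E'.
    public static byte[] skewed() {
        Random rnd = new Random(42);
        byte[] data = new byte[SIZE];
        for (int i = 0; i < data.length; i++) {
            data[i] = rnd.nextInt(100) < 90 ? (byte) 'A' : (byte) ('B' + rnd.nextInt(4));
        }
        return data;
    }

    // All 256 byte values are equally likely, so Huffman coding has
    // nothing to exploit.
    public static byte[] uniform() {
        byte[] data = new byte[SIZE];
        new Random(42).nextBytes(data);
        return data;
    }

    public static void main(String[] args) {
        System.out.println("skewed symbols:  " + huffmanOnlySize(skewed()) + " bytes");
        System.out.println("uniform symbols: " + huffmanOnlySize(uniform()) + " bytes");
    }
}
```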
