lucene-dev mailing list archives

From "Robert Engels" <>
Subject RE: Binary fields and data compression
Date Tue, 31 Aug 2004 00:19:21 GMT
My estimates are based on our own projects, where we have seen that wrapping
an InputStream in a DeflaterInputStream takes about 20% of the CPU time, so
whether to actually use it or not will depend on whether the IndexReader is
performance bound by the CPU or by IO.
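For concreteness, here is a minimal sketch of the kind of stream wrapping in question, using the java.util.zip classes (InflaterInputStream on the read side, DeflaterOutputStream on the write side). The sample text and the timing printout are illustrative only, not the project measurement referred to above:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class CompressionOverhead {
    // Compress a byte[] with the default zlib level.
    static byte[] compress(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
            dos.write(data);
        }
        return bos.toByteArray();
    }

    // Decompress by wrapping the raw stream, as described above.
    static byte[] decompress(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (InflaterInputStream iis =
                 new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = iis.read(buf)) > 0) bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Roughly 4KB of repetitive "plain text" (the field sizes under discussion).
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 4096) {
            sb.append("the quick brown fox jumps over the lazy dog ");
        }
        byte[] original = sb.toString().getBytes("UTF-8");

        long t0 = System.nanoTime();
        byte[] packed = compress(original);
        byte[] unpacked = decompress(packed);
        long roundtripMicros = (System.nanoTime() - t0) / 1000;

        System.out.println("original=" + original.length
            + " compressed=" + packed.length
            + " roundtrip_us=" + roundtripMicros
            + " lossless=" + java.util.Arrays.equals(original, unpacked));
    }
}
```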

The problem with "after the read" decompression is that you still incur the
overhead of decompression each time the file block is accessed, since the OS
only caches the compressed on-disk block (unless Lucene adds caching to the
index read operations). The disk IO time, however, is almost always eliminated
if the index reader frequently accesses the same file blocks, since the OS
caches the data block.

If Lucene is IO bound, then increasing the OS cache helps, but compression
will limit the throughput gains, because CPU cycles are now being spent
uncompressing the blocks.
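If Lucene itself were to cache decompressed blocks, as hinted above, the repeat-access cost could be avoided. A minimal sketch of such a cache (the BlockCache name, the offset key, and the LinkedHashMap-based LRU policy are all my own illustrative choices, not anything in Lucene):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache of decompressed index blocks, keyed by file offset.
public class BlockCache {
    final int maxBlocks;
    private final LinkedHashMap<Long, byte[]> cache;

    public BlockCache(int maxBlocks) {
        this.maxBlocks = maxBlocks;
        // accessOrder=true makes LinkedHashMap evict in LRU order.
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > BlockCache.this.maxBlocks;
            }
        };
    }

    // Decompress at most once while a block stays cached; repeat hits
    // cost only a map lookup, not CPU time for inflation.
    public synchronized byte[] get(long offset, BlockLoader loader) {
        byte[] block = cache.get(offset);
        if (block == null) {
            block = loader.loadAndDecompress(offset);
            cache.put(offset, block);
        }
        return block;
    }

    public interface BlockLoader {
        byte[] loadAndDecompress(long offset);
    }
}
```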

With high enough limits on physical memory and disk space, I believe
compression will have negative effects on overall performance, but again,
this is going to depend heavily on the environment (# of CPUs, physical
memory, memory architecture, disk speed, etc.). Given the boundary condition
where the entire index is loaded into physical memory (I think I read
somewhere recently that this is the current scheme that Google uses),
compression will have a negative impact on performance; as the
memory-to-index-size ratio lowers, compression will probably help overall
performance.

... thus my request that any compression support be optional.
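If compression support were made optional, one hypothetical shape for it at the stored-field level would be to try compressing and keep the result only when it actually wins, recording a flag byte so reads know how to decode. StoredFieldWriter and every name below are my own illustrative sketch, not the actual patch under discussion:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch: compress stored field bytes only when asked to,
// and only keep the compressed form when it is actually smaller.
public class StoredFieldWriter {
    static final byte RAW = 0, DEFLATED = 1;

    // Returns a one-byte flag followed by the raw or deflated payload.
    static byte[] encode(byte[] value, boolean tryCompress) throws Exception {
        if (tryCompress) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            bos.write(DEFLATED); // flag byte written before the deflated body
            try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
                dos.write(value);
            }
            byte[] packed = bos.toByteArray();
            if (packed.length < value.length + 1) {
                return packed; // compression paid off
            }
        }
        byte[] raw = new byte[value.length + 1];
        raw[0] = RAW;
        System.arraycopy(value, 0, raw, 1, value.length);
        return raw;
    }

    static byte[] decode(byte[] stored) throws Exception {
        byte[] payload =
            java.util.Arrays.copyOfRange(stored, 1, stored.length);
        if (stored[0] == RAW) return payload;
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (InflaterInputStream iis =
                 new InflaterInputStream(new ByteArrayInputStream(payload))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = iis.read(buf)) > 0) bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }
}
```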

-----Original Message-----
From: David Spencer []
Sent: Monday, August 30, 2004 5:33 PM
To: Lucene Developers List
Subject: Re: Binary fields and data compression

Robert Engels wrote:

> The data size savings is almost certainly not worth the probable 20-40%
> increase in CPU usage in most cases, no?
> I think it should be optional, for those who have extremely large indices
> and want to save some space (seems not necessary these days), and those who
> want to maximize performance.

You don't know until you benchmark it, but I thought the heuristic
nowadays was that CPUs are fast and disk I/O is slow (and yes, disk
space is 'infinite' :) ) - so my guess is that in spite of the CPU cost
of compression, you'd save time overall due to less disk I/O.
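A quick way to put numbers on the benchmark question is to compare zlib levels on a stored-field-sized input. This harness is my own rough sketch; the levels, input text, and sizes are arbitrary choices for illustration:

```java
import java.util.zip.Deflater;

// Rough benchmark sketch: compare zlib compression levels on a small,
// stored-field-sized input and report output size and time per call.
public class DeflateLevels {
    // Compress with an explicit level; only the output size matters here,
    // so the scratch buffer contents are discarded.
    static int compressedSize(byte[] input, int level) {
        Deflater def = new Deflater(level);
        def.setInput(input);
        def.finish();
        byte[] scratch = new byte[8192];
        int total = 0;
        while (!def.finished()) {
            total += def.deflate(scratch);
        }
        def.end();
        return total;
    }

    public static void main(String[] args) {
        // About 4KB of repetitive text, in the size range discussed below.
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 4096) {
            sb.append("lucene stores field values; how much does deflate help? ");
        }
        byte[] input = sb.toString().getBytes();

        for (int level : new int[] {1, 6, 9}) {
            long t0 = System.nanoTime();
            int size = compressedSize(input, level);
            long micros = (System.nanoTime() - t0) / 1000;
            System.out.println("level=" + level
                + " bytes=" + size + " time_us=" + micros);
        }
    }
}
```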

> -----Original Message-----
> From: Bernhard Messer []
> Sent: Monday, August 30, 2004 4:41 PM
> To:
> Subject: Binary fields and data compression
> hi developers,
>
> a few months ago, there was a very interesting discussion about field
> compression and the possibility to store binary field values within a
> lucene document. Regarding this topic, Drew Farris came up with a
> patch to add the necessary functionality. I ran all the necessary tests
> on his implementation and didn't find a single problem. So the original
> implementation from Drew could now be enhanced to compress the binary
> field data (maybe even the text fields, if they are stored only) before
> writing to disc. I made some simple statistical measurements using the
> package for data compression. Enabling it, we could save
> about 40% of the data when compressing plain text files with a size from
> 1KB to 4KB. If there is still some interest, we could first try to update
> the patch, because it's outdated due to several changes within the Fields
> class. After finishing that, compression could be added to the updated
> version of the patch.
>
> sounds good to me, what do you think ?
>
> best regards
> Bernhard
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
> ---------------------------------------------------------------------