ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Data compression in Ignite
Date Wed, 09 Aug 2017 14:48:12 GMT

I had several private talks with Igniters about data compression and would
like to share the summary with ... Igniters :-)

Currently, all of Ignite's data is stored uncompressed. This leads to
excessive network traffic, GC pressure and disk IO (when persistence is
enabled). Most modern databases are able to compress data, which gives them
a 2-4x size reduction on typical workloads. We need compression in Ignite.

There are several options I'd like to discuss. The main difference between
them is the "level" at which data is compressed: per-entry, per-data-page
or per-cache.

*1) Per-entry compression*
Apache Geode uses this approach. Every cache entry is compressed using
Snappy. This is very easy to implement, but every entry access (e.g.
reading a single field) requires full decompression or even re-compression,
which could lead to higher CPU consumption and worse performance.
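
To make the cost concrete, here is a minimal sketch of per-entry
compression. It is not Geode's or Ignite's code: the class and method names
are hypothetical, entries are serialized as JSON for simplicity, and zlib
from the Python standard library stands in for Snappy. The counters show
that a single-field read costs a full decompression and a single-field
update costs a decompression plus a re-compression.

```python
import json
import zlib


class PerEntryCompressedCache:
    """Toy cache that compresses every entry independently.

    Assumption: zlib stands in for Snappy, JSON for the real binary format.
    """

    def __init__(self):
        self._store = {}          # key -> compressed bytes
        self.compressions = 0     # how many times an entry was compressed
        self.decompressions = 0   # how many times an entry was decompressed

    def put(self, key, entry):
        raw = json.dumps(entry, sort_keys=True).encode("utf-8")
        self._store[key] = zlib.compress(raw)
        self.compressions += 1

    def get(self, key):
        raw = zlib.decompress(self._store[key])
        self.decompressions += 1
        return json.loads(raw)

    def get_field(self, key, field):
        # Reading one field still decompresses the whole entry.
        return self.get(key)[field]

    def set_field(self, key, field, value):
        # Updating one field decompresses and then re-compresses the entry.
        entry = self.get(key)
        entry[field] = value
        self.put(key, entry)


cache = PerEntryCompressedCache()
cache.put(1, {"name": "Alice", "city": "Berlin", "age": 30})
name = cache.get_field(1, "name")   # one full decompression
cache.set_field(1, "age", 31)       # one decompression + one compression
```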

*2) Per-data-page compression*
Oracle and DB2 use this approach. Pages are compressed with a
dictionary-based approach (e.g. LZV). Importantly, they do not compress the
whole page. Instead, only the actual data is compressed, while the page
structure remains intact. The dictionary is placed within the page. This
way it is possible to work with individual entries and even individual
fields without full page decompression. Another important point: it is not
necessary to re-compress the page on each write. Instead, data is first
stored in uncompressed form and compressed only after a certain threshold
is reached, so the negative CPU impact is minimal. The typical compression
ratio would be higher than in the per-entry case, because the more data you
have, the better it can be compressed.
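
The mechanics can be sketched roughly as follows. This is an illustration
under stated assumptions, not how Oracle, DB2 or Ignite lay out pages: the
DataPage class, its slot list and the threshold of 4 entries are all
hypothetical, and zlib's preset-dictionary feature (the zdict argument)
stands in for a real page dictionary. For simplicity the dictionary is just
the concatenation of the first batch of entries and is never rebuilt; a
real implementation would keep a compact dictionary of frequent values.

```python
import zlib


class DataPage:
    """Toy data page: the slot directory stays intact, only payloads are compressed."""

    def __init__(self, compress_threshold=4):
        self.compress_threshold = compress_threshold
        self.slots = []        # page structure: one (is_compressed, payload) per entry
        self.dictionary = b""  # page-local dictionary, empty until first compression

    def add(self, payload):
        """Store the entry uncompressed; compress pending entries at the threshold."""
        self.slots.append((False, payload))
        pending = sum(1 for is_compressed, _ in self.slots if not is_compressed)
        if pending >= self.compress_threshold:
            self._compress_pending()
        return len(self.slots) - 1

    def _compress_pending(self):
        if not self.dictionary:
            # Assumption: the dictionary is built once, from the first batch.
            self.dictionary = b"".join(p for c, p in self.slots if not c)
        for i, (is_compressed, payload) in enumerate(self.slots):
            if not is_compressed:
                comp = zlib.compressobj(zdict=self.dictionary)
                self.slots[i] = (True, comp.compress(payload) + comp.flush())

    def read(self, slot):
        """Read one entry without touching any other entry on the page."""
        is_compressed, payload = self.slots[slot]
        if not is_compressed:
            return payload
        decomp = zlib.decompressobj(zdict=self.dictionary)
        return decomp.decompress(payload) + decomp.flush()


page = DataPage(compress_threshold=4)
rows = [
    b'{"id": %d, "city": "Berlin", "status": "ACTIVE", "plan": "premium"}' % i
    for i in range(8)
]
for row in rows:
    page.add(row)
```

Writes below the threshold cost no compression work at all, and read()
decompresses only the requested slot. Entries in the second batch are
compressed against a dictionary that already contains similar rows, which
is where the better ratio comes from.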

*3) Per-cache compression*
Suggested by Alex Goncharuk. We could have a dictionary for the whole
cache. This way we could achieve the highest compression ratio possible. The
downside is a complex implementation: we would have to develop an algorithm
for sharing the dictionary within the cluster. At some point the dictionary
could become too large to fit in memory, so we should either control its
size or spill it to disk.
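
One possible reading of a cache-wide dictionary, sketched on a single node
only: repeated field values are replaced by small integer codes, and the
dictionary size is capped. The CacheDictionary class and its max_size
parameter are hypothetical, and the sketch deliberately leaves out the hard
parts named above (sharing the dictionary across the cluster and spilling
it to disk). Once the cap is reached, new values are simply stored raw.

```python
class CacheDictionary:
    """Toy cache-wide value dictionary with a hard size cap."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._code_by_value = {}   # value -> integer code
        self._value_by_code = []   # integer code -> value

    def encode(self, value):
        """Return ("code", n) for dictionary values, ("raw", value) otherwise."""
        code = self._code_by_value.get(value)
        if code is None:
            if len(self._value_by_code) >= self.max_size:
                # Dictionary is full: keep the value uncompressed.
                return ("raw", value)
            code = len(self._value_by_code)
            self._code_by_value[value] = code
            self._value_by_code.append(value)
        return ("code", code)

    def decode(self, token):
        kind, payload = token
        return self._value_by_code[payload] if kind == "code" else payload


dictionary = CacheDictionary(max_size=2)
t1 = dictionary.encode("Berlin")
t2 = dictionary.encode("Paris")
t3 = dictionary.encode("Berlin")  # reuses the code assigned to t1
t4 = dictionary.encode("Tokyo")   # dictionary is full, stored raw
```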

I propose to use the per-data-page approach, as it both gives a good
compression ratio and is relatively easy to implement.

Please share your thoughts.

