ignite-dev mailing list archives

From Andrey Kornev <andrewkor...@hotmail.com>
Subject Re: Data compression in Ignite 2.0
Date Wed, 27 Jul 2016 04:53:10 GMT
Dictionary compression requires some knowledge about the data being compressed. For
example, for numeric types the range of values must be known so that the dictionary can
be generated. For strings, the number of unique values in the column is the key piece of
input into the dictionary generation.
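To illustrate the string case (a minimal sketch in plain Java, not HANA or Ignite code,
with made-up names): once the distinct values of a column are known, each value is
replaced by a small integer code, and only the codes need to be stored per row.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Minimal sketch of dictionary encoding for a string column. */
    public class StringDictionary {
        private final Map<String, Integer> codes = new HashMap<>();
        private final List<String> values = new ArrayList<>();

        /** Returns the integer code for a value, assigning a new code on first sight. */
        public int encode(String val) {
            Integer code = codes.get(val);
            if (code == null) {
                code = values.size();
                codes.put(val, code);
                values.add(val);
            }
            return code;
        }

        /** Restores the original value from its code. */
        public String decode(int code) {
            return values.get(code);
        }
    }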
SAP HANA is a column-based database system: it stores the fields of the data tuple individually
using the best compression for the given data type and the particular set of values. HANA
has been specifically built as a general purpose database, rather than as an afterthought
layer on top of an already existing distributed cache.
On the other hand, Ignite is a distributed cache implementation (a pretty good one!) that
in general requires no schema and stores its data in a row-based fashion. Its current design
doesn't lend itself readily to the kind of optimizations HANA provides out of the box.
For the curious types among us, the implementation details of HANA are well documented in "In-Memory
Data Management" by Hasso Plattner & Alexander Zeier.
Cheers
Andrey
_____________________________
From: Alexey Kuznetsov <akuznetsov@gridgain.com>
Sent: Tuesday, July 26, 2016 5:36 AM
Subject: Re: Data compression in Ignite 2.0
To: <dev@ignite.apache.org>


Sergey Kozlov wrote:
>> For approach 1: Putting a large object into a partitioned cache will
force an update of the dictionary stored in a replicated cache. It may be a
time-expensive operation.
The dictionary will be built only once. And we could control what should be
put into the dictionary, for example, we could check the min and max value
size and decide whether to put the value into the dictionary or not.
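Something along these lines, just a sketch with hypothetical thresholds and names:

    /** Sketch only: hypothetical size bounds for dictionary-worthy values. */
    static final int MIN_DICT_VALUE_SIZE = 16;    // smaller values are not worth an entry
    static final int MAX_DICT_VALUE_SIZE = 4096;  // larger values would bloat the replicated dictionary

    /** Decides whether a serialized value should be added to the shared dictionary. */
    static boolean isDictionaryCandidate(byte[] serialized) {
        return serialized.length >= MIN_DICT_VALUE_SIZE
            && serialized.length <= MAX_DICT_VALUE_SIZE;
    }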

>> Approaches 2-3 make sense only for rare cases, as Sergi commented.
But it is better to at least have the possibility to plug in user code for
compression than not to have it at all.

>> Also I see a danger of OOM if we've got a high compression ratio and try
to restore the original value in memory.
We could easily get an OOM with many other operations right now, even
without compression. I think it is not an issue; we could add a NOTE to the
documentation about such a possibility.

Andrey Kornev wrote:
>> ... in general I think compression is a great idea. The cleanest way to
achieve that would be to just make it possible to chain the marshallers...
I think it is also a good idea. It looks like it could be used for
compression with some sort of ZIP algorithm, but how do we deal with
compression by dictionary substitution?
We need to build the dictionary first. Any ideas?
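At least the ZIP part is easy to sketch. Something like the following, written against a
simplified marshaller interface of my own rather than Ignite's actual Marshaller SPI: a
wrapping marshaller simply GZIPs whatever bytes its delegate produces.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    /** Simplified stand-in for a marshaller SPI (not Ignite's actual interface). */
    interface SimpleMarshaller {
        byte[] marshal(Object obj) throws IOException;
        Object unmarshal(byte[] bytes) throws IOException;
    }

    /** Chains on top of another marshaller and GZIPs its output. */
    class CompressingMarshaller implements SimpleMarshaller {
        private final SimpleMarshaller delegate;

        CompressingMarshaller(SimpleMarshaller delegate) {
            this.delegate = delegate;
        }

        @Override public byte[] marshal(Object obj) throws IOException {
            byte[] raw = delegate.marshal(obj);
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(buf)) {
                gzip.write(raw);
            }
            return buf.toByteArray();
        }

        @Override public Object unmarshal(byte[] bytes) throws IOException {
            try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = gzip.read(chunk)) > 0)
                    out.write(chunk, 0, n);
                return delegate.unmarshal(out.toByteArray());
            }
        }
    }

The dictionary case is harder precisely because, unlike GZIP, the wrapper cannot work on a
single value in isolation: the dictionary has to exist before codes can be substituted for
values.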

Nikita Ivanov wrote:
>> SAP HANA does the compression by 1) compressing SQL parameters before
execution...
Looks interesting, but my initial point was about compression of cache
data, not SQL queries.
My idea was to make compression transparent to the SQL engine when it looks
up data.

But the idea of compressing SQL query results looks very interesting,
because it is a known fact that the SQL engine can consume quite a lot of
heap for storing result sets.
I think this should be discussed in a separate thread.

Just for your information, in my first message I mentioned that DB2 has
compression by dictionary and, according to them, it is possible to
compress typical data by 50-80%.
I have some experience with DB2 and can confirm this.

--
Alexey Kuznetsov


