kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Kudu Encoding and configs to improve scan
Date Tue, 11 Oct 2016 01:58:45 GMT

Hey Amit,

Some responses below:

On Mon, Oct 10, 2016 at 5:27 AM, Amit Adhau <amit.adhau@globant.com> wrote:

> Hi Kudu Team,
> I was testing Dictionary and Prefix encoding on a Kudu table.
> To do so, I created two tables with the same structure and the same data,
> and inserted 1 billion records into both tables, with an average record
> size of close to 1 KB.
> I observed the following:
> On-disk storage: I found a substantial size difference between the table
> with encoded columns and the table without, as the encoded table took
> much less space than the non-encoded one.

Yes, that's expected -- one of the most important purposes of encodings is
to reduce data size on disk.

> Scan performance: I found that queries against the table with encoded
> columns always took less time than the same queries against the
> non-encoded table.
> Can you please help me with the questions below?
> 1. Scans on encoded columns take less time. Is this expected behavior?

It's often the case, especially if the data is large enough that it doesn't
fit in cache. There are some cases where it's not faster, though. For
example, if you use bitshuffle encoding on integers, and the column is
small enough to be fully cached, it would be faster to scan unencoded
integers than encoded ones. That balance changes, though, if the data no
longer fits in RAM, since the reduced IO cost (due to the encoding) offsets
the increased CPU cost (due to having to decode in order to service the
query).
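
As a rough illustration of that balance, here is a toy cost model in
Python. Every number in it (disk bandwidth, bytes per value, per-value CPU
cost) is a made-up assumption chosen for illustration, not a Kudu
measurement, and the model ignores everything except raw IO and decode time.

```python
def scan_seconds(size_bytes, cpu_ns_per_value, num_values, cached,
                 disk_bytes_per_sec=200e6):
    """Toy estimate: IO time (zero if the data is cached) plus CPU time."""
    io = 0.0 if cached else size_bytes / disk_bytes_per_sec
    cpu = num_values * cpu_ns_per_value * 1e-9
    return io + cpu


NUM_VALUES = 100_000_000

# Hypothetical column profiles: the encoded column is 4x smaller on disk
# but costs 3x more CPU per value to decode.
plain = dict(size_bytes=NUM_VALUES * 8, cpu_ns_per_value=1.0)
encoded = dict(size_bytes=NUM_VALUES * 2, cpu_ns_per_value=3.0)

for cached in (True, False):
    t_plain = scan_seconds(num_values=NUM_VALUES, cached=cached, **plain)
    t_encoded = scan_seconds(num_values=NUM_VALUES, cached=cached, **encoded)
    print(f"cached={cached}: plain={t_plain:.2f}s encoded={t_encoded:.2f}s")
```

With these assumed numbers the unencoded column wins when everything is
cached, and the encoded column wins once the scan has to go to disk.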

With dictionary compression of strings, however, it should basically always
be the case that it's beneficial. This is especially true if you have
predicates on the encoded columns ('WHERE' clauses in SQL terminology), and
especially after v1.0 in which there were some optimizations in this area.
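
To give an intuition for why predicates help here, below is a minimal
Python sketch of dictionary encoding with an equality predicate evaluated
against the integer codes. It is only an illustration of the general
technique, not Kudu's actual implementation; the function names
(dict_encode, scan_equals) and the sample data are hypothetical.

```python
def dict_encode(values):
    """Return (dictionary, codes): each distinct string is stored once."""
    dictionary = []   # code -> string
    index = {}        # string -> code
    codes = []
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return dictionary, codes


def scan_equals(dictionary, codes, wanted):
    """Return the row offsets where the column equals `wanted`."""
    try:
        wanted_code = dictionary.index(wanted)  # one string lookup per block
    except ValueError:
        return []  # value is not in the dictionary, so no row can match
    # The per-row work is an integer comparison, not a string comparison.
    return [row for row, code in enumerate(codes) if code == wanted_code]


column = ["US", "IN", "US", "DE", "IN", "US"]
dictionary, codes = dict_encode(column)
matches = scan_equals(dictionary, codes, "IN")
print(dictionary, codes, matches)
```

The predicate is resolved to a code once, and a block whose dictionary
lacks the value can be skipped without looking at any rows.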

> 2. Just to confirm: in the case of a composite primary key, my
> understanding is that it can be helpful to use prefix encoding on the
> first column or first few columns, where values are likely to repeat. Or
> perhaps a column like the webpage URL in clickstream logs could use prefix
> encoding.

For the case of string columns at the beginning of a composite key, you're
right that prefix encoding is often a good choice. Note that internally
Kudu synthesizes a "composite key" column (not exposed to the user) which
concatenates your PK columns, and that _always_ uses PREFIX encoding,
regardless of what you've selected for the columns themselves.
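
For intuition on why prefix encoding suits sorted keys with a shared
leading column, here is a small Python sketch of the general technique:
each key is stored as the length of the prefix it shares with the previous
key plus the remaining suffix. This is a simplification, not Kudu's on-disk
format; the "|" separator, the sample keys, and the function names are
hypothetical, and plain strings stand in for the binary composite key.

```python
import os


def prefix_encode(sorted_keys):
    """Encode each key as (shared_prefix_len, suffix) vs. the previous key."""
    encoded = []
    prev = ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([prev, key]))
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the original keys from (shared_prefix_len, suffix) pairs."""
    keys = []
    prev = ""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


keys = sorted([
    "example.com/home|2016-10-10",
    "example.com/home|2016-10-11",
    "example.com/products|2016-10-10",
])
encoded = prefix_encode(keys)
for entry in encoded:
    print(entry)
```

Because the keys are sorted, neighbors tend to share long prefixes, so most
entries store only a short suffix.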

> 3. As per the release note for Dictionary encoding:
> "If the column values of a given row set are unable to be compressed
> because the number of unique values is too high, Kudu will transparently
> fall back to plain encoding for that row set"
> Is there any way to find the probable upper limit on the number of unique
> values that dictionary encoding can handle? And when it falls back to
> plain encoding as stated, does that apply only to the records inserted
> after the limit is exceeded, or will Kudu automatically convert all values
> (including existing ones) of the dictionary-encoded column to plain
> encoding? Will there be any impact at the functional level?

This is all fully automatic, and the choice of encoding happens at a small
block level, not at the entire table level. So even if you have a very
large number of unique values globally across the table, if "nearby" rows
(i.e., within a few MB of each other) have a low number of distinct
elements, you will benefit from dictionary encoding.
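
The per-block decision can be sketched in Python as follows. This is only
an illustration of the idea: the MAX_DICT_ENTRIES limit, the block size,
and the function names are hypothetical and do not reflect Kudu's real
thresholds or code.

```python
MAX_DICT_ENTRIES = 4  # hypothetical per-block limit, not Kudu's real value


def choose_encoding(block):
    """Pick dictionary encoding only if the block has few distinct values."""
    return "DICT" if len(set(block)) <= MAX_DICT_ENTRIES else "PLAIN"


def encode_blocks(values, block_size):
    """Split the column into blocks and choose an encoding for each one."""
    choices = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        choices.append(choose_encoding(block))
    return choices


# The column has 12 distinct values overall, more than the limit, but only
# the middle block is high-cardinality on its own.
values = (
    ["a", "b"] * 4
    + [f"id-{i}" for i in range(8)]
    + ["c", "d"] * 4
)
choices = encode_blocks(values, block_size=8)
print(choices)
```

Only the middle block falls back to plain encoding; the blocks before and
after it are unaffected, and nothing already written is rewritten.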

Dictionary compression is so often the correct choice for strings that I've
been thinking we should probably make it the default :)

> 4. Since gflags like --cfile_do_on_finish=flush and --flush_threshold_mb
> are defaults in the latest versions, are there any other tuning flags or
> configs that can help improve insert performance?
> Also, at the scan level, we are using the ScanToken API and hash
> partitions, but scan performance still seems slow. Can you please
> suggest anything else that can be done at the configuration or
> implementation level to improve scan performance?

For inserts, there aren't any flags I can recommend that wouldn't have
negative consequences. However, it's worth noting that the upcoming 1.1
release will have a few optimizations on the write side that might increase
your throughput substantially, especially if you're using Impala to drive
the inserts.

On the read path, the most important thing is to make sure you have enough
partitions per node to get proper parallelism on the reads. But, there are
a lot of factors. Can you quantify what you mean by "slow", and
particularly what your point of reference is? Maybe share some sample
queries and dataset characteristics?

Todd Lipcon
Software Engineer, Cloudera
