kudu-user mailing list archives

From Amit Adhau <amit.ad...@globant.com>
Subject Kudu Encoding and configs to improve scan
Date Mon, 10 Oct 2016 12:27:06 GMT
Hi Kudu Team,

I have been testing Dictionary and Prefix Encoding on Kudu tables.
To do so, I created two tables with the same structure and the same data,
and inserted 1 billion records into each, with an average record size close
to 1 KB.
I observed the following:
On-disk storage - The table with encoded columns took substantially less
space than the table with non-encoded columns.
Scan performance - Queries against the table with encoded columns always
took less time than the same queries against the table with non-encoded
columns.

Can you please help me with the questions below?

1. Scans on encoded columns take less time. Is this expected behavior?

2. Just to confirm: for a composite primary key, my understanding is that
prefix encoding can help on the first column, or the first few columns,
where values are likely to repeat. Similarly, a column such as the webpage
URL in clickstream logs could use prefix encoding. Is that correct?
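
To make the intuition behind question 2 concrete, here is a toy sketch of
prefix encoding on sorted URL-like strings. It is only an illustration of
the general technique, not Kudu's actual on-disk format; the function names
and the example.com URLs are made up for this example.

```python
import os


def prefix_encode(sorted_values):
    """Store each string as (shared_prefix_length, suffix) relative to the previous string."""
    encoded = []
    prev = ""
    for value in sorted_values:
        # Length of the leading characters this value shares with its predecessor.
        shared = len(os.path.commonprefix([prev, value]))
        encoded.append((shared, value[shared:]))
        prev = value
    return encoded


def prefix_decode(encoded):
    """Rebuild the original strings from (shared_prefix_length, suffix) pairs."""
    values = []
    prev = ""
    for shared, suffix in encoded:
        value = prev[:shared] + suffix
        values.append(value)
        prev = value
    return values


# Hypothetical clickstream URLs; sorting puts similar values next to each other.
urls = sorted([
    "http://example.com/products/1",
    "http://example.com/products/2",
    "http://example.com/about",
    "http://example.com/products/10",
])

encoded = prefix_encode(urls)
raw_chars = sum(len(u) for u in urls)
stored_chars = sum(len(suffix) for _, suffix in encoded)

print(encoded)
print(f"raw characters: {raw_chars}, suffix characters stored: {stored_chars}")
```

The saving in this sketch comes entirely from neighbouring values sharing a
long prefix, which is why sorted key columns and URL-like columns look like
natural candidates.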

3. The release notes say the following about dictionary encoding:
"If the column values of a given row set are unable to be compressed
because the number of unique values is too high, Kudu will transparently
fall back to plain encoding for that row set"
Is there a way to find the approximate upper limit on the number of unique
values that dictionary encoding can handle? Once that limit is exceeded and
Kudu falls back to plain encoding, as stated, does the fallback apply only
to the records inserted after the limit was exceeded, or does Kudu
automatically convert all values (including existing ones) in the
dictionary-encoded column to plain encoding? Is there any functional impact?
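
As a rough mental model of the quoted note, the toy sketch below
dictionary-encodes each row set independently and falls back to plain
storage for a row set with too many distinct values. The threshold
MAX_DICT_ENTRIES and both function names are hypothetical; Kudu's real
limit and storage layout are not shown here and may work differently.

```python
def encode_rowset(values, max_dict_entries):
    """Dictionary-encode one row set, or fall back to plain if it has too many distinct values."""
    dictionary = {}
    codes = []
    for value in values:
        if value not in dictionary:
            if len(dictionary) >= max_dict_entries:
                # Too many distinct values: store this row set as-is.
                return ("plain", list(values))
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    # Dictionary words ordered by the integer code assigned to them.
    words = sorted(dictionary, key=dictionary.get)
    return ("dict", words, codes)


def decode_rowset(encoded):
    """Return the original values regardless of which encoding was used."""
    if encoded[0] == "plain":
        return list(encoded[1])
    _, words, codes = encoded
    return [words[code] for code in codes]


MAX_DICT_ENTRIES = 4  # hypothetical threshold, NOT Kudu's actual limit

low_cardinality = ["IN", "US", "IN", "AR", "US", "IN"]
high_cardinality = [f"user-{i}" for i in range(10)]

# Each row set is encoded on its own; one falling back does not affect the other.
rowsets = [encode_rowset(rs, MAX_DICT_ENTRIES)
           for rs in (low_cardinality, high_cardinality)]

for encoded in rowsets:
    print(encoded[0], decode_rowset(encoded))
```

In this sketch the reader gets the same values back from either encoding, so
the fallback only changes storage size, not query results. Whether Kudu
behaves the same way is exactly what question 3 asks.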

4. Settings such as --cfile_do_on_finish=flush and --flush_threshold_mb are
already the defaults in the latest versions. Are there any other tuning
flags or configs that can help improve insert performance?
Also, for scans we are using the ScanToken API and hash partitions, but scan
performance still seems slow. Can you please suggest anything else at the
configuration level or implementation level to improve scan performance?

Thanks & Regards,

*Amit Adhau* | Data Architect

*GLOBANT* | IND:+91 9821518132
