hive-user mailing list archives

From Avdeev V. M.
Subject table design and performance questions
Date Mon, 28 May 2012 05:17:07 GMT
A question from a novice.

Where can I read about table design best practices? I have a measure table with millions of rows
and many dimension tables with fewer than 1000 rows each. I can't work out how to get an optimal
design for both kinds of tables. Are there any performance tuning guides or a performance FAQ?

1) The PARTITIONED BY, CLUSTERED BY and SORTED BY clauses. In which cases does using these clauses
make sense?
2) The DDL language manual says 'This can improve performance on certain kinds of queries.' about
the CLUSTERED BY clause. What kinds of queries can be improved?
3) Which is preferable in terms of performance: SEQUENCEFILE, RCFILE or TEXTFILE? What aspects
should be taken into account when choosing a file format?
4) The Compressed storage article says 'Keeping data compressed in Hive tables has, in some cases,
been known to give better performance than uncompressed storage;' and again, what are these cases?
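
To make the questions concrete, here is a rough sketch of the kind of DDL I am asking about. The table name, column names, partition column and bucket count are all made up for illustration; this only shows where the clauses from questions 1 to 3 go, and I am not claiming it is a good design.

```sql
-- Hypothetical measure table; every name and the bucket count are assumptions.
CREATE TABLE measures (
  dim_id     INT,     -- key referencing one of the small dimension tables
  event_time STRING,
  value      DOUBLE
)
PARTITIONED BY (dt STRING)            -- question 1: one partition per value of dt
CLUSTERED BY (dim_id)                 -- questions 1 and 2: rows bucketed by dim_id
  SORTED BY (dim_id ASC)              -- question 1: rows sorted within each bucket
  INTO 32 BUCKETS
STORED AS RCFILE;                     -- question 3: or SEQUENCEFILE / TEXTFILE
```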
