kudu-commits mailing list archives

From danburk...@apache.org
Subject [2/2] kudu git commit: [docs] Added recommendation to compress PK for backfill inserts
Date Mon, 05 Feb 2018 18:27:59 GMT
[docs] Added recommendation to compress PK for backfill inserts

Change-Id: I698d954265b4171e4d1bb7e01e286d0d489f1ec7
Reviewed-on: http://gerrit.cloudera.org:8080/9185
Reviewed-by: Dan Burkert <dan@cloudera.com>
Tested-by: Dan Burkert <dan@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/c19a4217
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/c19a4217
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/c19a4217

Branch: refs/heads/master
Commit: c19a42170576699d6f346faa716ae6f2fa665a97
Parents: 2c89bd7
Author: Alex Rodoni <arodoni@cloudera.com>
Authored: Thu Feb 1 16:16:45 2018 -0800
Committer: Dan Burkert <dan@cloudera.com>
Committed: Mon Feb 5 18:27:33 2018 +0000

----------------------------------------------------------------------
 docs/schema_design.adoc | 84 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 65 insertions(+), 19 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/c19a4217/docs/schema_design.adoc
----------------------------------------------------------------------
diff --git a/docs/schema_design.adoc b/docs/schema_design.adoc
index c5407a6..7f0e218 100644
--- a/docs/schema_design.adoc
+++ b/docs/schema_design.adoc
@@ -155,34 +155,80 @@ recommended to apply additional compression on top of this encoding.
 [[primary-keys]]
 == Primary Key Design
 
-Every Kudu table must declare a primary key index comprised of one or more
-columns. Primary key columns must be non-nullable, and may not be a boolean or
-floating-point type. Once set during table creation, the set of columns in the
-primary key may not be altered. Like an RDBMS primary key, the Kudu primary key
-enforces a uniqueness constraint; attempting to insert a row with the same
-primary key values as an existing row will result in a duplicate key error.
-
-Unlike an RDBMS, Kudu does not provide an auto-incrementing column feature, so
-the application must always provide the full primary key during insert. Row
-delete and update operations must also specify the full primary key of the row
-to be changed; Kudu does not natively support range deletes or updates. The
-primary key values of a column may not be updated after the row is inserted;
-however, the row may be deleted and re-inserted with the updated value.
+Every Kudu table must declare a primary key comprised of one or more columns.
+Like an RDBMS primary key, the Kudu primary key enforces a uniqueness constraint.
+Attempting to insert a row with the same primary key values as an existing row
+will result in a duplicate key error.
+
+Primary key columns must be non-nullable, and may not be a boolean or
+floating-point type.
+
+Once set during table creation, the set of columns in the primary key may not
+be altered.
+
+Unlike an RDBMS, Kudu does not provide an auto-incrementing column feature,
+so the application must always provide the full primary key during insert.
+
+Row delete and update operations must also specify the full primary key of the
+row to be changed. Kudu does not natively support range deletes or updates.
+
+The primary key values of a column may not be updated after the row is inserted.
+However, the row may be deleted and re-inserted with the updated value.
+
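The primary key rules above can be sketched with a toy in-memory model. This is an illustration of the semantics only, not Kudu code; `ToyTable` and its methods are invented for this example.

```python
# Toy model (NOT Kudu code) of the primary key rules described above:
# inserts enforce uniqueness, deletes need the full key, and a key
# "change" is really a delete followed by a re-insert.
class ToyTable:
    def __init__(self):
        self._rows = {}  # full primary key tuple -> row payload

    def insert(self, pk, row):
        if pk in self._rows:
            raise KeyError(f"duplicate key: {pk!r}")
        self._rows[pk] = row

    def delete(self, pk):
        # The full primary key is required; no range deletes.
        del self._rows[pk]

    def change_key(self, old_pk, new_pk):
        # Primary key values cannot be updated in place:
        # delete the row, then re-insert it under the new key.
        row = self._rows[old_pk]
        self.delete(old_pk)
        self.insert(new_pk, row)

t = ToyTable()
t.insert(("host1", 1000), {"cpu": 0.5})
t.change_key(("host1", 1000), ("host1", 2000))
```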
 
 [[indexing]]
 === Primary Key Index
 
-As with many traditional relational databases, Kudu's primary key is a clustered
-index. All rows within a tablet are kept in primary key sorted order. Kudu scans
-which specify equality or range constraints on the primary key will
-automatically skip rows which can not satisfy the predicate. This allows
-individual rows to be efficiently found by specifying equality constraints on
-the primary key columns.
+As with many traditional relational databases, Kudu's primary key is a
+clustered index. All rows within a tablet are kept in primary key sorted
+order.
+
+When scanning Kudu rows, use equality or range predicates on primary key
+columns to find the rows efficiently.
 
 NOTE: Primary key indexing optimizations apply to scans on individual tablets.
 See the <<partition-pruning>> section for details on how scans can use
 predicates to skip entire tablets.
 
+[[Backfilling]]
+=== Considerations for Backfill Inserts
+
+This section discusses a primary key design consideration for timeseries use
+cases where the primary key is a timestamp, or where the first column of the
+primary key is a timestamp.
+
+Each time a row is inserted into a Kudu table, Kudu looks up the primary key in
+the primary key index storage to check whether that primary key is already
+present in the table. If the primary key exists in the table, a "duplicate key"
+error is returned. In the typical case where data is being inserted at
+the current time as it arrives from the data source, only a small range of
+primary keys are "hot". So, each of these "check for presence" operations is
+very fast. It hits the cached primary key storage in memory and doesn't require
+going to disk.
+
+When you load historical data from an offline data source, a process called
+"backfilling", each inserted row is likely to hit a cold area of the primary
+key index that is not resident in memory, causing one or more HDD disk seeks.
+For example, while Kudu might sustain a few million inserts per second in the
+normal ingestion case, the "backfill" use case might sustain only a few
+thousand inserts per second.
+
+To alleviate the performance issue during backfilling, consider the following
+options:
+
+* Make the primary keys more compressible.
++
+For example, if the first column of the primary key is a random 32-byte ID,
+caching one billion primary keys would require at least 32 GB of RAM to stay
+in cache. If you are caching several days of backfill primary keys, you need
+several times that amount of memory. By making the primary key more
+compressible, you increase the likelihood that the primary keys fit in cache,
+thus reducing the amount of random disk I/O.
+
+* Use SSDs for storage as random seeks are orders of magnitude faster than spinning disks.
+
+* Change the primary key structure such that the backfill writes hit a
+continuous range of primary keys.
+
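A quick, self-contained way to see why key compressibility matters is to compare generic compression of sequential versus random 32-byte keys; this uses plain zlib on synthetic data, not Kudu's actual index encoding.

```python
import os
import zlib

N = 10_000
# Sequential 32-byte keys (e.g. timestamp-led) vs random 32-byte IDs.
sequential = b"".join(i.to_bytes(32, "big") for i in range(N))
random_ids = b"".join(os.urandom(32) for _ in range(N))

seq_ratio = len(zlib.compress(sequential)) / len(sequential)
rand_ratio = len(zlib.compress(random_ids)) / len(random_ids)
# Sequential keys compress dramatically; random keys barely compress at
# all, so a random-ID key column needs roughly its full size in cache.
```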
 [[partitioning]]
 == Partitioning
 

