kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ehud Eshet <Ehud.Es...@imperva.com>
Subject Database audit analytics - Kudu vs. Parquet
Date Mon, 31 Jul 2017 13:13:22 GMT
Hi all,
High level use case description
Database Activity Monitoring captures audit records from various relational and NoSQL databases.
Big customers may have thousands of different monitored databases that may generate (together)
billions to trillion audit records per day.
Audit records are loaded once, queried many times, and never being updated.
Customers may define one audit policy to audit all DB activity or multiple audit policies
on which each will audit a sub-set of DB activity.
Audit records are highly structured. Each record contains:

*        Event time stamp and sequence id.

*        Database identifiers such as: Server IP, Listener port, Instance Name, DB Name, DBMS
type, DB account.

*        Client identifiers such as: Client IP, Client port, OS User, Host, Application, Connection

*        Operation identifiers such as: Audit policy, Operation, Query text, Bind variables
values, Object schema, Object name, Object type, Error code.

*        Counters such as: Response time, Response size, Affected rows.
The actual structure contains more than 50 columns. Most of them are string identifiers (dimensions).
Audit analytics using Spark SQL with Parquet
Parquet columnar encoding and compression provides a compression ratio of 1 to 100 on my typical
Encoding is selected automatically per column by the parquet loader based on the actual data
loaded into the current 128MB rows set.
Multiple encodings are supported for same column. E.g. Dictionary and then RLE which are best
together when a column has both small amount of long strings and repeating values.
Spark Parquet integration allows each Spark worker node to read Parquet file blocks that are
stored on the local HDFS node.
No range partitions nor hash partitions are supported.
Multi-level list partitions are supported. Adding a partition is just a matter of creating
additional HDFS sub-directory with the appropriate naming conventions.
The main pain point with Parquet files is that you cannot update a file after closing it.
To append records, you can only add a new Parquet file on the same HDFS sub-directory.
As long as Parquet file is still open, it is not available for reporting.
Parquet files are not necessarily sorted. Even when they are, Spark SQL will not take advantage
on sort order while filtering by sorted leading columns.
Audit analytics using Spark SQL and Kudu
Kudu solve the main pain point of static Parquet files.
However, it introduces other problems:

1.      No support for combination of RLE and dictionary encoding on same String columns (actually
RLE is not supported even alone for string columns).
Same typical data suffer from lower compression ratio (1 to 14).

2.      Incomplete Spark SQL integration

a.      Tables can be created via Impala but not via Spark SQL.

b.      Spark SQL can append rows into Kudu table.
However, I did not find documentation for how to tell Spark SQL to add range partitions when

c.      Predicates are pushed by Spark SQL down to Kudu.
However, aggregations are not. All qualified rows are returned to Spark SQL instead of each
tablet will return a partially aggregated (and much smaller) rows set.

3.      Primary key is mandatory even when rows will never be updated or deleted (historical
data is purged by dropping entire range partition).
Apache Kudu wish list

*        List partitions - all audit reports must filter on audit policy (equal or IN list
Currently, I can use hash partitioning on policy.
However, number of buckets cannot be changed after table creation.
I would like any new policy to automatically create a new partition.

*        Interval based range partitions - let me just define the interval in micro seconds
(86,400,000,000 for daily partitions).
Then Kudu could add partitions automatically when rows for a new day arrive.

*        Automatic compression of dictionary pages when string length > 100

*        Allow RLE encoding of the indexes in the dictionary of string columns.

*        Allow delta encoding for sequences and timestamp columns.

*        Aggregation push down.

Ehud Eshet | Senior Researcher
Ehud.Eshet@imperva.com<mailto:Ehud.Eshet@imperva.com> | o: +972 3-684-0114 | m: +972

View raw message