phoenix-dev mailing list archives

From "James Taylor (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PHOENIX-4550) Allow declaration of max columns on base physical table
Date Thu, 25 Jan 2018 01:07:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338493#comment-16338493
] 

James Taylor edited comment on PHOENIX-4550 at 1/25/18 1:06 AM:
----------------------------------------------------------------

I think there's a theoretical problem with updatable views in general. There could be multiple
views for the same row. This is arguably a situation we may want to prevent, but we're not
doing that today. For example, say you have the following hierarchy:

T (A, B, C)
V1 (D, E) FROM T WHERE A = 1
V2 (F, G) FROM T WHERE A = 1 AND B = 2

The same rows in table T could be in both V1 and V2. So then T would occupy positions 1-3,
V1 would occupy positions 4-5, and V2 would occupy positions 6-7. Depending on which view
you updated through, you'd have nulls in either positions 4-5 or 6-7. In cases like this,
there's no advantage to having a mapping or declaring the max number of columns.
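
To make the overlap concrete, here is a minimal Python sketch (not Phoenix code) that models a stored row as a fixed array of positions 1-7, using the T/V1/V2 layout above. The `LAYOUT` dict, `ROW_WIDTH`, and the `write_through` helper are hypothetical names used only for illustration.

```python
# Hypothetical model of the example hierarchy: each table or view owns a
# contiguous range of 1-based array positions within the same physical row.
LAYOUT = {
    "T": range(1, 4),   # A, B, C -> positions 1-3
    "V1": range(4, 6),  # D, E    -> positions 4-5
    "V2": range(6, 8),  # F, G    -> positions 6-7
}
ROW_WIDTH = 7


def write_through(view, values):
    """Return the stored array for a row upserted through `view`.

    Base table positions are always written; only the chosen view's own
    positions are filled, and every other position stays None (null).
    """
    row = [None] * ROW_WIDTH
    positions = list(LAYOUT["T"]) + list(LAYOUT[view])
    for pos, value in zip(positions, values):
        row[pos - 1] = value
    return row


# The same base row (A = 1, B = 2) qualifies for both views.
via_v1 = write_through("V1", [1, 2, "c", "d", "e"])
via_v2 = write_through("V2", [1, 2, "c", "f", "g"])
print(via_v1)  # [1, 2, 'c', 'd', 'e', None, None]
print(via_v2)  # [1, 2, 'c', None, None, 'f', 'g']
```

Whichever view the row is written through, the other view's positions remain null, so neither a mapping nor a declared maximum removes the gap.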

If we detect this and disallow it at creation time, we can pursue this JIRA. I've filed PHOENIX-4555
for that. In reality, we don't have use cases in which view rows overlap, so this is kind
of a theoretical problem.

So assuming the views aren't overlapping, how would you deal with columns that have been dropped?
Also, are you planning to push this map through every SingleCellColumnExpression? Wouldn't
that be expensive, especially if there are many columns and many column references in a query?

With the alternative, preallocating a fixed number of columns, you'd need to push the preallocated
number plus the original starting column qualifier of a view to figure out the array position.
The downside is that the preallocated columns would be wasteful.
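
As a rough illustration of the preallocation alternative, the Python sketch below shows one way the array position could be derived from just those two values. This is an assumption about how the arithmetic might work, not the Phoenix implementation: `array_position` and its parameters are hypothetical, qualifiers are treated as 1-based, and a view's own columns are assumed to use consecutive qualifiers.

```python
def array_position(qualifier, preallocated, view_start_qualifier):
    """Map an encoded column qualifier to a 1-based array position.

    Assumes base table columns use qualifiers 1..preallocated and the
    view's own columns use consecutive qualifiers starting at
    view_start_qualifier (both assumptions made for this sketch).
    """
    if qualifier <= preallocated:
        # Base table column: stored at the front of the array.
        return qualifier
    if qualifier < view_start_qualifier:
        raise ValueError("qualifier belongs to a different view")
    # View column: packed directly after the preallocated block.
    return preallocated + (qualifier - view_start_qualifier) + 1


# Base table preallocates 10 positions; this view's columns start at 57.
print(array_position(2, 10, 57))   # 2
print(array_position(57, 10, 57))  # 11
print(array_position(58, 10, 57))  # 12
```

If the base table only ever uses 3 of its 10 preallocated positions, the remaining 7 are stored as nulls in every row, which is the waste mentioned above.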

I'm not sure the map idea solves the case where a column is added to the base table, since
that column needs to be in the same array position for all rows.

> Allow declaration of max columns on base physical table
> -------------------------------------------------------
>
>                 Key: PHOENIX-4550
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4550
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: James Taylor
>            Priority: Major
>
> By declaring the max number of columns on a base table, we can optimize the storage for
> SINGLE_CELL_ARRAY_WITH_OFFSETS by not storing null values for the columns preceding the initial
> column of a view. This will make a huge difference in storage when you have a base table with
> many views. For example:
> {code}
> -- Declare that the base table will have no more than 10 columns
> CREATE IMMUTABLE TABLE base (k1 VARCHAR, prefix CHAR(3), v1 DATE,
>     CONSTRAINT pk PRIMARY KEY (k1, prefix))
>     MULTI_TENANT = true,
>     MAX_COLUMNS = 10;
> CREATE VIEW v1(k2 VARCHAR PRIMARY KEY, v2 VARCHAR, v3 VARCHAR)
>     AS SELECT * FROM base WHERE prefix = 'A00';
> CREATE VIEW v2(k2 VARCHAR PRIMARY KEY, v2 VARCHAR, v3 VARCHAR)
>     AS SELECT * FROM base WHERE prefix = 'A10';
> ...
> {code}
> As the number of views grows, the gap between the base table column encoding (column
> #1) and the starting column number of each new view will increase, since the starting offset
> is determined by an incrementing value on the base table. This bloats the storage, as we need
> to store null values for the column encodings between the base table column and the starting
> column of the view.
> Instead, we'll pass the MAX_COLUMNS value through for queries. Any column encoding less than
> this value we know will be at the start of the array. For anything greater, we'll start the
> search from <column encoding> - <minimum view column encoding>.
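
The storage effect described in the issue can be estimated with a small Python sketch. It is a simplified model under stated assumptions, not Phoenix code: `null_slots` is a hypothetical helper, primary key columns are assumed not to occupy array positions, every view is assumed to add the same number of non-key columns, and view starting qualifiers are assumed to come from a single incrementing counter on the base table.

```python
def null_slots(base_columns, view_columns, num_views, max_columns=None):
    """Return the number of null array slots stored per row, for each view.

    Without max_columns, a view's row must pad every position between the
    last base column and the view's first column. With max_columns, only
    the unused part of the preallocated block is padded.
    """
    counts = []
    first_free = (max_columns if max_columns is not None else base_columns) + 1
    next_qualifier = first_free
    for _ in range(num_views):
        view_start = next_qualifier
        if max_columns is None:
            counts.append(view_start - 1 - base_columns)
        else:
            counts.append(max_columns - base_columns)
        # The counter on the base table keeps incrementing across views.
        next_qualifier += view_columns
    return counts


# Modeled on the example: one non-key base column (v1), two per view (v2, v3).
print(null_slots(1, 2, 5))                  # [0, 2, 4, 6, 8]
print(null_slots(1, 2, 5, max_columns=10))  # [9, 9, 9, 9, 9]
print(null_slots(1, 2, 100)[-1])            # 198
```

In this model the padding grows linearly with the number of views unless a maximum is declared, in which case it stays constant at the unused portion of the preallocated block. With only a handful of views the fixed block costs more than it saves.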



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)