phoenix-dev mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PHOENIX-3560) Aggregate query performance is worse with encoded columns for schema with large number of columns
Date Wed, 11 Jan 2017 06:03:59 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817308#comment-15817308 ]

Lars Hofhansl edited comment on PHOENIX-3560 at 1/11/17 6:03 AM:
-----------------------------------------------------------------

The FirstKeyOnlyFilter would still work and be effective; it's just that HBase cannot effectively
seek over as many bytes as before (in the simple COUNT(*) case).
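
For reference, a minimal sketch of what such a COUNT(*) looks like against the raw HBase
client API. This is not Phoenix's actual aggregation path (which runs server-side in a
coprocessor), and the table name is made up:

{noformat}
// Sketch only: a client-side row count using FirstKeyOnlyFilter.
// "MY_TABLE" is a made-up name; Phoenix does not count rows this way.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

public class CountStar {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("MY_TABLE"))) {
      Scan scan = new Scan();
      // Return only the first KeyValue of each row; after it, the server
      // issues a seek to the next row instead of reading remaining cells.
      scan.setFilter(new FirstKeyOnlyFilter());
      long count = 0;
      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) count++;
      }
      System.out.println("rows: " + count);
    }
  }
}
{noformat}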

Imagine a row with 10000 columns, each 50 bytes. The total size would be 500KB. In the
COUNT(*) case we can use the FirstKeyOnlyFilter. Without encoding, HBase loads the first block,
which will (by default) be at most 65K + 499 bytes (let's just say 64K). So it will load
the block, look at the first key of the first KeyValue, and then seek to the next row, i.e.
the first KeyValue of the next row. It can thus seek past the rest of the 500KB without ever
loading those blocks.

In the encoded case, the first block would be 500KB in size, since HBase will not break up
a KeyValue between blocks. So HBase has to load the whole 500KB just to read the first key
of the first KeyValue.
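
Putting numbers on the two cases (all figures from the example above):

{noformat}
row size:     10,000 columns x ~50 bytes -> ~500KB

non-encoded:  load first block (~64KB), read first key,
              seek to next row -> the remaining ~436KB of
              blocks are never loaded
encoded:      first KeyValue = the whole encoded row (~500KB),
              so a single ~500KB block must be loaded in full
{noformat}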

I do not see a way out of this, other than saying that this is a fairly contrived case.
The default blocksize is 64KB, and the default maximum KeyValue (Cell) size is 1MB. So if the
row size falls between these two sizes, simple scans like COUNT(*) might be slower.
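
For concreteness, a minimal sketch of where those two limits live in the HBase client API.
The property key and method are real HBase 1.x API; the values merely mirror the numbers in
this thread and are not verified defaults or recommendations:

{noformat}
// Sketch only: the KeyValue size cap and the per-family block size.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;

public class Limits {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Client-side cap on the size of a single KeyValue (Cell):
    conf.setInt("hbase.client.keyvalue.maxsize", 1024 * 1024);

    // Block size is a per-column-family attribute
    // ("0" is Phoenix's default column family):
    HColumnDescriptor fam = new HColumnDescriptor("0");
    fam.setBlocksize(64 * 1024); // HBase's default is 65536
  }
}
{noformat}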

[~samarthjain], how is the encoding dealing with the 1MB limit? Does it (1) simply fail, or
will it (2) split the encoding into multiple Cells accordingly? If the latter, one could simply
do that split at smaller sizes.



> Aggregate query performance is worse with encoded columns for schema with large number of columns
> -------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-3560
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3560
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Mujtaba Chohan
>            Assignee: Samarth Jain
>             Fix For: 4.10.0
>
>         Attachments: DataGenerator.java, PHOENIX-3565.patch
>
>
> Schema with 5K columns
> {noformat}
> create table (k1 integer, k2 integer, c1 varchar ... c5000 varchar CONSTRAINT PK PRIMARY KEY (K1, K2))
> VERSIONS=1, MULTI_TENANT=true, IMMUTABLE_ROWS=true
> {noformat}
> In this test, there are no null columns and each column contains 200 chars, i.e. 1MB of data per row.
> COUNT(*) aggregation is about 5X slower with encoded columns when compared to a table with non-encoded columns using the same schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
