hive-issues mailing list archives

From "Antal Sinkovits (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-20523) Improve table statistics for Parquet format
Date Tue, 19 Feb 2019 14:35:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-20523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771971#comment-16771971 ]

Antal Sinkovits commented on HIVE-20523:
----------------------------------------

Hi [~george.pachitariu]

Thanks for the answer. I understood your code; this was the initial approach I was planning
to take as well. :)
The issue I see is that you only implemented the write (serialize) path, while the read
(deserialize) path remains as is.
Let me give an example, which might shed some light on what I mean.

For the setup, I've applied your patch on top of master and nothing else.

create table case1 (col string) stored as parquet;
insert into case1 values("This is a test string"); // -> rawDataSize: 105   
analyze table case1 compute statistics; // -> rawDataSize: 144
analyze table case1 compute statistics for columns; // -> rawDataSize: 1

Now if I start to mix these, things get more interesting, because your change only calculates
the size for the data it writes. For example, if I run these commands:
create table case2 (col string) stored as parquet;
insert into case2 values("This is a test string"); // -> rawDataSize: 105
analyze table case2 compute statistics for columns; // -> rawDataSize: 1
insert into case2 values("This is a test string"); // -> rawDataSize: 106 (1+105)

That's why I think there should be a single source of truth.
I've checked with the Parquet team, and unfortunately Parquet (unlike ORC) doesn't provide
any API on the writer side to get the total size. It's only available in the reader, because the
value is internal to Parquet and only gets written when the file is closed.
So it makes sense to use this value as our single source of truth. HIVE-20079 was done by [~aihuaxu];
I don't want to take credit for it. That change moves the stat calculation from the serde
to the writer: when the writer closes the file and Parquet writes the footer, it reads the value
back from the closed file and updates the stats.
This fixes the write path.

HIVE-21284 was done by me; it fixes the read path to use the same footer value when analyze
compute statistics for columns is run.
This way, the calculated value stays consistent no matter which path you take.
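
To make the difference concrete, here is a small toy model of the two behaviours. It is only a sketch: the class and function names are hypothetical and none of it is Hive or Parquet code. It assumes, as the case2 numbers suggest, that an insert adds its estimate to the stored rawDataSize while analyze for columns replaces it. The footer value of 120 used in the usage note is made up for illustration.

```python
# Toy model of the two behaviours described above. All class and function
# names are hypothetical; none of this is Hive or Parquet code.

class MixedSourceStats:
    """rawDataSize maintained by two paths that estimate size differently."""

    def __init__(self):
        self.raw_data_size = 0

    def insert(self, write_path_estimate):
        # The write path adds its own estimate to the stored value.
        self.raw_data_size += write_path_estimate

    def analyze_columns(self, read_path_estimate):
        # The read path replaces the stored value with its own estimate.
        self.raw_data_size = read_path_estimate


class FooterSourceStats:
    """rawDataSize always derived from one per-file value (the footer size)."""

    def __init__(self):
        self.footer_sizes = []  # one assumed footer value per written file
        self.raw_data_size = 0

    def insert(self, footer_size):
        # On file close, the size is read back from the footer and added.
        self.footer_sizes.append(footer_size)
        self.raw_data_size += footer_size

    def analyze_columns(self):
        # The read path recomputes the total from the same footer values.
        self.raw_data_size = sum(self.footer_sizes)


def replay_case2_mixed():
    """Replay the case2 sequence with the estimates quoted in this comment."""
    stats = MixedSourceStats()
    stats.insert(105)
    stats.analyze_columns(1)
    stats.insert(105)
    return stats.raw_data_size


def replay_case2_footer(footer_size):
    """Replay the same sequence with a single footer-based source."""
    stats = FooterSourceStats()
    stats.insert(footer_size)
    after_insert = stats.raw_data_size
    stats.analyze_columns()
    after_analyze = stats.raw_data_size
    stats.insert(footer_size)
    return after_insert, after_analyze, stats.raw_data_size
```

replay_case2_mixed() returns 106, matching the case2 result above. replay_case2_footer(120) returns (120, 120, 240): running analyze between the inserts no longer changes the total.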

Let me know if this makes sense. Thanks.

> Improve table statistics for Parquet format
> -------------------------------------------
>
>                 Key: HIVE-20523
>                 URL: https://issues.apache.org/jira/browse/HIVE-20523
>             Project: Hive
>          Issue Type: Improvement
>          Components: Physical Optimizer
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Minor
>         Attachments: HIVE-20523.1.patch, HIVE-20523.10.patch, HIVE-20523.11.patch, HIVE-20523.12.patch,
HIVE-20523.2.patch, HIVE-20523.3.patch, HIVE-20523.4.patch, HIVE-20523.5.patch, HIVE-20523.6.patch,
HIVE-20523.7.patch, HIVE-20523.8.patch, HIVE-20523.9.patch, HIVE-20523.patch
>
>
> Right now, in the table basic statistics, the *raw data size* for a row with any data
type in the Parquet format is 1. This is an underestimate when columns are complex
data structures, like arrays.
> Having tables with underestimated raw data size makes Hive assign fewer containers (mappers/reducers)
to them, making the overall query slower.
> Heavy underestimation also makes Hive choose MapJoin instead of ShuffleJoin, and the MapJoin
can fail with OOM errors.
> In this patch, I compute the column data size better, taking into account complex structures.
I followed the Writer implementation for the ORC format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
