hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted
Date Thu, 12 Nov 2015 21:22:10 GMT

    [ https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002955#comment-15002955 ]

Sergey Shelukhin commented on HIVE-11583:
-----------------------------------------

This was committed a while ago... the test can be created in a separate JIRA if needed. I
don't have background on this issue; I bulk-commented yesterday on a large list of issues,
obtained via a script, whose titles look like bugs and that were committed to master but
not to branch-1.

> When PTF is used over a large partitions result could be corrupted
> ------------------------------------------------------------------
>
>                 Key: HIVE-11583
>                 URL: https://issues.apache.org/jira/browse/HIVE-11583
>             Project: Hive
>          Issue Type: Bug
>          Components: PTF-Windowing
>    Affects Versions: 0.14.0, 0.13.1, 0.14.1, 1.0.0, 1.2.0, 1.2.1
>         Environment: Hadoop 2.6 + Apache hive built from trunk
>            Reporter: Illya Yalovyy
>            Assignee: Illya Yalovyy
>            Priority: Critical
>             Fix For: 1.3.0, 2.0.0
>
>         Attachments: HIVE-11583.patch
>
>
> Dataset: 
>  Window has 50001 records (2 blocks on disk and 1 block in memory)
>  Size of the second block is >32 MB (2 splits)
> Result:
> When the last block is read from disk, only the first split is actually loaded and the
> second split is missed. The total count of the result dataset is correct, but some records
> are missing and others are duplicated.
> Example:
> {code:sql}
> CREATE TABLE ptf_big_src (
>   id INT,
>   key STRING,
>   grp STRING,
>   value STRING
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
> LOAD DATA LOCAL INPATH '../../data/files/ptf_3blocks.txt.gz' OVERWRITE INTO TABLE ptf_big_src;
> SELECT grp, COUNT(1) cnt FROM ptf_big_src GROUP BY grp ORDER BY cnt desc;
> ---
> -- A	25000
> -- B	20000
> -- C	5001
> ---
> CREATE TABLE ptf_big_trg AS SELECT *, row_number() OVER (PARTITION BY key ORDER BY grp)
>   grp_num FROM ptf_big_src;
> SELECT grp, COUNT(1) cnt FROM ptf_big_trg GROUP BY grp ORDER BY cnt desc;
> ---
> -- A	34296
> -- B	15704
> -- C	1
> ---
> {code}
> Counts by 'grp' are incorrect!
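
Because the total row count stays correct, a plain COUNT(*) will not expose the problem. A minimal sketch of two checks that could show which rows were duplicated and which were lost, assuming that `id` is unique in ptf_big_src (the report does not state this, so treat it as an assumption):

```sql
-- Assumption: id is unique per row in ptf_big_src (not confirmed by the report).

-- Rows that appear more than once in the target table.
SELECT id, COUNT(*) AS copies
FROM ptf_big_trg
GROUP BY id
HAVING COUNT(*) > 1;

-- Rows present in the source but missing from the target table.
SELECT s.id
FROM ptf_big_src s
LEFT OUTER JOIN ptf_big_trg t ON s.id = t.id
WHERE t.id IS NULL;
```

If the windowing step is correct, both queries should return no rows; any output would point at the affected records.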



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
