Date: Thu, 12 Nov 2015 21:38:11 +0000 (UTC)
From: "Sergey Shelukhin (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

    [ https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002983#comment-15002983 ]

Sergey Shelukhin commented on HIVE-11583:
-----------------------------------------

You could generate it in the test by repeatedly cross joining. Or does the file have to be in a specific form that is not reproducible by the queries?
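
A minimal sketch of what the cross-join approach might look like, not taken from the patch. It assumes the ptf_big_src table from the issue description already exists, that a single key value is enough to put all 50001 rows into one window partition, and that padding the value column makes the spilled blocks large enough. The helper table ptf_digits, the key 'k', and the padding length of 1000 are all hypothetical and would need tuning. Whether generated data triggers the bug at all is the open question raised above.

{code:sql}
-- Hypothetical helper table holding the digits 0..9 (the name is illustrative).
CREATE TABLE ptf_digits (n INT);
INSERT INTO TABLE ptf_digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- A five-way cross join yields 10^5 = 100000 candidate ids; keep ids 1..50001.
-- Cartesian products may be rejected when hive.mapred.mode is strict.
INSERT OVERWRITE TABLE ptf_big_src
SELECT t.id,
       'k' AS key,                        -- assumption: one key, so one window partition
       CASE WHEN t.id <= 25000 THEN 'A'   -- ids 1..25000      -> 25000 rows
            WHEN t.id <= 45000 THEN 'B'   -- ids 25001..45000  -> 20000 rows
            ELSE 'C'                      -- ids 45001..50001  ->  5001 rows
       END AS grp,
       repeat('x', 1000) AS value         -- assumption: padding length is a guess
FROM (
  SELECT d1.n + 10 * d2.n + 100 * d3.n + 1000 * d4.n + 10000 * d5.n + 1 AS id
  FROM ptf_digits d1
  CROSS JOIN ptf_digits d2
  CROSS JOIN ptf_digits d3
  CROSS JOIN ptf_digits d4
  CROSS JOIN ptf_digits d5
) t
WHERE t.id <= 50001;
{code}

The grp boundaries are chosen to match the per-group counts listed in the issue description (25000, 20000, 5001). INSERT ... VALUES needs Hive 0.14 or later, so on older affected versions the digits table would have to be loaded another way.
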
> When PTF is used over a large partitions result could be corrupted
> ------------------------------------------------------------------
>
>                 Key: HIVE-11583
>                 URL: https://issues.apache.org/jira/browse/HIVE-11583
>             Project: Hive
>          Issue Type: Bug
>          Components: PTF-Windowing
>    Affects Versions: 0.14.0, 0.13.1, 0.14.1, 1.0.0, 1.2.0, 1.2.1
>         Environment: Hadoop 2.6 + Apache hive built from trunk
>            Reporter: Illya Yalovyy
>            Assignee: Illya Yalovyy
>            Priority: Critical
>             Fix For: 1.3.0, 2.0.0
>
>         Attachments: HIVE-11583.patch
>
>
> Dataset:
>  The window has 50001 records (2 blocks on disk and 1 block in memory).
>  The size of the second block is > 32 MB (2 splits).
> Result:
> When the last block is read from disk, only the first split is actually loaded. The second split is missed. The total count of the result dataset is correct, but some records are missing and others are duplicated.
> Example:
> {code:sql}
> CREATE TABLE ptf_big_src (
>   id INT,
>   key STRING,
>   grp STRING,
>   value STRING
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
> LOAD DATA LOCAL INPATH '../../data/files/ptf_3blocks.txt.gz' OVERWRITE INTO TABLE ptf_big_src;
> -- Counts in the source table (expected):
> SELECT grp, COUNT(1) cnt FROM ptf_big_src GROUP BY grp ORDER BY cnt desc;
> ---
> -- A	25000
> -- B	20000
> -- C	5001
> ---
> CREATE TABLE ptf_big_trg AS SELECT *, row_number() OVER (PARTITION BY key ORDER BY grp) grp_num FROM ptf_big_src;
> -- Counts after the windowing function (actual):
> SELECT grp, COUNT(1) cnt FROM ptf_big_trg GROUP BY grp ORDER BY cnt desc;
> ---
> -- A	34296
> -- B	15704
> -- C	1
> ---
> {code}
> Counts by 'grp' are incorrect!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)