Return-Path: X-Original-To: apmail-hive-dev-archive@www.apache.org Delivered-To: apmail-hive-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 0145311A26 for ; Thu, 4 Sep 2014 23:16:25 +0000 (UTC) Received: (qmail 72176 invoked by uid 500); 4 Sep 2014 23:16:24 -0000 Delivered-To: apmail-hive-dev-archive@hive.apache.org Received: (qmail 72025 invoked by uid 500); 4 Sep 2014 23:16:23 -0000 Mailing-List: contact dev-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list dev@hive.apache.org Received: (qmail 71763 invoked by uid 500); 4 Sep 2014 23:16:23 -0000 Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org Received: (qmail 71759 invoked by uid 99); 4 Sep 2014 23:16:23 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 04 Sep 2014 23:16:23 +0000 Date: Thu, 4 Sep 2014 23:16:23 +0000 (UTC) From: "Ankit Kamboj (JIRA)" To: hive-dev@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (HIVE-7989) Optimize Windowing function performance for row frames MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HIVE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Kamboj updated HIVE-7989: ------------------------------- Component/s: PTF-Windowing > Optimize Windowing function performance for row frames > ------------------------------------------------------ > > Key: HIVE-7989 > URL: https://issues.apache.org/jira/browse/HIVE-7989 > Project: Hive > Issue Type: Improvement > Components: PTF-Windowing > Affects Versions: 0.13.0 > Reporter: Ankit Kamboj > > To find aggregate value for each row, current windowing function implementation creates a new aggregation buffer for each row, iterates over all the rows in respective window frame, puts them in buffer and then finds the aggregated value. This causes bottleneck for partitions with huge number of rows because this process runs in n-square complexity (n being rows in a partition) for each partition. So, if there are multiple partitions in a dataset, each with millions of rows, aggregation for all rows will take days to finish. > There is scope of optimization for row frames, for following cases: > a) For UNBOUNDED PRECEDING start and bounded end: Instead of iterating on window frame again for each row, we can slide the end one row at a time and aggregate, since we know the start is fixed for each row. This will have running time linear to the size of partition (O(n)). > b) For bounded start and UNBOUNDED FOLLOWING end: Instead of iterating on window frame again for each row, we can slide the start one row at a time and aggregate in reverse, since we know the end is fixed for each row. This will have running time linear to the size of partition (O(n)). > Also, In general for both row and value frames, we don't need to iterate over the range and re-create aggregation buffer if the start as well as end remain same. Instead, can re-use the previously created aggregation buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)