From: "Mostafa Mokhtar (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Date: Tue, 19 Aug 2014 15:19:19 +0000 (UTC)
Subject: [jira] [Commented] (HIVE-7664) VectorizedBatchUtil.addRowToBatchFrom is not optimized for Vectorized execution and takes 25% CPU

    [ https://issues.apache.org/jira/browse/HIVE-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102305#comment-14102305 ]

Mostafa Mokhtar commented on HIVE-7664:
---------------------------------------

[~navis] Can you please add a code review?

> VectorizedBatchUtil.addRowToBatchFrom is not optimized for Vectorized execution and takes 25% CPU
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7664
>                 URL: https://issues.apache.org/jira/browse/HIVE-7664
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.13.1
>            Reporter: Mostafa Mokhtar
>             Fix For: 0.14.0
>
>         Attachments: HIVE-7664.1.patch.txt
>
>
> In a group-by-heavy vectorized Reducer vertex, 25% of CPU is spent in VectorizedBatchUtil.addRowToBatchFrom().
> I looked at the code of VectorizedBatchUtil.addRowToBatchFrom, and it does not appear to be optimized for vectorized processing.
> addRowToBatchFrom is called for every row, and for each row it calls getPrimitiveCategory on every column in the batch to determine the column's type. Column types are stored in a HashMap. For VectorGroupByOperator, column types do not change between batches, so they should not be looked up for every row.
> I recommend storing the column type in StructObjectInspector so that other components can leverage this optimization.
> addRowToBatchFrom also executes a case statement for every row and every column to perform type casting. I recommend encapsulating the type logic in templatized methods.
> {code}
> Stack Trace                                                Sample Count    Percentage(%)
> VectorizedBatchUtil.addRowToBatchFrom                      86              26.543
> AbstractPrimitiveObjectInspector.getPrimitiveCategory()    34              10.494
> LazyBinaryStructObjectInspector.getStructFieldData         25              7.716
> StandardStructObjectInspector.getStructFieldData           4               1.235
> {code}
> The query used:
> {code}
> select
>   ss_sold_date_sk
> from
>   store_sales
> where
>   ss_sold_date between '1998-01-01' and '1998-06-01'
> group by ss_item_sk, ss_customer_sk, ss_sold_date_sk
> having sum(ss_list_price) > 50000000000000;
> {code}
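
To illustrate the recommendation in the issue description, here is a minimal sketch of resolving each column's type once and reusing a per-column assigner for every row, so the per-row path contains no type lookup and no case statement. It is not the attached patch and does not use Hive's classes: Category, Batch, Assigner, buildAssigners, and addRow are all hypothetical stand-ins invented for this example.

```java
class CachedColumnAssigners {
    // Hypothetical stand-in for a column's primitive category.
    enum Category { LONG, DOUBLE }

    // Hypothetical stand-in for a vectorized batch: one primitive array per column.
    static final class Batch {
        final long[][] longCols;
        final double[][] doubleCols;
        int size = 0;

        Batch(Category[] types, int capacity) {
            longCols = new long[types.length][];
            doubleCols = new double[types.length][];
            for (int c = 0; c < types.length; c++) {
                if (types[c] == Category.LONG) {
                    longCols[c] = new long[capacity];
                } else {
                    doubleCols[c] = new double[capacity];
                }
            }
        }
    }

    // One assigner per column; it already knows the column's type.
    interface Assigner {
        void assign(Batch batch, int rowIndex, Object value);
    }

    // The type switch runs once per column here, not once per row and column.
    static Assigner[] buildAssigners(Category[] types) {
        Assigner[] assigners = new Assigner[types.length];
        for (int c = 0; c < types.length; c++) {
            final int col = c;
            switch (types[c]) {
                case LONG:
                    assigners[c] = (b, r, v) -> b.longCols[col][r] = ((Number) v).longValue();
                    break;
                case DOUBLE:
                    assigners[c] = (b, r, v) -> b.doubleCols[col][r] = ((Number) v).doubleValue();
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported type: " + types[c]);
            }
        }
        return assigners;
    }

    // Per-row path: no type lookup and no switch, only the cached assigners.
    static void addRow(Batch batch, Assigner[] assigners, Object[] row) {
        int r = batch.size;
        for (int c = 0; c < assigners.length; c++) {
            assigners[c].assign(batch, r, row[c]);
        }
        batch.size++;
    }

    public static void main(String[] args) {
        Category[] types = { Category.LONG, Category.DOUBLE };
        Batch batch = new Batch(types, 1024);
        Assigner[] assigners = buildAssigners(types); // built once, reused for all rows
        addRow(batch, assigners, new Object[] { 42L, 3.5 });
        addRow(batch, assigners, new Object[] { 7L, 1.25 });
        System.out.println("rows in batch: " + batch.size);
    }
}
```

Whether this shape fits Hive depends on where the column types can be cached safely (the description suggests StructObjectInspector). The sketch only shows the general idea of moving type dispatch out of the per-row loop.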