Return-Path: X-Original-To: apmail-hive-dev-archive@www.apache.org Delivered-To: apmail-hive-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 9834410BCA for ; Fri, 14 Feb 2014 22:09:49 +0000 (UTC) Received: (qmail 68439 invoked by uid 500); 14 Feb 2014 22:09:48 -0000 Delivered-To: apmail-hive-dev-archive@hive.apache.org Received: (qmail 68381 invoked by uid 500); 14 Feb 2014 22:09:47 -0000 Mailing-List: contact dev-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list dev@hive.apache.org Received: (qmail 68373 invoked by uid 99); 14 Feb 2014 22:09:47 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 14 Feb 2014 22:09:47 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_HELO_PASS,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of remusr@microsoft.com designates 207.46.163.206 as permitted sender) Received: from [207.46.163.206] (HELO na01-bl2-obe.outbound.protection.outlook.com) (207.46.163.206) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 14 Feb 2014 22:09:39 +0000 Received: from BY2PR03MB238.namprd03.prod.outlook.com (10.242.37.18) by BY2PR03MB090.namprd03.prod.outlook.com (10.255.241.158) with Microsoft SMTP Server (TLS) id 15.0.883.10; Fri, 14 Feb 2014 22:09:04 +0000 Received: from BY2PR03MB412.namprd03.prod.outlook.com (10.141.141.25) by BY2PR03MB238.namprd03.prod.outlook.com (10.242.37.18) with Microsoft SMTP Server (TLS) id 15.0.878.16; Fri, 14 Feb 2014 22:09:02 +0000 Received: from BY2PR03MB412.namprd03.prod.outlook.com ([10.141.141.25]) by BY2PR03MB412.namprd03.prod.outlook.com ([10.141.141.25]) with mapi id 15.00.0878.008; Fri, 14 Feb 2014 22:09:02 +0000 From: Remus Rusanu To: "dev@hive.apache.org" CC: "xuefu@apache.org" , "Eric Hanson (BIG DATA)" Subject: RE: A question about the derivation of intermediate sum field for decimal average aggregates Thread-Topic: A question about the derivation of intermediate sum field for decimal average aggregates Thread-Index: Ac8peiutU7iy0/YYQJ+StopTkuk+gAALpFcAAAoHvhA= Date: Fri, 14 Feb 2014 22:09:02 +0000 Message-ID: <44e894c2083d4545ab5dd2989f099349@BY2PR03MB412.namprd03.prod.outlook.com> References: <26098f415a4e419cb002d6f1fa7a44e7@BY2PR03MB412.namprd03.prod.outlook.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [92.83.100.45] x-forefront-prvs: 01221E3973 x-forefront-antispam-report: SFV:NSPM;SFS:(10009001)(6009001)(189002)(199002)(377454003)(52604005)(51704005)(164054003)(13464003)(24454002)(49866001)(50986001)(47976001)(76576001)(74366001)(69226001)(47736001)(87266001)(74502001)(76786001)(47446002)(19580395003)(19580405001)(80976001)(85306002)(76796001)(4396001)(74662001)(31966008)(56776001)(54356001)(54316002)(74316001)(53806001)(76482001)(86612001)(46102001)(77982001)(74876001)(51856001)(94316002)(59766001)(95666001)(87936001)(95416001)(92566001)(81542001)(86362001)(81686001)(56816005)(2656002)(90146001)(81342001)(65816001)(81816001)(94946001)(79102001)(66066001)(93516002)(33646001)(93136001)(83322001)(80022001)(74706001)(63696002)(85852003)(83072002)(24736002);DIR:OUT;SFP:1101;SCL:1;SRVR:BY2PR03MB238;H:BY2PR03MB412.namprd03.prod.outlook.com;CLIP:92.83.100.45;FPR:6C1FF3E4.9FDA544E.21F53D7B.46F5CAEF.20360;MLV:sfv;PTR:InfoNoRecords;MX:1;A:1;LANG:en; Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-OriginatorOrg: microsoft.com X-Virus-Checked: Checked by ClamAV on apache.org Hi Xuefu, I do not have any particular use case in mind. I've noticed the problem whe= n I implemented the vectorized AVG for decimal, which must match your imple= mentation (since we vectorized only the map side operator, it better produc= e the output expected by the reduce side...). I thought that since we alte= r the precission/scale for the result, we may as well alter it for the inte= rmediate sum field. But if this complicates the use of object inspectors an= d introduces maintenance risks, then is probably not worth it. Thanks, ~Remus -----Original Message----- From: Xuefu Zhang [mailto:xzhang@cloudera.com]=20 Sent: Friday, February 14, 2014 7:18 PM To: dev@hive.apache.org Cc: xuefu@apache.org; Eric Hanson (BIG DATA) Subject: Re: A question about the derivation of intermediate sum field for = decimal average aggregates Remus, Thanks for looking into this. You're right that sum() result doesn't increa= se the scale, but have you seen that sum UDF returns wrong scale? As to the implementation of avg UDF, the object inspector for sum field is = initialized with a scale + 4, which might not be necessary, but perhaps har= mless. The same object inspector is also used for the average result, which= gives correct type. I guess it's possible to separate this into two object= inspectors, one for sum field and one for the avg result, but the differen= ce might be subtle and questionable. This is because the data may not compl= y to the metadata specified for Hive tables. Thus, I'm not sure if truncati= ng data before it's summed if the right behavior. Do you have a use case that suggests one is better than the other? --Xuefu On Fri, Feb 14, 2014 at 3:55 AM, Remus Rusanu wrote: > Hi, > > With HIVE-5872 the intermediate sum field for decimal aggregates was=20 > changed to increase scale by 4. I understand the reasoning for having=20 > accurate precision/scale for the aggregate output. However, for the=20 > intermediate sum field of AVG, I believe we should increase precision=20 > w/o increasing scale. The sum can grow large, but cannot increase=20 > digits in the fractional part, so we should increase the precision of=20 > the sum, but not the scale. When sum is divided by count to get the=20 > average on the reduce side then we should indeed project value with highe= r scale. > > Opinions? > > Thanks, > ~Remus > >