From: Guy Doulberg
To: "user@hive.apache.org"
Date: Tue, 25 Jan 2011 08:25:36 -0800
Subject: Distinct in hive

Hey,

We wrote a query in Hive that calculates the number of distinct values within a GROUP BY.

On a small portion of the data it worked well. However, when we ran the query over a large portion of the data, it failed with an OutOfMemory error in some of the reducers.

We wonder how the distinct operator works in Hive. Does it use some sort of data structure whose size is proportional to the number of distinct values?
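
To make the question concrete, here is a minimal Python sketch of the kind of structure we have in mind: one hash set per group, so memory grows with the number of distinct values seen. This only illustrates our guess and is not a claim about how Hive implements distinct; the function name and the (group, value) input format are hypothetical.

```python
from collections import defaultdict


def count_distinct_per_group(rows):
    """Count distinct values per group.

    rows: iterable of (group_key, value) pairs (hypothetical input format).
    """
    seen = defaultdict(set)  # one set per group key
    for group_key, value in rows:
        # Each previously unseen value adds an entry to the set, so memory
        # use is proportional to the number of distinct values per group.
        seen[group_key].add(value)
    return {group_key: len(values) for group_key, values in seen.items()}
```

If the reducers hold something like this in memory, a group with very many distinct values could plausibly exhaust the heap, which would match the behaviour we saw.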

 

Many thanks