Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 34E6D17549 for ; Tue, 17 Feb 2015 17:33:12 +0000 (UTC) Received: (qmail 3960 invoked by uid 500); 17 Feb 2015 17:33:10 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 3898 invoked by uid 500); 17 Feb 2015 17:33:10 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 3888 invoked by uid 99); 17 Feb 2015 17:33:10 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 17 Feb 2015 17:33:10 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW X-Spam-Check-By: apache.org Received-SPF: error (athena.apache.org: local policy) Received: from [209.85.215.54] (HELO mail-la0-f54.google.com) (209.85.215.54) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 17 Feb 2015 17:33:05 +0000 Received: by labge10 with SMTP id ge10so37058345lab.12 for ; Tue, 17 Feb 2015 09:31:39 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:from:date:message-id:subject:to :content-type; bh=IKsCJ5W7Dzkpv4cNgXhHmSxTZWo8C6V/YignvFnYrh4=; b=giD1fj1y7uLIsio6Oc9APrEKRE3b0gIq34H+BdMbbStEfNCgxZ80sGWDRj+RqIJ4Fp maWGE7lPMImZlXgHkeVWVO5ibSnbUenbyo05nQumG95dE+hZe84lxjG2APY8k2K6Ght4 AVV6x3/NZh73waj3SjR3nCnJIxzbKc4DYXI1glatAL180g9GuEOR0EOwlurXUtEbrTZF sVsHePtxSll4hJAsY2IAPLK8YqNFm+LtUbpiF2NtnNmftkv1Iak15xBxSW9uJceorYiJ O/Vxeuv9bANGdWjQEqZVI3LICsFHRaQg3VAknuDgkAV2XfNNYVRM8OFQ7sCYhRIuaswi PXTA== X-Gm-Message-State: ALoCoQlla8L+tiisbH62wlqXqeQE6Irnv7BqorvoDrttNLf+LYcCjI/SsoM8S5ZYLo8YNNQC/EKU X-Received: by 10.152.178.197 with SMTP id da5mr29463472lac.87.1424194298944; Tue, 17 Feb 2015 09:31:38 -0800 (PST) MIME-Version: 1.0 Received: by 10.113.3.44 with HTTP; Tue, 17 Feb 2015 09:31:17 -0800 (PST) From: =?UTF-8?Q?Philippe_Kern=C3=A9vez?= Date: Tue, 17 Feb 2015 18:31:17 +0100 Message-ID: Subject: Remove duplicated rows To: user@hive.apache.org Content-Type: multipart/alternative; boundary=001a11340c6847b5d7050f4c13ea X-Virus-Checked: Checked by ClamAV on apache.org --001a11340c6847b5d7050f4c13ea Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Hi, I have a table (named DEDUPLICATED) that contains about 1 billions rows. Each day I receive a new file with about 5 millions rows. About 10% of those rows are duplicated (duplication occur inside a daily file but also between files). There are about 30 fields in the files. As for now I deduplicate all the data every day with the following request = : INSERT OVERWRITE TABLE DEDUPLICATED SELECT cl.* FROM ( SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos FROM DAILY d -- DAILY is an external table that contains all the daily files ) cl WHERE cl.pos =3D 1 On the mailing list I saw another approach base on a "group by KEY" request and use a 'select MAX(xxx)' for all non-key fields. My first question is : which of the both seems to be better ? (the second one is quite harder to maintain as all the fields should be explicitly written in the request). The second question is : what is the best way to do the deduplication and import on a incremental approach ? Something like that ? INSERT TABLE DEDUPLICATED SELECT cl.* FROM ( SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos FROM LAST_DAILY_FILE d -- ONLY the last file ) cl WHERE cl.pos =3D 1 -- REQUIRED to remove all the duplication inside the last file AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED -- remove duplication between the last file and all the existing files And the last question : for the last request, does an index on KEY help with hive as it can help on a classical relational database ? Regards, Philippe --=20 Philippe Kern=C3=A9vez Directeur technique (Suisse), pkernevez@octo.com +41 79 888 33 32 Retrouvez OCTO sur OCTO Talk : http://blog.octo.com OCTO Technology http://www.octo.com --001a11340c6847b5d7050f4c13ea Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hi,

I have a table (named DEDUPLICATED)= that contains about 1 billions rows.
Each day I receive a new fi= le =C2=A0with about 5 millions rows. About 10% of those rows are duplicated= (duplication occur inside a daily file but also between files).
= There are about 30 fields in the files.

As for now= I deduplicate all the data every day with the following request :

=C2=A0 INSERT OVERWRITE TABLE DEDUPLICATED
= =C2=A0 =C2=A0 SELECT cl.*=C2=A0
=C2=A0 =C2=A0 FROM (
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.K= EY) AS pos=C2=A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 FROM DAILY d =C2=A0-= - DAILY is an external table that contains all the daily files
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 ) cl
=C2=A0 =C2=A0 WHERE cl.pos =3D 1= =C2=A0 =C2=A0
=C2=A0
On the mailing list I saw another= approach base on a "group by KEY" request and use a 'select = MAX(xxx)' for all non-key fields.

My first que= stion is : which of the both seems to be better ?
(the second one= is quite harder to maintain as all the fields should be explicitly written= in the request).



Th= e second question is : what is the best way to do the deduplication and imp= ort on a incremental approach ?
Something like that ?
<= div>=C2=A0 INSERT TABLE DEDUPLICATED
=C2=A0 =C2=A0 SELECT cl.*=C2= =A0
=C2=A0 =C2=A0 FROM (
=C2=A0 =C2=A0 =C2=A0 =C2=A0 SE= LECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos=C2=A0
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 FROM LAST_DAILY_FILE d =C2=A0 =C2=A0 -- ONLY the l= ast file
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ) cl
=C2=A0 =C2=A0= WHERE cl.pos =3D 1 =C2=A0 =C2=A0-- REQUIRED to remove all the duplication = inside the last file
=C2=A0 =C2=A0AND cl.KEY NOT IN SELECT = KEY FROM DEDUPLICATED =C2=A0-- remove duplication between the last file and= all the existing files

And the last question : fo= r the last request, does an index on KEY help with hive as it can help on a= classical relational database ?

Regards,
Philippe



--
Philippe Kern=C3=A9vez



Directeur technique (Suisse),
pkernevez@octo.= com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com
--001a11340c6847b5d7050f4c13ea--