Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: error (athena.apache.org: local policy)
MIME-Version: 1.0
From: =?UTF-8?Q?Philippe_Kern=C3=A9vez?= <pkernevez@octo.com>
Date: Tue, 17 Feb 2015 18:31:17 +0100
Message-ID: 
 <CAGm8jbAktEOs4eYBS9TjnB=xpGWWbxd_dhpk-GYgQUn3TbWV7A@mail.gmail.com>
Subject: Remove duplicated rows
To: user@hive.apache.org
Content-Type: multipart/alternative; boundary=001a11340c6847b5d7050f4c13ea

--001a11340c6847b5d7050f4c13ea
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi,

I have a table (named DEDUPLICATED) that contains about 1 billion rows.
Each day I receive a new file with about 5 million rows. About 10% of
those rows are duplicates (duplication occurs inside a daily file but also
between files).
There are about 30 fields in the files.

For now I deduplicate all the data every day with the following query:

  INSERT OVERWRITE TABLE DEDUPLICATED
    SELECT cl.*
    FROM (
        -- DAILY is an external table that contains all the daily files
        SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
        FROM DAILY d
        ) cl
    WHERE cl.pos = 1

On the mailing list I saw another approach based on a "group by KEY" query
that uses a 'select MAX(xxx)' for every non-key field.

My first question is: which of the two seems better?
(The second one is harder to maintain, as all the fields have to be
written explicitly in the query.)
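
To make the comparison concrete, the "group by" variant would look roughly
like the sketch below. FIELD1 and FIELD2 are placeholder names standing in
for the real non-key columns (about 30 of them), not the actual schema.

  INSERT OVERWRITE TABLE DEDUPLICATED
    SELECT d.KEY,
           -- FIELD1, FIELD2: placeholder column names (assumption)
           MAX(d.FIELD1) AS FIELD1,
           MAX(d.FIELD2) AS FIELD2
           -- one MAX() per remaining non-key field would follow here
    FROM DAILY d
    GROUP BY d.KEY

Note that the two are only equivalent when duplicated rows are strictly
identical: MAX() is computed per column, so it can combine values coming
from different rows, whereas ROW_NUMBER() keeps one whole row per KEY.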


The second question is: what is the best way to do the deduplication and
import incrementally?
Something like this?
  INSERT INTO TABLE DEDUPLICATED
    SELECT cl.*
    FROM (
        SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
        FROM LAST_DAILY_FILE d     -- ONLY the last file
        ) cl
    -- pos = 1 is REQUIRED to remove all the duplication inside the last file
    WHERE cl.pos = 1
    -- remove duplication between the last file and all the existing files
      AND cl.KEY NOT IN (SELECT KEY FROM DEDUPLICATED)
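
The same idea could also be written as an anti-join (LEFT OUTER JOIN plus
an IS NULL test) instead of NOT IN. This is only a sketch using the same
placeholder table names as above; it assumes KEY is never NULL, since the
two forms treat NULL keys differently.

  INSERT INTO TABLE DEDUPLICATED
    SELECT cl.*
    FROM (
        SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
        FROM LAST_DAILY_FILE d
        ) cl
    LEFT OUTER JOIN DEDUPLICATED e ON (cl.KEY = e.KEY)
    -- keep one row per KEY from the last file ...
    WHERE cl.pos = 1
    -- ... and only if that KEY is not already in DEDUPLICATED
      AND e.KEY IS NULL

Whether Hive handles one form better than the other is part of what I am
asking.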

And the last question: for the last query, does an index on KEY help
in Hive the way it can in a classical relational database?

Regards,
Philippe


-- 
Philippe Kernévez


Technical Director (Switzerland),
pkernevez@octo.com
+41 79 888 33 32

Find OCTO on OCTO Talk: http://blog.octo.com
OCTO Technology http://www.octo.com

--001a11340c6847b5d7050f4c13ea--