hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Philippe Kernévez <pkerne...@octo.com>
Subject Re: Remove duplicated rows
Date Wed, 18 Feb 2015 09:32:45 GMT
Hi Dev,

> "Only suggestion is that please don't compress the files that are linked
in the external table" .
Why do you suggest that ? Presently, I divide my import time by 2 by using
'gz' files instead of the uncompressed 'csv'.

> "You may want to leverage dynamic partition"
I read you post, and I wonder if it can scale. It seems that you have
hourly and daily aggregation. More or less 24*365=8'800 raw per year.
We have about 2 billions (2E9) raws per year.
Do you know if you approach can scale on a such factor ?
I also wonder about the impact of the partition columns on the map reduce
job, they will probably produce a big number of files to produce those
partitions.


Regards,
Philippe



On Wed, Feb 18, 2015 at 2:49 AM, Devopam Mittra <devopam@gmail.com> wrote:

> dear Philippe,
> I asked specifically the version so as to understand what built-in
> features you can leverage to your benefit.
>
> There are two ways that I can think of - to help you in your case :
> 1. Use RANK as you have already noted in your original note - this will be
> faster and more appropriate . Only suggestion is that please don't compress
> the files that are linked in the external table.
>
> 2. You may want to leverage dynamic partition feature to your aid for
> managing the duplicate records . I like it and use it extensively now a
> days after 0.13 onward. Brief concept : dynamically partition the table on
> one low skew column + 'key' column . Then simply INSERT OVERWRITE the
> 'delta' and it will seamlessly overwrite only the affected rows and do the
> redistribution of data in the table internally without you having to bother
> about the code or the burden to do it. You may refer to this blog I wrote
> quite some time back : http://linkd.in/1Fq3wdb
> This technique will cost you a little overhead time with the MR job
> getting kicked off and all , but is painless and seamless , so I like it
> this way.
>
> regards
> Dev
>
> On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez <pkernevez@octo.com>
> wrote:
>
>> Hi Dev,
>>
>> I'm using hive 0.14.
>>
>> Regards,
>>
>> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra <devopam@gmail.com>
>> wrote:
>>
>>> Can you please state the hive version
>>>
>>> regards
>>> Dev
>>> +91 958 305 9899
>>>
>>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez <pkernevez@octo.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I have a table (named DEDUPLICATED) that contains about 1 billions rows.
>>> Each day I receive a new file  with about 5 millions rows. About 10% of
>>> those rows are duplicated (duplication occur inside a daily file but also
>>> between files).
>>> There are about 30 fields in the files.
>>>
>>> As for now I deduplicate all the data every day with the following
>>> request :
>>>
>>>   INSERT OVERWRITE TABLE DEDUPLICATED
>>>     SELECT cl.*
>>>     FROM (
>>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>         FROM DAILY d  -- DAILY is an external table that contains all
>>> the daily files
>>>         ) cl
>>>     WHERE cl.pos = 1
>>>
>>> On the mailing list I saw another approach base on a "group by KEY"
>>> request and use a 'select MAX(xxx)' for all non-key fields.
>>>
>>> My first question is : which of the both seems to be better ?
>>> (the second one is quite harder to maintain as all the fields should be
>>> explicitly written in the request).
>>>
>>>
>>>
>>> The second question is : what is the best way to do the deduplication
>>> and import on a incremental approach ?
>>> Something like that ?
>>>   INSERT TABLE DEDUPLICATED
>>>     SELECT cl.*
>>>     FROM (
>>>         SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>         FROM LAST_DAILY_FILE d     -- ONLY the last file
>>>         ) cl
>>>     WHERE cl.pos = 1    -- REQUIRED to remove all the duplication inside
>>> the last file
>>>    AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove duplication
>>> between the last file and all the existing files
>>>
>>> And the last question : for the last request, does an index on KEY help
>>> with hive as it can help on a classical relational database ?
>>>
>>> Regards,
>>> Philippe
>>>
>>>
>>>
>>> --
>>> Philippe Kernévez
>>>
>>>
>>>
>>> Directeur technique (Suisse),
>>> pkernevez@octo.com
>>> +41 79 888 33 32
>>>
>>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>>> OCTO Technology http://www.octo.com
>>>
>>>
>>
>>
>> --
>> Philippe Kernévez
>>
>>
>>
>> Directeur technique (Suisse),
>> pkernevez@octo.com
>> +41 79 888 33 32
>>
>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>> OCTO Technology http://www.octo.com
>>
>
>
>
> --
> Devopam Mittra
> Life and Relations are not binary
>



-- 
Philippe Kernévez



Directeur technique (Suisse),
pkernevez@octo.com
+41 79 888 33 32

Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
OCTO Technology http://www.octo.com

Mime
View raw message