hive-user mailing list archives

From Philo Wang <phil...@gmail.com>
Subject Re: Dealing with duplicate rows in Hive
Date Wed, 02 Oct 2013 07:04:14 GMT
Yes, that is correct.


On Tue, Oct 1, 2013 at 11:21 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> So you have 50 columns and out of them you want to use 9 columns for
> finding unique rows?
>
> Am I correct in assuming that you want to make a key from the combination
> of these 9 columns, so that you have just one row for each combination of
> these 9 columns?
>
>
> On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <philowy@gmail.com> wrote:
>
>> Hi,
>>
>> I am using Hive 8.1.8 in EMR.
>>
>> We have an extremely large table (~50 columns) where the uniqueness key
>> is a combination of 9 different columns. I want to filter out any duplicate
>> rows based on these 9 columns while retaining the ability to select other
>> columns on an ad hoc basis. I don’t expect rows with the same uniqueness
>> key to have different data, so I guess this can be generalized to just
>> filtering out duplicate rows.
>>
>> My initial instinct was to do a “select distinct *” on the table and save
>> the results into another table, but it appears that Hive does not support
>> “distinct *”. Furthermore, Hive will apply distinct to every column in the
>> select statement, so something like “select distinct(a), b” does not work
>> either.
>>
>> The only option I could think of from here was to explicitly state all
>> columns of the table inside the distinct statement, but this seems
>> unnecessarily messy (again, the table contains more than 50 columns).
>>
>> Has anyone run into a similar issue? Any insight would be appreciated.
>>
>> Thanks,
>> Philo
>>
>>
>
>
> --
> Nitin Pawar
>
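
For reference, one possible way to get one row per key combination without
"distinct *" is to GROUP BY the key columns and wrap every other column in
MAX(). With ~50 columns the statement is tedious to write by hand, so the
sketch below generates it from column lists. Everything named here is an
assumption for illustration: the table "events", the columns k1, k2, v1, v2,
and the helper build_dedup_query are hypothetical, and sqlite3 only stands in
for Hive so the query can be run locally. The generated SQL uses nothing
beyond GROUP BY and MAX, but it has not been checked against the Hive version
mentioned in the thread.

```python
import sqlite3


def build_dedup_query(table, key_cols, other_cols):
    """Return a SELECT that keeps one row per combination of key_cols."""
    keys = ", ".join(key_cols)
    # MAX() collapses each non-key column to a single value per group.
    # This is only safe if duplicate rows carry identical non-key data,
    # as the original question assumes; otherwise values from different
    # rows could be mixed together.
    aggs = ", ".join("MAX({0}) AS {0}".format(c) for c in other_cols)
    select_list = keys + (", " + aggs if aggs else "")
    return "SELECT {0} FROM {1} GROUP BY {2}".format(select_list, table, keys)


# Hypothetical table with a 2-column key and 2 payload columns.
query = build_dedup_query("events", ["k1", "k2"], ["v1", "v2"])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (k1 TEXT, k2 INTEGER, v1 TEXT, v2 INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("a", 1, "x", 10),
        ("a", 1, "x", 10),  # exact duplicate of the row above
        ("a", 2, "y", 20),
        ("b", 1, "z", 30),
        ("b", 1, "z", 30),  # exact duplicate of the row above
    ],
)

# Sort so the result order does not depend on the engine.
deduped = sorted(conn.execute(query).fetchall())
print(query)
print(deduped)
```

In Hive the generated SELECT would presumably be used to populate a second
table (for example via INSERT OVERWRITE TABLE or CREATE TABLE ... AS SELECT),
with the real column names taken from the table definition.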
