hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nitin Pawar <>
Subject Re: Dealing with duplicate rows in Hive
Date Wed, 02 Oct 2013 06:21:33 GMT
So you have 50 columns and out of them you want to use 9 columns for
finding unique rows?

am i correct in assuming that you want to make a key of combination of
these 9 columns so that you have just one row for a single combination of
these 9 columns ?

On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <> wrote:

> Hi,
> I am using Hive 8.1.8 in EMR.
> We have an extremely large table (~50 columns) where the uniqueness key is
> a combination of 9 different columns. I want to filter out any duplicate
> rows based on these 9 columns while retaining the ability to select other
> columns on an ad hoc basis. I don’t expect rows with the same uniqueness
> key to have different data, so I guess this can be generalized to just
> filtering out duplicate rows.
> My initial instinct was to do a “select distinct *” on the table and save
> the results into another table, but it appears that Hive does not support
> “distinct *”. Furthermore, Hive will apply distinct to every column in the
> select statement, so something like “select distinct(a), b” does not work
> either.
> The only option I could think of from here was to explicitly state all
> columns of the table inside the distinct statement, but this seems
> unnecessarily messy (again, the table contains more than 50 columns).
> Has anyone ran into a similar issue? Any insight would be appreciated.
> Thanks,
> Philo

Nitin Pawar

View raw message