hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mikael Öhman <mikael_u...@yahoo.se>
Subject SV: Duplicate rows when using group by in subquery
Date Tue, 17 Sep 2013 06:24:41 GMT
Thank you for the information. Just to be clear, it is not that I have manually restricted
the job to run using only a single mapreduce job, but it incorrectly assumes one job is enough?
 
I will get back with results from your suggestions ASAP; unfortunately I don't have the machines
available until Thursday.
 
/ Sincerely Mikael
 

________________________________
 Från: Yin Huai <huaiyin.thu@gmail.com>
Till: user@hive.apache.org; Mikael Öhman <mikael_uman@yahoo.se> 
Skickat: måndag, 16 september 2013 19:52
Ämne: Re: Duplicate rows when using group by in subquery
  


Hello Mikael,

Seems your case is related to the bug reported in https://issues.apache.org/jira/browse/HIVE-5149.
Basically, when hive uses a single MapReduce job to evaluate your query, "c.Symbol" and "c.catid"
are used to partitioning data, and thus, rows with the same value of "c.Symbol" are not correctly
grouped. If your case, only "c.Symbol" should be used if we want to use a single MapReduce
job to evaluate this query. Can you check the query plan (results of "EXPLAIN") and see if
both "c.Symbol" and "c.catid" appear in partitioning columns? Or, you can also attach your
query plan. 

This bug have been fixed in trunk. Can you test your query in trunk and see if the result
is correct. If you are using hive 0.11, you can also apply the 0.11 patch (https://issues.apache.org/jira/secure/attachment/12601446/HIVE-5149-branch-0.11.patch).


Thanks,

Yin



On Mon, Sep 16, 2013 at 4:23 AM, Mikael Öhman <mikael_uman@yahoo.se> wrote:

Hello.
>
>This is basically the same question I posted on stackoverflow: http://stackoverflow.com/questions/18812390/hive-subquery-and-group-by/18818115?noredirect=1#18818115

>
>I know the query is a bit noisy. But this query also demonstrates the error:
>
>select a.symbol from (select symbol, ordertype from cat group by symbol, ordertype) a
group by a.symbol; 
>
>Now, this query may not make much sense but in my case, because I have 24 symbols, I expect
a result of 24 rows. But I get 48 rows back. A similar query:
>
>select a.Symbol,count(*) from (select c.Symbol,c.catid from cat as c group by c.Symbol,c.catid)
a group by a.Symbol;
>
>returns 864 rows, where I still expect 24 rows... If there are alternatives as to how
to write the original query in my SO post I would much appreciate hearing them. The examples
given in this mail have just been provided to demonstrate the problem using easier to understand
queries and I don't need advice on them. 
>
>The .csv data and example is from a toy example. My real setup is 6 nodes, and the table
definition is:
>
>create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int, INSTRUMENTTYPE
int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE
int, SIZE int, SRCORDID string, SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL
string, REPID int) row format delimited fields terminated by ',' stored as ORC;
>set hive.exec.dynamic.partition=true;
>set hive.exec.dynamic.partition.mode=nonstrict;
>set hive.exec.max.dynamic.partitions.pernode=1000;
>
>insert...
>
>Thank you so much for any input. 
>
>/Sincerely Mikael
Mime
View raw message