hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Main (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-7578) Sorting and max() against numeric partition column does not work
Date Thu, 31 Jul 2014 21:51:40 GMT

     [ https://issues.apache.org/jira/browse/HIVE-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

David Main updated HIVE-7578:
-----------------------------

    Description: 
Hive does not return the correct results when you run "max()" on a table that has been partitioned
on a numeric column when more than 10 partitions are present and they are numbered 1 - 10.
 The same thing happens if the numbers are in the range of 1 - 100.  It appears as if Hive
is using a string-based sort algorithm instead of honoring the numeric data type and applying
a true numeric sort.

The steps to recreate this issue are included below.

First, put the file below into HDFS at '/tmp/hive_bug/people_flat/people.dat'.  
(Leave lines with leading dashes out)

----- File contents -----
1,David
2,Jeff
3,Cindy
4,Prakash
5,Kate
6,Chung
7,Ginny
8,Huy
9,Brett
10,Jennifer
11,Dan
12,Shivani
13,Nate
----- EOF -----

Now run these statements in either Hive or beeline:

drop table if exists people_flat;

create external table people_flat (
    pnum int, 
    pname string
)
row format delimited fields terminated by ','
location '/tmp/hive_bug/people_flat';

select * from people_flat order by plum;
-- order of records should be correct

select max(pnum) from people_flat;
-- 13 (this is correct)

create table people_partitioned (
    pname string
)
partitioned by (pnum int)
row format delimited fields terminated by ',';

SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET hive.exec.dynamic.partition.mode=nonstrict;

insert into table people_partitioned
partition (pnum)
select pname, 
    pnum
from people_flat;

select max(pnum) from people_partitioned;
-- result 9 is wrong!

select * from people_partitioned order by pnum;
-- records not sorted correctly in numeric order

-- Although this is not a solution, the next query yield correct results

select max(pnum_int)
from (
    select cast(pnum as int) pnum_int
    from people_partitioned
) a;



  was:
Hive does not return the correct results when you run "max()" on a table that has been partitioned
on a numeric column when more than 10 partitions are present and they are numbered 1 - 10.
 The same thing happens if the numbers are in the range of 1 - 100.

The steps to recreate this issue are included below.

First, put the file below into HDFS at '/tmp/hive_bug/people_flat/people.dat'.  
(Leave lines with leading dashes out)

----- File contents -----
1,David
2,Jeff
3,Cindy
4,Prakash
5,Kate
6,Chung
7,Ginny
8,Huy
9,Brett
10,Jennifer
11,Dan
12,Shivani
13,Nate
----- EOF -----

Now run these statements in either Hive or beeline:

drop table if exists people_flat;

create external table people_flat (
    pnum int, 
    pname string
)
row format delimited fields terminated by ','
location '/tmp/hive_bug/people_flat';

select * from people_flat order by plum;
-- order of records should be correct

select max(pnum) from people_flat;
-- 13 (this is correct)

create table people_partitioned (
    pname string
)
partitioned by (pnum int)
row format delimited fields terminated by ',';

SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.dynamic.partitions.pernode=10000;
SET hive.exec.dynamic.partition.mode=nonstrict;

insert into table people_partitioned
partition (pnum)
select pname, 
    pnum
from people_flat;

select max(pnum) from people_partitioned;
-- result 9 is wrong!

select * from people_partitioned order by pnum;
-- records not sorted correctly in numeric order

-- Although this is not a solution, the next query yield correct results

select max(pnum_int)
from (
    select cast(pnum as int) pnum_int
    from people_partitioned
) a;




> Sorting and max() against numeric partition column does not work
> ----------------------------------------------------------------
>
>                 Key: HIVE-7578
>                 URL: https://issues.apache.org/jira/browse/HIVE-7578
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.10.0, 0.12.0
>         Environment: Tested in a linux environment
>            Reporter: David Main
>
> Hive does not return the correct results when you run "max()" on a table that has been
partitioned on a numeric column when more than 10 partitions are present and they are numbered
1 - 10.  The same thing happens if the numbers are in the range of 1 - 100.  It appears as
if Hive is using a string-based sort algorithm instead of honoring the numeric data type and
applying a true numeric sort.
> The steps to recreate this issue are included below.
> First, put the file below into HDFS at '/tmp/hive_bug/people_flat/people.dat'.  
> (Leave lines with leading dashes out)
> ----- File contents -----
> 1,David
> 2,Jeff
> 3,Cindy
> 4,Prakash
> 5,Kate
> 6,Chung
> 7,Ginny
> 8,Huy
> 9,Brett
> 10,Jennifer
> 11,Dan
> 12,Shivani
> 13,Nate
> ----- EOF -----
> Now run these statements in either Hive or beeline:
> drop table if exists people_flat;
> create external table people_flat (
>     pnum int, 
>     pname string
> )
> row format delimited fields terminated by ','
> location '/tmp/hive_bug/people_flat';
> select * from people_flat order by plum;
> -- order of records should be correct
> select max(pnum) from people_flat;
> -- 13 (this is correct)
> create table people_partitioned (
>     pname string
> )
> partitioned by (pnum int)
> row format delimited fields terminated by ',';
> SET hive.exec.max.dynamic.partitions=10000;
> SET hive.exec.max.dynamic.partitions.pernode=10000;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> insert into table people_partitioned
> partition (pnum)
> select pname, 
>     pnum
> from people_flat;
> select max(pnum) from people_partitioned;
> -- result 9 is wrong!
> select * from people_partitioned order by pnum;
> -- records not sorted correctly in numeric order
> -- Although this is not a solution, the next query yield correct results
> select max(pnum_int)
> from (
>     select cast(pnum as int) pnum_int
>     from people_partitioned
> ) a;



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message