hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexandre Fouché (JIRA) <j...@apache.org>
Subject [jira] [Updated] (HIVE-3699) Multiple insert overwrite into multiple tables query stores same results in all tables
Date Sat, 10 Nov 2012 09:47:12 GMT

     [ https://issues.apache.org/jira/browse/HIVE-3699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Alexandre Fouché updated HIVE-3699:
-----------------------------------

    Description: 
(Note: This might be related to HIVE-2750)

I am doing a query with multiple INSERT OVERWRITE to multiple tables in order to scan the
dataset only 1 time, and i end up having all these tables with the same content ! It seems
the GROUP BY query that returns results is overwriting all the temp tables.

Weird enough, if i had further GROUP BY queries into additional temp tables, grouped by a
different field, then all temp tables, even the ones that would have been wrong content are
all correctly populated.

This is the misbehaving query:

    FROM nikon
    INSERT OVERWRITE TABLE e1
    SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Impressions
    WHERE qs_cs_s_cat='PRINT' GROUP BY qs_cs_s_aid
    INSERT OVERWRITE TABLE e2
    SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Vues
    WHERE qs_cs_s_cat='VIEW' GROUP BY qs_cs_s_aid
    ;

It launches only one MR job and here are the results. Why does table 'e1' contains results
from table 'e2' ?! Table 'e1' should have been empty (see individual SELECTs further below)

    hive> SELECT * from e1;
    OK
    NULL    2
    1627575 25
    1627576 70
    1690950 22
    1690952 42
    1696705 199
    1696706 66
    1696730 229
    1696759 85
    1696893 218
    Time taken: 0.229 seconds

    hive> SELECT * from e2;
    OK
    NULL    2
    1627575 25
    1627576 70
    1690950 22
    1690952 42
    1696705 199
    1696706 66
    1696730 229
    1696759 85
    1696893 218
    Time taken: 0.11 seconds


Here is are the result to the indiviual queries (only the second query returns a result set):

    hive> SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Impressions FROM nikon
    WHERE qs_cs_s_cat='PRINT' GROUP BY qs_cs_s_aid;
    (...)
    OK
          <- There are no results, this is normal
    Time taken: 41.471 seconds

    hive> SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Vues FROM nikon
    WHERE qs_cs_s_cat='VIEW' GROUP BY qs_cs_s_aid;
    (...)
    OK
    NULL  2
    1627575 25
    1627576 70
    1690950 22
    1690952 42
    1696705 199
    1696706 66
    1696730 229
    1696759 85
    1696893 218
    Time taken: 39.607 seconds
    

  was:
I am doing a query with multiple INSERT OVERWRITE to multiple tables in order to scan the
dataset only 1 time, and i end up having all these tables with the same content ! It seems
the GROUP BY query that returns results is overwriting all the temp tables.

Weird enough, if i had further GROUP BY queries into additional temp tables, grouped by a
different field, then all temp tables, even the ones that would have been wrong content are
all correctly populated.

This is the misbehaving query:

    FROM nikon
    INSERT OVERWRITE TABLE e1
    SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Impressions
    WHERE qs_cs_s_cat='PRINT' GROUP BY qs_cs_s_aid
    INSERT OVERWRITE TABLE e2
    SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Vues
    WHERE qs_cs_s_cat='VIEW' GROUP BY qs_cs_s_aid
    ;

It launches only one MR job and here are the results. Why does table 'e1' contains results
from table 'e2' ?! Table 'e1' should have been empty (see individual SELECTs further below)

    hive> SELECT * from e1;
    OK
    NULL    2
    1627575 25
    1627576 70
    1690950 22
    1690952 42
    1696705 199
    1696706 66
    1696730 229
    1696759 85
    1696893 218
    Time taken: 0.229 seconds

    hive> SELECT * from e2;
    OK
    NULL    2
    1627575 25
    1627576 70
    1690950 22
    1690952 42
    1696705 199
    1696706 66
    1696730 229
    1696759 85
    1696893 218
    Time taken: 0.11 seconds


Here is are the result to the indiviual queries (only the second query returns a result set):

    hive> SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Impressions FROM nikon
    WHERE qs_cs_s_cat='PRINT' GROUP BY qs_cs_s_aid;
    (...)
    OK
          <- There are no results, this is normal
    Time taken: 41.471 seconds

    hive> SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Vues FROM nikon
    WHERE qs_cs_s_cat='VIEW' GROUP BY qs_cs_s_aid;
    (...)
    OK
    NULL  2
    1627575 25
    1627576 70
    1690950 22
    1690952 42
    1696705 199
    1696706 66
    1696730 229
    1696759 85
    1696893 218
    Time taken: 39.607 seconds
    

    
> Multiple insert overwrite into multiple tables query stores same results in all tables
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-3699
>                 URL: https://issues.apache.org/jira/browse/HIVE-3699
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.9.0
>         Environment: Cloudera 4.1 on Amazon Linux (rebranded Centos 6): hive-0.9.0+150-1.cdh4.1.1.p0.4.el6.noarch
>            Reporter: Alexandre Fouché
>
> (Note: This might be related to HIVE-2750)
> I am doing a query with multiple INSERT OVERWRITE to multiple tables in order to scan
the dataset only 1 time, and i end up having all these tables with the same content ! It seems
the GROUP BY query that returns results is overwriting all the temp tables.
> Weird enough, if i had further GROUP BY queries into additional temp tables, grouped
by a different field, then all temp tables, even the ones that would have been wrong content
are all correctly populated.
> This is the misbehaving query:
>     FROM nikon
>     INSERT OVERWRITE TABLE e1
>     SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Impressions
>     WHERE qs_cs_s_cat='PRINT' GROUP BY qs_cs_s_aid
>     INSERT OVERWRITE TABLE e2
>     SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Vues
>     WHERE qs_cs_s_cat='VIEW' GROUP BY qs_cs_s_aid
>     ;
> It launches only one MR job and here are the results. Why does table 'e1' contains results
from table 'e2' ?! Table 'e1' should have been empty (see individual SELECTs further below)
>     hive> SELECT * from e1;
>     OK
>     NULL    2
>     1627575 25
>     1627576 70
>     1690950 22
>     1690952 42
>     1696705 199
>     1696706 66
>     1696730 229
>     1696759 85
>     1696893 218
>     Time taken: 0.229 seconds
>     hive> SELECT * from e2;
>     OK
>     NULL    2
>     1627575 25
>     1627576 70
>     1690950 22
>     1690952 42
>     1696705 199
>     1696706 66
>     1696730 229
>     1696759 85
>     1696893 218
>     Time taken: 0.11 seconds
> Here is are the result to the indiviual queries (only the second query returns a result
set):
>     hive> SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Impressions FROM nikon
>     WHERE qs_cs_s_cat='PRINT' GROUP BY qs_cs_s_aid;
>     (...)
>     OK
>           <- There are no results, this is normal
>     Time taken: 41.471 seconds
>     hive> SELECT qs_cs_s_aid AS Emplacements, COUNT(*) AS Vues FROM nikon
>     WHERE qs_cs_s_cat='VIEW' GROUP BY qs_cs_s_aid;
>     (...)
>     OK
>     NULL  2
>     1627575 25
>     1627576 70
>     1690950 22
>     1690952 42
>     1696705 199
>     1696706 66
>     1696730 229
>     1696759 85
>     1696893 218
>     Time taken: 39.607 seconds
>     

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message