hive-dev mailing list archives

From "Szehon Ho (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (HIVE-5774) INSERT OVERWRITE DYNAMIC PARTITION on LARGE DATA
Date Fri, 20 Dec 2013 01:02:32 GMT

     [ https://issues.apache.org/jira/browse/HIVE-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Szehon Ho reassigned HIVE-5774:
-------------------------------

    Assignee: Szehon Ho

> INSERT OVERWRITE DYNAMIC PARTITION on LARGE DATA
> ------------------------------------------------
>
>                 Key: HIVE-5774
>                 URL: https://issues.apache.org/jira/browse/HIVE-5774
>             Project: Hive
>          Issue Type: Bug
>          Components: Database/Schema
>         Environment: debian 6.0.7
>            Reporter: Danny Teok
>            Assignee: Szehon Ho
>            Priority: Critical
>              Labels: dynamic, hive, insert, overwrite, partition
>
> After several rounds of forensic analysis, we are convinced that there is a bug when rebuilding
a table using dynamic partitioning over more than 30 days: the row counts do not match.
> In detail:
> Part A -- original_table
> 2013-01-01; 394,755 rows
> 2013-01-02; 424,448
> 2013-01-03; 427,201
> ...
> 2013-10-30; 3,234,472
> Part B -- copy_of_original_table_new
> 2013-01-01; 372,628 rows
> 2013-01-02; 400,553
> 2013-01-03; 403,495
> ...
> 2013-10-30; 2,865,877
> The same query is used to populate both original_table and copy_of_original_table_new.
When we rebuilt for 1 day, e.g. 2013-01-01, the row count of copy_of_original_table_new
matched original_table exactly.
> When we rebuilt for 7 days, the row counts matched exactly.
> When we rebuilt for 15 days, the row counts matched exactly.
> When we rebuilt for 303 days (10 months), none of the row counts matched.
> When we rebuilt for 35 days, 80% of the dates matched exactly. The other 20% were off by
hundreds to tens of thousands of rows (a variance of up to 3%).
> In other words, the more days that are specified in the WHERE dt BETWEEN dateStart AND
dateEnd, the more dates end up with a row count that does not match original_table.
> However, we then rebuilt each of the mismatched 20% statically, one date at a time.
The result was surprising -- each one matched the original_table row count!
> Apologies in advance if this is not technical enough, but I hope the message is clear.
We believe there is a bug. We are not sure how to check our Hive version, but our Hadoop
version is "Hadoop 2.0.0-cdh4.1.1".
> The INSERT OVERWRITE SQL is here -- http://pastebin.com/g1qxsUm2
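
The per-date comparison described in the report could be scripted roughly as follows. This is a minimal sketch in Python and is not part of the original report: the function name mismatched_partitions and the dictionary layout are assumptions made for illustration, and the counts are only the four dates the report quotes. In practice the two dictionaries would presumably come from a per-date row count query on each table.

```python
# Per-partition row counts copied from the report (only the dates it lists).
original = {
    "2013-01-01": 394_755,
    "2013-01-02": 424_448,
    "2013-01-03": 427_201,
    "2013-10-30": 3_234_472,
}
rebuilt = {
    "2013-01-01": 372_628,
    "2013-01-02": 400_553,
    "2013-01-03": 403_495,
    "2013-10-30": 2_865_877,
}


def mismatched_partitions(expected, actual):
    """Map each mismatched partition to (expected, actual, missing, fraction).

    Hypothetical helper: 'missing' is expected minus actual, and 'fraction'
    is missing divided by expected (None when expected is 0).
    """
    report = {}
    for dt in sorted(expected):
        want = expected[dt]
        got = actual.get(dt, 0)  # a partition absent from the copy counts as 0 rows
        if want != got:
            missing = want - got
            fraction = missing / want if want else None
            report[dt] = (want, got, missing, fraction)
    return report


if __name__ == "__main__":
    for dt, (want, got, missing, fraction) in mismatched_partitions(original, rebuilt).items():
        share = f"{fraction:.1%}" if fraction is not None else "n/a"
        print(f"{dt}: expected {want:,}, got {got:,}, missing {missing:,} ({share})")
```

Running the same check after a 1-day, 7-day, 15-day and 35-day rebuild would give the fraction of matching dates for each range directly, rather than relying on a manual comparison.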



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
