hive-user mailing list archives

From Adam Laiacano <a...@tumblr.com>
Subject Hive inconsistently interpreting 'where' and 'group by'
Date Wed, 30 May 2012 20:59:10 GMT
Hi all,

I have an activity log stored in an external Hive table, LZO-compressed, and partitioned by
'dt', the date the data was recorded. Because of time zones and when we dump the data into
HDFS, each partition holds about 22 hours of one day and 2 hours of the following day.
In the example below, about 95% of the rows fall on the earlier day.

The first column is a timestamp ("ts", bigint). I want to count the number of actions on each
day within a single partition with this query:

    SELECT TO_DATE(FROM_UNIXTIME(ts)) AS day, count(*) FROM
        activity_log
        WHERE
        dt = "2012-05-29"
        GROUP BY TO_DATE(FROM_UNIXTIME(ts))
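For what it's worth, here is a small Python sketch of what I understand `TO_DATE(FROM_UNIXTIME(ts))` to be doing per row (note that Hive's `FROM_UNIXTIME` renders the epoch in the server's local time zone, so the zone is made an explicit parameter here to keep that dependency visible; the function name is mine, not Hive's):

```python
from datetime import datetime, timezone

def to_date_from_unixtime(ts, tz=timezone.utc):
    """Mimic TO_DATE(FROM_UNIXTIME(ts)) for one row.

    Hive's FROM_UNIXTIME uses the server's local time zone; pass tz
    explicitly so the zone dependence is obvious. Corrupt rows whose
    ts failed to parse come through as None, matching Hive's NULL.
    """
    if ts is None:
        return None
    return datetime.fromtimestamp(ts, tz).date().isoformat()

print(to_date_from_unixtime(1338267600))  # 2012-05-29 (05:00 UTC)
print(to_date_from_unixtime(1338354000))  # 2012-05-30 (05:00 UTC)
```

The point is that each row's bucket depends only on its own `ts` value and the zone, so the same partition should always group the same way.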
    
    
The correct results are (I confirmed these by decompressing the files locally and counting
with a simple python script):

    NULL 201
    2012-05-29 80677204
    2012-05-30 3826101
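The verification script was essentially the following (a sketch: it assumes tab-separated rows with the epoch timestamp in the first column, and UTC for the conversion; Hive's default field delimiter is actually \x01, so adjust `SEP` to whatever the table's serde uses):

```python
from collections import Counter
from datetime import datetime, timezone

SEP = "\t"  # assumed delimiter; Hive's default is "\x01"

def count_by_day(lines, sep=SEP):
    """Count rows per calendar day, keyed by the first-column timestamp.

    Rows whose first field is not an integer go into the None bucket,
    matching the NULL row Hive reports for corrupt records.
    """
    counts = Counter()
    for line in lines:
        first = line.rstrip("\n").split(sep, 1)[0]
        try:
            day = datetime.fromtimestamp(
                int(first), timezone.utc).date().isoformat()
        except ValueError:
            day = None
        counts[day] += 1
    return counts

# usage: lzop -dc part-* | python count_days.py  (feeding sys.stdin in)
```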
    
The NULL is caused by a relatively tiny number of corrupt rows. I sometimes get this with
the exact same query:

    NULL 201
    2012-05-29 84503305
    
It seems to lump all of the non-null timestamps into 2012-05-29. I'd say it returns
the wrong result about 1 time in 5.

I've tried other things like 

   SELECT COUNT(*) FROM activity_log
   WHERE dt = "2012-05-29" AND ts < 1338354000 AND ts > 1338267600

but it seems to have the same problem with interpreting the ts value consistently.
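For reference, those two literals bound exactly 24 hours; converted to UTC they are 05:00 on consecutive days, i.e. midnight-to-midnight at UTC-5, which can be checked with:

```python
from datetime import datetime, timezone

# The WHERE-clause bounds, rendered in UTC. They span exactly one
# 24-hour window (midnight to midnight at UTC-5).
for ts in (1338267600, 1338354000):
    print(ts, datetime.fromtimestamp(ts, timezone.utc).isoformat())
# 1338267600 2012-05-29T05:00:00+00:00
# 1338354000 2012-05-30T05:00:00+00:00
```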

Any tips or explanations would be greatly appreciated. I'm not sure whether the corrupt rows
have something to do with it, but even so I don't see why it would only happen sometimes.

Thanks,
Adam

