hive-dev mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5304) Hive results can depend on metastore's underlying datastore
Date Mon, 11 Nov 2013 19:55:17 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819296#comment-13819296 ]

Sergey Shelukhin commented on HIVE-5304:
----------------------------------------

[~ashutoshc] do you think we should document this for people who create databases?

> Hive results can depend on metastore's underlying datastore
> -----------------------------------------------------------
>
>                 Key: HIVE-5304
>                 URL: https://issues.apache.org/jira/browse/HIVE-5304
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>            Reporter: Sergey Shelukhin
>
> [removed old description]
> Hive JDOQL filter pushdown and direct SQL may end up pushing StringCol op 'SomeString'
to the underlying SQL datastore. However, the datastore may handle these comparisons differently
based on the encoding and collation used for the columns of the database.
> So, query results can change depending on the underlying store for the metastore and
its version.
> The drop_partitions_filter.q test illustrates this problem. Under byte-order collation (the
proper way), USA sorts before Uganda, but some collations order them the other way, causing
the test to fail.
> I am assuming that byte-order sort is the correct way to order things.
> Our MySQL script specifies _bin collation, which is byte-order; Postgres 9.1 and after,
as far as I see, defaults to "C" collation, which is also byte-order.
> Derby seems to use byte-order by default, though I didn't spend a lot of time on Derby.
> However, Postgres before 9.1 seems to default to "en_US.UTF8" and there's no way to change
column collation in our script if the database is already created.
> MySQL by default doesn't use _bin collation (on my machine), so if the database is auto-created,
the order of things is going to change.
> I didn't investigate MSSQL or Oracle.
> For now it seems that:
> 1) Auto-create shouldn't be used.
> 2) If an old version of Postgres (<9.1) is used, the collation should be set properly
by whoever issues "create database" (that is not our script).
> 3) We might want to add 'collate "C"' to varchar columns in the postgres script to ensure
the correct collation; however, this will break the script for postgres <9.1.
> 4) MSSQL and Oracle might warrant investigation.
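
The USA/Uganda case from the description can be reproduced with a minimal sketch. It is not the metastore code: it uses Python's bundled SQLite as a stand-in datastore, with SQLite's built-in BINARY collation playing the role of byte-order and NOCASE playing the role of a collation that ignores case. The table name `parts`, the column `country`, and both helper functions are hypothetical names chosen for illustration.

```python
import sqlite3

# Only SQLite's built-in collations are allowed here, since the name is
# interpolated into the DDL statement.
_ALLOWED = {"BINARY", "NOCASE"}


def _make_table(collation):
    """Create an in-memory table whose string column uses the given collation."""
    if collation not in _ALLOWED:
        raise ValueError(f"unsupported collation: {collation}")
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE parts (country TEXT COLLATE {collation})")
    conn.executemany("INSERT INTO parts VALUES (?)", [("USA",), ("Uganda",)])
    return conn


def sorted_partitions(collation):
    """Return all values ordered by the column's collation."""
    conn = _make_table(collation)
    rows = conn.execute("SELECT country FROM parts ORDER BY country").fetchall()
    conn.close()
    return [r[0] for r in rows]


def partitions_below(bound, collation):
    """Return values that compare < bound, as a pushed-down filter would."""
    conn = _make_table(collation)
    rows = conn.execute(
        "SELECT country FROM parts WHERE country < ? ORDER BY country", (bound,)
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]


if __name__ == "__main__":
    for collation in ("BINARY", "NOCASE"):
        print(collation, sorted_partitions(collation),
              partitions_below("Uganda", collation))
```

Under BINARY, 'S' (0x53) sorts before 'g' (0x67), so USA precedes Uganda and the filter `country < 'Uganda'` matches USA. Under NOCASE the comparison is effectively "usa" versus "uganda", so the order flips and the same filter matches nothing. The same predicate therefore selects different partitions depending on the column collation.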



--
This message was sent by Atlassian JIRA
(v6.1#6144)
