hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sushanth Sowmyan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-11023) Disable directSQL if datanucleus.identifierFactory = datanucleus2
Date Thu, 18 Jun 2015 21:29:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592540#comment-14592540
] 

Sushanth Sowmyan commented on HIVE-11023:
-----------------------------------------

[~xuefuz] : Yes, this will happen to all releases with directSql - which means anything past
0.12, I think. That said, the number of installations that override this parameter should
hopefully be minimal.

If people are using datanucleus2 as their identifierFactory version, then:
a) They should disable directSql (can be done by conf parameter, does not need this code fix
- the code fix simply automates that for current and future releases)
b) They should retain that identifierFactory - a mixed metastore with both is bad.
c) Once we have a way of migrating them, as with HIVE-11039 filed, we should migrate them
out of it.

This parameter should never have been a part of hive-site.xml, I think, since it's dangerous
if a user changes it. A "datanucleus1" installation changing the parameter to "datanucleus2"
or vice-versa can result in metadata corruption for us.


> Disable directSQL if datanucleus.identifierFactory = datanucleus2
> -----------------------------------------------------------------
>
>                 Key: HIVE-11023
>                 URL: https://issues.apache.org/jira/browse/HIVE-11023
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 1.3.0, 1.2.1, 2.0.0
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>            Priority: Critical
>             Fix For: 1.2.1
>
>         Attachments: HIVE-11023.patch
>
>
> We hit an interesting bug in a case where datanucleus.identifierFactory = datanucleus2
.
> The problem is that directSql handgenerates SQL strings assuming "datanucleus1" naming
scheme. If a user has their metastore JDO managed by datanucleus.identifierFactory = datanucleus2
, the SQL strings we generate are incorrect.
> One simple example of what this results in is the following: whenever DN persists a field
which is held as a List<T>, it winds up storing each T as a separate line in the appropriate
mapping table, and has a column called INTEGER_IDX, which holds the position in the list.
Then, upon reading, it automatically reads all relevant lines with an ORDER BY INTEGER_IDX,
which results in the list retaining its order. In DN2 naming scheme, the column is called
IDX, instead of INTEGER_IDX. If the user has run appropriate metatool upgrade scripts, it
is highly likely that they have both columns, INTEGER_IDX and IDX.
> Whenever they use JDO, such as with all writes, it will then use the IDX field, and when
they do any sort of optimized reads, such as through directSQL, it will ORDER BY INTEGER_IDX.
> An immediate danger is seen when we consider that the schema of a table is stored as
a List<FieldSchema> , and while IDX has 0,1,2,3,... , INTEGER_IDX will contain 0,0,0,0,...
and thus, any attempt to describe the table or fetch schema for the table can come up mixed
up in the table's native hashing order, rather than sorted by the index.
> This can then result in schema ordering being different from the actual table. For eg:,
if a user has a (a:int,b:string,c:string), a describe on this may return (c:string, a:int,
b: string), and thus, queries which are inserting after selecting from another table can have
ClassCastExceptions when trying to insert data in the wong order - this is how we discovered
this bug. This problem, however, can be far worse, if there are no type problems - it is possible,
for eg., that if a,b&c were all strings, that that insert query would succeed but mix
up the order, which then results in user table data being mixed up. This has the potential
to be very bad.
> We should write a tool to help convert metastores that use "datanucleus2" to "datanucleus1"(more
difficult, needs more one-time testing) or change directSql to support both(easier to code,
but increases test-coverage matrix significantly and we should really then be testing against
both schemes). But in the short term, we should disable directSql if we see that the identifierfactory
is "datanucleus2"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message