hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sushanth Sowmyan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-11039) Write a tool to allow people with datanucelus.identifierFactory=datanucleus2 to migrate their metastore to datanucleus1 naming
Date Wed, 17 Jun 2015 23:37:01 GMT
Sushanth Sowmyan created HIVE-11039:
---------------------------------------

             Summary: Write a tool to allow people with datanucelus.identifierFactory=datanucleus2
to migrate their metastore to datanucleus1 naming
                 Key: HIVE-11039
                 URL: https://issues.apache.org/jira/browse/HIVE-11039
             Project: Hive
          Issue Type: Bug
          Components: Metastore
    Affects Versions: 1.3.0, 1.2.1, 2.0.0
            Reporter: Sushanth Sowmyan
            Assignee: Sushanth Sowmyan
            Priority: Critical


We hit an interesting bug in a case where datanucleus.identifierFactory = datanucleus2 .

The problem is that directSql handgenerates SQL strings assuming "datanucleus1" naming scheme.
If a user has their metastore JDO managed by datanucleus.identifierFactory = datanucleus2
, the SQL strings we generate are incorrect.

One simple example of what this results in is the following: whenever DN persists a field
which is held as a List<T>, it winds up storing each T as a separate line in the appropriate
mapping table, and has a column called INTEGER_IDX, which holds the position in the list.
Then, upon reading, it automatically reads all relevant lines with an ORDER BY INTEGER_IDX,
which results in the list retaining its order. In DN2 naming scheme, the column is called
IDX, instead of INTEGER_IDX. If the user has run appropriate metatool upgrade scripts, it
is highly likely that they have both columns, INTEGER_IDX and IDX.

Whenever they use JDO, such as with all writes, it will then use the IDX field, and when they
do any sort of optimized reads, such as through directSQL, it will ORDER BY INTEGER_IDX.

An immediate danger is seen when we consider that the schema of a table is stored as a List<FieldSchema>
, and while IDX has 0,1,2,3,... , INTEGER_IDX will contain 0,0,0,0,... and thus, any attempt
to describe the table or fetch schema for the table can come up mixed up in the table's native
hashing order, rather than sorted by the index.

This can then result in schema ordering being different from the actual table. For eg:, if
a user has a (a:int,b:string,c:string), a describe on this may return (c:string, a:int, b:
string), and thus, queries which are inserting after selecting from another table can have
ClassCastExceptions when trying to insert data in the wong order - this is how we discovered
this bug. This problem, however, can be far worse, if there are no type problems - it is possible,
for eg., that if a,b&c were all strings, that that insert query would succeed but mix
up the order, which then results in user table data being mixed up. This has the potential
to be very bad.

We should write a tool to help convert metastores that use "datanucleus2" to "datanucleus1"(more
difficult, needs more one-time testing) or change directSql to support both(easier to code,
but increases test-coverage matrix significantly and we should really then be testing against
both schemes). But in the short term, we should disable directSql if we see that the identifierfactory
is "datanucleus2"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message